Tail Latency & p99
A team reports “our average latency is 40ms.” Everyone nods. The average is fine. Meanwhile, a meaningful slice of users are waiting 800ms and quietly leaving. The average didn’t lie about the arithmetic — it lied about the experience. Averages hide the tail, and the tail is where pain lives. This page is about reading latency honestly, and about the brutal way fan-out amplifies rare slowness into common slowness.
Why the mean is the wrong number
Latency distributions are not bell curves. They are heavily right-skewed: a dense cluster of fast responses and a long tail of slow ones caused by GC pauses, lock contention, cache misses, retries, or a node that’s quietly degrading. The mean gets dragged around by that tail without describing anyone’s actual experience.
    count
      │ ██
      │ ████
      │ ██████
      │ ████████
      │ ██████████▄▄▂▁▁______________   ← long tail
      └──────────────────────────────────── latency
                        ↑
          mean is somewhere in here, describing no one

The honest tool is the percentile. p99 = 250ms means “99% of requests finished in 250ms or less; 1% took longer.” Percentiles describe the worst case a given fraction of users actually hit:
| Percentile | Reads as | Who it represents |
|---|---|---|
| p50 (median) | typical request | the middle user |
| p95 | mild bad day | 1 in 20 requests |
| p99 | the tail | 1 in 100 — your most active users |
| p999 | the extreme tail | 1 in 1000 — but they may be your best customers |
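To make the gap between the mean and the percentiles concrete, here is a minimal sketch. The sample is synthetic (the counts and latencies are made up for illustration), and `percentile` is a hypothetical helper using the nearest-rank definition, not a library function.

```python
import math
import statistics

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

# Synthetic, made-up sample: mostly fast requests plus a long right tail.
latencies_ms = [20] * 900 + [60] * 60 + [250] * 30 + [800] * 10

mean_ms = statistics.mean(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
p999 = percentile(latencies_ms, 99.9)

print(f"mean={mean_ms:.1f}ms p50={p50}ms p95={p95}ms p99={p99}ms p999={p999}ms")
```

In this sample the mean sits near 37ms, a latency that no request actually experienced: the typical request took 20ms, while the p99 and p999 requests took 250ms and 800ms.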
Fan-out: where the tail becomes the norm
Here is the result that surprises everyone. Suppose a request fans out to many backend services (a search query hitting 100 shards, a page assembled from 50 microservices) and must wait for the slowest one. Even if each backend has a great p99, the combined p99 is terrible.
If each service independently responds slowly 1% of the time, then a request touching 100 of them
waits on at least one slow backend with probability 1 − 0.99¹⁰⁰ ≈ 63%. A 1-in-100 event at the leaf
becomes a worse-than-coin-flip event at the root. Your rare tail at the component level is the
common case at the user level.
            ┌─ shard 1   (fast)
    request ┼─ shard 2   (fast)       latency = MAX of all branches
     (root) ┼─ ...                    one slow branch ⇒ whole request slow
            └─ shard 100 (SLOW ←)

This is why tail latency is the central problem of large distributed systems: scaling out adds backends, and every added backend is another lottery ticket for hitting the tail. Reducing average latency does nothing here; you must attack the tail directly.
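The fan-out arithmetic can be checked in a few lines. This sketch assumes each backend is slow independently with the same probability; `p_any_slow` is a hypothetical helper name, and the fan-out values in the loop are arbitrary.

```python
def p_any_slow(fan_out: int, p_slow: float = 0.01) -> float:
    """Probability that at least one of `fan_out` independent backends is slow."""
    return 1.0 - (1.0 - p_slow) ** fan_out

# How the chance of waiting on a slow backend grows with fan-out.
for n in (1, 10, 50, 100, 500):
    print(f"fan-out {n:>3}: {p_any_slow(n):.1%} of requests wait on a slow backend")
```

At a fan-out of 100 this reproduces the roughly 63% figure. Real backends are often correlated (shared hosts, shared GC schedules), so treat the result as a rough estimate rather than an exact prediction.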
Mitigations: fighting the tail
Hedged requests
Send the request to one replica. If it hasn’t answered by the p95 mark, send a second copy to a different replica and take whichever returns first; cancel the loser. Because slowness is usually a property of a transient state (a node mid-GC, a cold cache) rather than the request itself, the backup almost always wins the race. The cost is modest: hedging at p95 sends a second copy for only about 5% of requests (you hedge the slow tail, not every request), in exchange for pulling p99 toward p50. It is among the highest-leverage tail techniques available.
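A minimal sketch of the hedging pattern described above, using `asyncio`. The names `hedged`, `call`, `fake_replica`, and the replica labels are hypothetical; `hedge_after_s` stands in for the measured p95. Error handling and retries are deliberately left out.

```python
import asyncio

async def hedged(call, replicas, hedge_after_s):
    """Call replicas[0]; if it has not answered within hedge_after_s, also call replicas[1].

    Returns whichever result arrives first and cancels the loser.
    `call` is any coroutine function that takes a replica (an assumption of this sketch).
    """
    primary = asyncio.create_task(call(replicas[0]))
    done, _ = await asyncio.wait({primary}, timeout=hedge_after_s)
    if done:
        return primary.result()  # fast path: no hedge was sent

    backup = asyncio.create_task(call(replicas[1]))
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # cancel the loser
    return done.pop().result()

async def fake_replica(name):
    # Hypothetical backend: replica "a" is stuck in a slow state, "b" is healthy.
    await asyncio.sleep(0.5 if name == "a" else 0.01)
    return name

async def demo():
    return await hedged(fake_replica, ["a", "b"], hedge_after_s=0.05)

winner = asyncio.run(demo())
print(winner)
```

Here the primary is slow, so the backup is sent after 50ms and wins. When the primary answers before the hedge delay, no second request is issued, which is what keeps the extra load small.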
Aggressive, tiered timeouts
A request that’s already past p999 is rarely worth waiting for; it’s more likely a doomed call holding a connection hostage. Set timeouts near your tail target, fail fast, and return a degraded-but-prompt response rather than a perfect-but-late one. The trade-off is occasional incomplete results; the alternative is one slow leaf hanging the whole page.
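One way to express "fail fast and degrade" is a deadline wrapper with a fallback value. This is a sketch under assumptions: `with_deadline`, `slow_recommendations`, and `render_page` are hypothetical names, and the 50ms timeout is an arbitrary stand-in for a tail target.

```python
import asyncio

async def with_deadline(coro, timeout_s, fallback):
    """Await `coro` for at most `timeout_s` seconds; on timeout, return `fallback` instead."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        return fallback  # wait_for has already cancelled the slow call

async def slow_recommendations():
    # Hypothetical leaf call that is stuck far past its tail target.
    await asyncio.sleep(1.0)
    return ["personalised", "items"]

async def render_page():
    recs = await with_deadline(slow_recommendations(), timeout_s=0.05, fallback=[])
    return {"title": "Home", "recommendations": recs}

page = asyncio.run(render_page())
print(page)
```

The page renders with an empty recommendations list instead of hanging for a full second on one slow leaf.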
Shed the tail’s causes
Many tail spikes are self-inflicted: synchronous GC, unbounded queues (see Backpressure & Flow Control), retry storms, and hot partitions (the celebrity problem) all manufacture tail latency. Often the best fix isn’t a clever race — it’s removing the source. Start by finding the bottleneck that’s generating the slow responses.
Return partial results
For fan-out reads, don’t wait for the slowest branch — answer with what you have. A search that returns 98 of 100 shards in 80ms beats one that returns all 100 in 800ms. Completeness becomes a tunable, not a hard requirement.
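A fan-out read with a deadline might be sketched as follows. `gather_partial` and `shard` are hypothetical names, the shard latencies are simulated, and the sketch silently drops branches that failed or missed the deadline; a real system would likely report how many were missing.

```python
import asyncio

async def gather_partial(coros, deadline_s):
    """Run all coroutines concurrently; return results that finished before the deadline.

    Also returns the number of branches that were abandoned.
    """
    tasks = [asyncio.create_task(c) for c in coros]
    done, pending = await asyncio.wait(tasks, timeout=deadline_s)
    for task in pending:
        task.cancel()  # stop waiting on the slow branches
    results = [t.result() for t in tasks if t in done and t.exception() is None]
    return results, len(pending)

async def shard(i):
    # Simulated shards: 98 and 99 are slow, the rest answer quickly.
    await asyncio.sleep(0.5 if i >= 98 else 0.01)
    return i

async def demo():
    return await gather_partial([shard(i) for i in range(100)], deadline_s=0.1)

hits, missing = asyncio.run(demo())
print(f"answered with {len(hits)} shards, {missing} abandoned")
```

The deadline is the tunable: raising it trades latency for completeness, and lowering it does the opposite.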
What does this buy us, and what does it cost?
Optimizing for the tail buys you a system that feels fast to your heaviest users — the ones whose request volume guarantees they sample your worst case. The cost is real: hedging spends extra capacity, tight timeouts surrender completeness, and partial results complicate correctness. You are trading some throughput and some exactness for predictability. For interactive systems, that trade is almost always right — users forgive “fast and slightly incomplete” far sooner than “complete and slow.”
Check your understanding
- Why is the average latency a misleading metric for a right-skewed distribution?
- A backend has a 1% chance of being slow. Why does a request fanning out to 100 of them hit the slow case ~63% of the time?
- Why is the p99 latency effectively the typical experience of a power user?
- Explain why hedged requests can collapse p99 toward p50 while adding only a few percent load.
- When would you deliberately return a partial result instead of waiting for every backend?