
Tail Latency & p99

A team reports “our average latency is 40ms.” Everyone nods. The average is fine. Meanwhile, a meaningful slice of users is waiting 800ms and quietly leaving. The average didn’t lie about the arithmetic; it lied about the experience. Averages hide the tail, and the tail is where pain lives. This page is about reading latency honestly, and about the brutal way fan-out amplifies rare slowness into common slowness.

Latency distributions are not bell curves. They are heavily right-skewed: a dense cluster of fast responses and a long tail of slow ones caused by GC pauses, lock contention, cache misses, retries, or a node that’s quietly degrading. The mean gets dragged around by that tail without describing anyone’s actual experience.

count
│ ██
│ ████
│ ██████
│ ████████
│ ██████████▄▄▂▁▁______________ ← long tail
└──────────────────────────────────── latency
              ↑ mean is somewhere in here, describing no one

The honest tool is the percentile. p99 = 250ms means “99% of requests finished in 250ms or less; 1% took longer.” Each percentile reports the worst latency that a given fraction of requests actually experienced:

| Percentile | Reads as | Who it represents |
|---|---|---|
| p50 (median) | typical request | the middle user |
| p95 | mild bad day | 1 in 20 requests |
| p99 | the tail | 1 in 100: your most active users |
| p999 | the extreme tail | 1 in 1000, but they may be your best customers |
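To make the gap between the mean and the percentiles concrete, here is a minimal sketch using the nearest-rank definition of a percentile. The `percentile` helper and the sample data are made up for illustration (96 fast requests and 4 slow ones, chosen so the mean lands on 40ms); real monitoring systems usually estimate percentiles from histograms instead of sorting raw samples.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that
    at least p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical data: 96 requests at 10 ms, 4 requests at 760 ms.
latencies_ms = [10] * 96 + [760] * 4

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms} p50={p50_ms} p95={p95_ms} p99={p99_ms}")
```

In this toy data set the mean is 40ms even though no request took anywhere near 40ms: the median and p95 sit at 10ms and the p99 at 760ms.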

Here is the result that surprises everyone. Suppose a request fans out to many backend services (a search query hitting 100 shards, a page assembled from 50 microservices) and must wait for the slowest one. Even if each backend has a great p99, the combined p99 is terrible.

If each service responds slowly 1% of the time, independently of the others, then a request touching 100 of them waits on at least one slow backend with probability 1 − 0.99¹⁰⁰ ≈ 63%. A 1-in-100 event at the leaf becomes the more-likely-than-not outcome at the root. Your rare tail at the component level is the common case at the user level.

        ┌─ shard 1   (fast)
request ┼─ shard 2   (fast)       latency = MAX of all branches
(root)  ┼─ ...                    one slow branch ⇒ whole request slow
        └─ shard 100 (SLOW ←)
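The fan-out arithmetic is easy to check for other fan-out widths. The sketch below assumes each backend is slow independently with the same probability; `p_any_slow` is a hypothetical helper name, and correlated slowness in a real system would change the numbers.

```python
def p_any_slow(fanout, p_slow=0.01):
    """Probability that at least one of `fanout` independent backends is slow,
    given that each one is slow with probability `p_slow`."""
    return 1 - (1 - p_slow) ** fanout

for n in (1, 10, 50, 100):
    print(f"{n:>3} backends: {p_any_slow(n):.1%} of requests wait on a slow one")
```

The growth is steep at first and then flattens as it approaches 100%, which is why the damage shows up long before a system reaches hundreds of backends.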

This is why tail latency is the central problem of large distributed systems: scaling out adds backends, and every added backend is another lottery ticket for hitting the tail. Lowering the average by making already-fast requests faster does nothing here; you must attack the tail directly.

Send the request to one replica. If it hasn’t answered by the p95 mark, send a second copy to a different replica and take whichever returns first; cancel the loser. Because slowness is usually a property of a transient state (a node mid-GC, a cold cache) rather than the request itself, the backup almost always wins the race. The cost: roughly 5% extra load, since only the slowest 5% of requests ever trigger a backup, in exchange for collapsing p99 toward p50. This is one of the highest-leverage tail techniques known.
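A hedged request can be sketched with `asyncio` as follows. This is an illustration under assumptions, not a production client: `hedged`, `call`, `replicas`, `hedge_after_s`, and `fake_call` are hypothetical names, error handling is deliberately thin, and the request is assumed to be safe to send twice.

```python
import asyncio

async def hedged(call, replicas, hedge_after_s):
    """Send to replicas[0]; if it has not answered within hedge_after_s,
    also send to replicas[1] and return whichever answer arrives first.

    `call(replica)` is a hypothetical coroutine function that performs one request.
    """
    primary = asyncio.create_task(call(replicas[0]))
    done, _ = await asyncio.wait({primary}, timeout=hedge_after_s)
    if done:
        return primary.result()  # fast path: no backup was ever sent

    backup = asyncio.create_task(call(replicas[1]))
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # cancel the loser
    await asyncio.gather(*pending, return_exceptions=True)
    # Simplification: if the first finisher raised, the exception propagates.
    return done.pop().result()

async def fake_call(replica):
    # Hypothetical backends: replica "a" is in a slow state, "b" is healthy.
    await asyncio.sleep(1.0 if replica == "a" else 0.01)
    return replica

winner = asyncio.run(hedged(fake_call, ["a", "b"], hedge_after_s=0.05))
print(winner)
```

In practice `hedge_after_s` would come from a measured p95 rather than a constant, so that the hedge rate stays near 5% as conditions change.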

A request that’s already past p999 is rarely worth waiting for; it’s more likely a doomed call holding a connection hostage. Set timeouts near your tail target, fail fast, and return a degraded-but-prompt response rather than a perfect-but-late one. The trade-off is occasional incomplete results; the alternative is one slow leaf hanging the whole page.
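One way to express "fail fast and degrade" is a deadline wrapper that swaps in a fallback value when the call runs long. The sketch below uses `asyncio.wait_for`; the names `with_deadline` and `slow_recommendations` and the fallback data are hypothetical, and the 50ms deadline is an arbitrary stand-in for a real tail target.

```python
import asyncio

async def with_deadline(coro, timeout_s, fallback):
    """Await `coro` for at most `timeout_s` seconds.
    On timeout the call is cancelled and `fallback` is returned instead."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        return fallback

async def slow_recommendations():
    # Hypothetical leaf call that is stuck far out in the tail.
    await asyncio.sleep(1.0)
    return ["personalized"]

async def main():
    return await with_deadline(
        slow_recommendations(), timeout_s=0.05, fallback=["popular"]
    )

result = asyncio.run(main())
print(result)
```

The page section backed by this call renders with generic content after 50ms instead of stalling for a full second.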

Many tail spikes are self-inflicted: synchronous GC, unbounded queues (see Backpressure & Flow Control), retry storms, and hot partitions (the celebrity problem) all manufacture tail latency. Often the best fix isn’t a clever race — it’s removing the source. Start by finding the bottleneck that’s generating the slow responses.

For fan-out reads, don’t wait for the slowest branch; answer with what you have. A search that returns results from 98 of 100 shards in 80ms beats one that returns all 100 in 800ms. Completeness becomes a tunable, not a hard requirement.
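A fan-out read with a deadline might look like the sketch below: query every shard, keep whatever has arrived when the deadline passes, and report how many branches are missing so the caller can decide whether the answer is complete enough. `gather_partial`, `query_shard`, and the choice of slow shards are hypothetical, and the sketch assumes at least one branch is passed in.

```python
import asyncio

async def gather_partial(coros, deadline_s):
    """Run all coroutines concurrently and return (results, missing_count),
    where results holds only the branches that finished before the deadline."""
    tasks = [asyncio.create_task(c) for c in coros]
    done, pending = await asyncio.wait(tasks, timeout=deadline_s)
    for task in pending:
        task.cancel()  # stop waiting on the stragglers
    await asyncio.gather(*pending, return_exceptions=True)
    # Keep results in the original branch order; skip branches that failed.
    results = [t.result() for t in tasks if t in done and t.exception() is None]
    return results, len(tasks) - len(results)

async def query_shard(shard_id):
    # Hypothetical shards: 7 and 42 are slow, the rest answer quickly.
    await asyncio.sleep(1.0 if shard_id in (7, 42) else 0.01)
    return shard_id

async def main():
    return await gather_partial(
        [query_shard(i) for i in range(100)], deadline_s=0.2
    )

hits, missing = asyncio.run(main())
print(f"{len(hits)} shards answered, {missing} missing")
```

Returning the missing count alongside the results is what turns completeness into a tunable: a caller can accept 98 of 100 for a search page and still insist on all branches where correctness demands it.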

What does this buy us, and what does it cost?


Optimizing for the tail buys you a system that feels fast to your heaviest users — the ones whose request volume guarantees they sample your worst case. The cost is real: hedging spends extra capacity, tight timeouts surrender completeness, and partial results complicate correctness. You are trading some throughput and some exactness for predictability. For interactive systems, that trade is almost always right — users forgive “fast and slightly incomplete” far sooner than “complete and slow.”

  1. Why is the average latency a misleading metric for a right-skewed distribution?
  2. A backend has a 1% chance of being slow. Why does a request fanning out to 100 of them hit the slow case ~63% of the time?
  3. Why is the p99 latency effectively the typical experience of a power user?
  4. Explain why hedged requests can collapse p99 toward p50 while adding only a few percent load.
  5. When would you deliberately return a partial result instead of waiting for every backend?