Skip to content

Design a Typeahead / Autocomplete

You type “new yo” into a search box and, before your next keystroke, a list appears: new york, new york times, new york weather. That list arrived in under ~50 ms, was chosen from billions of possible queries, and is ranked so the thing most people want sits on top. Typeahead is a deceptively deep system: it’s a read-heavy, latency-obsessed ranking problem where being a little wrong but fast beats being perfect and slow. This page builds it from the data structure up.

Functional

  • Given a prefix, return the top k (usually 5–10) completions.
  • Rank by popularity (and ideally personalization/recency), not alphabetically.
  • Update suggestions as the corpus shifts (trending terms, new entities).
  • Tolerate typos lightly is a nice-to-have; exact-prefix is the core.

Non-functional

  • Latency is the product. End-to-end p99 under ~100 ms; the serving tier under ~10 ms.
  • Massively read-heavy. Reads dwarf writes by orders of magnitude.
  • High availability — a search box that hangs feels broken even if search itself works.
  • Eventual freshness is fine. A term trending 10 minutes ago appearing now is acceptable.

Assume 10M searches/day with ~20 keystrokes each that fire a suggestion lookup → 200M lookups/day2,300/sec average, with peaks 10×. Because reads dominate and the answer for “new yo” is identical for millions of users, this is an ideal caching workload — a small set of hot prefixes serves most traffic.

Corpus: ~100M distinct historical queries. A trie over them with shared prefixes fits comfortably in the tens of GB — small enough to hold in memory, which is the whole game for sub-10 ms serving.

GET /v1/suggest?q=new+yo&limit=10&lang=en
→ 200
{
"prefix": "new yo",
"suggestions": [
{ "text": "new york", "score": 0.91 },
{ "text": "new york times", "score": 0.74 },
{ "text": "new york weather","score": 0.55 }
]
}

One tiny GET, called on nearly every keystroke. It must be cacheable end-to-end (CDN-friendly, Cache-Control short-lived) because the same prefix is asked millions of times.

Two representations, built offline and queried online:

Option A — TRIE (prefix tree), top-k cached at each node
root
└─ n ─ e ─ w ─ (space) ─ y ─ o
│ └─ [new york, new york times, ...] ← precomputed top-k
Each node stores the best k completions of the prefix ending there.
Option B — PREFIX INDEX (flat key→value)
"new yo" → [new york, new york times, ...]
"new yor"→ [new york, ...]
A hash/sorted store keyed by every prefix, value = precomputed top-k list.

The aggregation pipeline (query logs → counts → ranking) runs offline and writes the precomputed top-k. Serving never sorts at request time — it does a lookup and returns.

browser
│ debounced keystrokes (q="new yo")
CDN / edge cache ──hit──► returns top-k (most traffic stops here)
│ miss
Suggestion service (stateless) ──► in-memory hot-prefix cache ──hit──► return
│ miss
Serving store (trie / prefix index, in RAM, sharded by prefix)
│ periodic rebuild
Aggregation pipeline: query logs → count + decay → rank → top-k per prefix

Writes and reads are fully separated. The pipeline crunches logs every few minutes and ships a fresh, immutable index to the serving tier; the serving tier only ever reads. That separation is what lets reads be blindingly fast — they never contend with ranking computation.

What does this buy us, and what does it cost? Precomputing top-k per prefix turns every query into an O(1)-ish lookup — that’s the latency win. The cost is freshness lag (the index is as old as the last rebuild) and storage amplification (you store top-k for every prefix, not just whole queries). We pay memory and staleness to buy speed, which is exactly the trade typeahead wants.

Trie vs prefix index. A trie shares common prefixes, so it’s memory-efficient and naturally supports “extend the prefix by one char.” A flat prefix index is dead simple to shard and cache (it’s just key→value) but stores redundant lists. In practice large systems lean on the flat, precomputed, cached form because it maps perfectly onto a CDN/KV cache — and caching is where the real latency comes from. See Caching Strategies.

Caching hot prefixes. Prefix popularity follows a brutal power law: a few thousand prefixes (“a”, “f”, “new”, “you”) cover the vast majority of lookups. Cache those aggressively at the edge and in service memory, with a short TTL so trending terms still flow in.

Cache layers, fastest → slowest:
CDN edge (ms, geo-near user, covers head prefixes)
service memory (sub-ms, covers head + warm prefixes)
serving store (single-digit ms, covers the long tail)

Debouncing (client-side). Firing a request on every keystroke wastes the network and races responses out of order. The client debounces (~100–150 ms of quiet) and cancels stale in-flight requests, so a fast typer sends one lookup for “new york” instead of seven. This single client trick removes most of the load before it ever reaches your servers.

Ranking. Score = popularity with time decay (so last year’s hit fades), plus optional personalization and a freshness boost for trending terms. Crucially, ranking happens offline in the pipeline; the serving tier just returns the precomputed order.

Sharding & freshness. Shard the serving store by prefix range or hash so the corpus spreads across machines. Rebuild incrementally — a streaming layer can nudge counts for trending terms between full rebuilds so “breaking news” surfaces in minutes, not hours.

Decision Buys you Costs you
─────────────────────────────────────────────────────────────────────
Precompute top-k O(1) reads, no storage blow-up,
per prefix request-time sort freshness lag
Flat prefix index trivial sharding + redundant storage vs trie
CDN/KV caching
Trie memory sharing, harder to shard/cache
incremental prefix
Edge caching head most traffic never stale head if TTL too long
prefixes hits your servers
Client debouncing ~5–7× fewer requests tiny added input latency
Offline ranking fast serving tier minutes-old rankings

The throughline: typeahead is read-path engineering. You move all the hard work (ranking, aggregation) offline and to write time, leaving the read path as little more than a cached lookup — because the only thing the user can feel is how fast that lookup returns.

  1. Why can typeahead tolerate stale and slightly-wrong results but not slow ones, and how does that shape every design choice?
  2. Contrast a trie with a flat prefix index — why do large systems often prefer the flat form despite its storage redundancy?
  3. What is client-side debouncing and roughly how much load does it remove before requests reach the server?
  4. Why are single-character prefixes both the hottest and the safest to cache with a long TTL?
  5. Where does ranking happen, and why is doing it offline essential to sub-10 ms serving?