Design a Typeahead / Autocomplete
You type “new yo” into a search box and, before your next keystroke, a list appears: new york, new york times, new york weather. That list arrived in under ~50 ms, was chosen from billions of possible queries, and is ranked so the thing most people want sits on top. Typeahead is a deceptively deep system: it’s a read-heavy, latency-obsessed ranking problem where being a little wrong but fast beats being perfect and slow. This page builds it from the data structure up.
1. Requirements
Section titled “1. Requirements”Functional
- Given a prefix, return the top k (usually 5–10) completions.
- Rank by popularity (and ideally personalization/recency), not alphabetically.
- Update suggestions as the corpus shifts (trending terms, new entities).
- Tolerate typos lightly is a nice-to-have; exact-prefix is the core.
Non-functional
- Latency is the product. End-to-end p99 under ~100 ms; the serving tier under ~10 ms.
- Massively read-heavy. Reads dwarf writes by orders of magnitude.
- High availability — a search box that hangs feels broken even if search itself works.
- Eventual freshness is fine. A term trending 10 minutes ago appearing now is acceptable.
2. Estimation
Section titled “2. Estimation”Assume 10M searches/day with ~20 keystrokes each that fire a suggestion lookup → 200M lookups/day ≈ 2,300/sec average, with peaks 10×. Because reads dominate and the answer for “new yo” is identical for millions of users, this is an ideal caching workload — a small set of hot prefixes serves most traffic.
Corpus: ~100M distinct historical queries. A trie over them with shared prefixes fits comfortably in the tens of GB — small enough to hold in memory, which is the whole game for sub-10 ms serving.
3. API sketch
Section titled “3. API sketch”GET /v1/suggest?q=new+yo&limit=10&lang=en→ 200{ "prefix": "new yo", "suggestions": [ { "text": "new york", "score": 0.91 }, { "text": "new york times", "score": 0.74 }, { "text": "new york weather","score": 0.55 } ]}One tiny GET, called on nearly every keystroke. It must be cacheable end-to-end (CDN-friendly,
Cache-Control short-lived) because the same prefix is asked millions of times.
4. Data model
Section titled “4. Data model”Two representations, built offline and queried online:
Option A — TRIE (prefix tree), top-k cached at each node root └─ n ─ e ─ w ─ (space) ─ y ─ o │ └─ [new york, new york times, ...] ← precomputed top-k Each node stores the best k completions of the prefix ending there.
Option B — PREFIX INDEX (flat key→value) "new yo" → [new york, new york times, ...] "new yor"→ [new york, ...] A hash/sorted store keyed by every prefix, value = precomputed top-k list.The aggregation pipeline (query logs → counts → ranking) runs offline and writes the precomputed top-k. Serving never sorts at request time — it does a lookup and returns.
5. High-level design
Section titled “5. High-level design” browser │ debounced keystrokes (q="new yo") ▼ CDN / edge cache ──hit──► returns top-k (most traffic stops here) │ miss ▼ Suggestion service (stateless) ──► in-memory hot-prefix cache ──hit──► return │ miss ▼ Serving store (trie / prefix index, in RAM, sharded by prefix) ▲ │ periodic rebuild Aggregation pipeline: query logs → count + decay → rank → top-k per prefixWrites and reads are fully separated. The pipeline crunches logs every few minutes and ships a fresh, immutable index to the serving tier; the serving tier only ever reads. That separation is what lets reads be blindingly fast — they never contend with ranking computation.
What does this buy us, and what does it cost? Precomputing top-k per prefix turns every query into an O(1)-ish lookup — that’s the latency win. The cost is freshness lag (the index is as old as the last rebuild) and storage amplification (you store top-k for every prefix, not just whole queries). We pay memory and staleness to buy speed, which is exactly the trade typeahead wants.
6. Scaling and deep dives
Section titled “6. Scaling and deep dives”Trie vs prefix index. A trie shares common prefixes, so it’s memory-efficient and naturally supports “extend the prefix by one char.” A flat prefix index is dead simple to shard and cache (it’s just key→value) but stores redundant lists. In practice large systems lean on the flat, precomputed, cached form because it maps perfectly onto a CDN/KV cache — and caching is where the real latency comes from. See Caching Strategies.
Caching hot prefixes. Prefix popularity follows a brutal power law: a few thousand prefixes (“a”, “f”, “new”, “you”) cover the vast majority of lookups. Cache those aggressively at the edge and in service memory, with a short TTL so trending terms still flow in.
Cache layers, fastest → slowest: CDN edge (ms, geo-near user, covers head prefixes) service memory (sub-ms, covers head + warm prefixes) serving store (single-digit ms, covers the long tail)Debouncing (client-side). Firing a request on every keystroke wastes the network and races responses out of order. The client debounces (~100–150 ms of quiet) and cancels stale in-flight requests, so a fast typer sends one lookup for “new york” instead of seven. This single client trick removes most of the load before it ever reaches your servers.
Ranking. Score = popularity with time decay (so last year’s hit fades), plus optional personalization and a freshness boost for trending terms. Crucially, ranking happens offline in the pipeline; the serving tier just returns the precomputed order.
Sharding & freshness. Shard the serving store by prefix range or hash so the corpus spreads across machines. Rebuild incrementally — a streaming layer can nudge counts for trending terms between full rebuilds so “breaking news” surfaces in minutes, not hours.
7. Key trade-offs
Section titled “7. Key trade-offs”Decision Buys you Costs you─────────────────────────────────────────────────────────────────────Precompute top-k O(1) reads, no storage blow-up,per prefix request-time sort freshness lagFlat prefix index trivial sharding + redundant storage vs trie CDN/KV cachingTrie memory sharing, harder to shard/cache incremental prefixEdge caching head most traffic never stale head if TTL too longprefixes hits your serversClient debouncing ~5–7× fewer requests tiny added input latencyOffline ranking fast serving tier minutes-old rankingsThe throughline: typeahead is read-path engineering. You move all the hard work (ranking, aggregation) offline and to write time, leaving the read path as little more than a cached lookup — because the only thing the user can feel is how fast that lookup returns.
Check your understanding
Section titled “Check your understanding”- Why can typeahead tolerate stale and slightly-wrong results but not slow ones, and how does that shape every design choice?
- Contrast a trie with a flat prefix index — why do large systems often prefer the flat form despite its storage redundancy?
- What is client-side debouncing and roughly how much load does it remove before requests reach the server?
- Why are single-character prefixes both the hottest and the safest to cache with a long TTL?
- Where does ranking happen, and why is doing it offline essential to sub-10 ms serving?