Design a Typeahead / Autocomplete

You type “new yo” into a search box and, before your next keystroke, a list appears: new york, new york times, new york weather. That list arrived in under ~50 ms, was chosen from billions of possible queries, and is ranked so the thing most people want sits on top. Typeahead is a deceptively deep system: it’s a read-heavy, latency-obsessed ranking problem where being a little wrong but fast beats being perfect and slow. This page builds it from the data structure up.

1. Requirements

Functional

Given a prefix, return the top k (usually 5–10) completions.
Rank by popularity (and ideally personalization/recency), not alphabetically.
Update suggestions as the corpus shifts (trending terms, new entities).
Tolerate typos lightly is a nice-to-have; exact-prefix is the core.

Non-functional

Latency is the product. End-to-end p99 under ~100 ms; the serving tier under ~10 ms.
Massively read-heavy. Reads dwarf writes by orders of magnitude.
High availability — a search box that hangs feels broken even if search itself works.
Eventual freshness is fine. A term trending 10 minutes ago appearing now is acceptable.

2. Estimation

Assume 10M searches/day with ~20 keystrokes each that fire a suggestion lookup → 200M lookups/day ≈ 2,300/sec average, with peaks 10×. Because reads dominate and the answer for “new yo” is identical for millions of users, this is an ideal caching workload — a small set of hot prefixes serves most traffic.

Corpus: ~100M distinct historical queries. A trie over them with shared prefixes fits comfortably in the tens of GB — small enough to hold in memory, which is the whole game for sub-10 ms serving.

3. API sketch

GET /v1/suggest?q=new+yo&limit=10&lang=en
→ 200
{
  "prefix": "new yo",
  "suggestions": [
    { "text": "new york",        "score": 0.91 },
    { "text": "new york times",  "score": 0.74 },
    { "text": "new york weather","score": 0.55 }
  ]
}

One tiny GET, called on nearly every keystroke. It must be cacheable end-to-end (CDN-friendly, Cache-Control short-lived) because the same prefix is asked millions of times.

4. Data model

Two representations, built offline and queried online:

Option A — TRIE (prefix tree), top-k cached at each node
   root
    └─ n ─ e ─ w ─ (space) ─ y ─ o
                              │   └─ [new york, new york times, ...]  ← precomputed top-k
   Each node stores the best k completions of the prefix ending there.

Option B — PREFIX INDEX (flat key→value)
   "new yo" → [new york, new york times, ...]
   "new yor"→ [new york, ...]
   A hash/sorted store keyed by every prefix, value = precomputed top-k list.

The aggregation pipeline (query logs → counts → ranking) runs offline and writes the precomputed top-k. Serving never sorts at request time — it does a lookup and returns.

5. High-level design

   browser
     │  debounced keystrokes (q="new yo")
     ▼
   CDN / edge cache ──hit──► returns top-k (most traffic stops here)
     │ miss
     ▼
   Suggestion service (stateless) ──► in-memory hot-prefix cache ──hit──► return
     │ miss
     ▼
   Serving store (trie / prefix index, in RAM, sharded by prefix)
     ▲
     │ periodic rebuild
   Aggregation pipeline:  query logs → count + decay → rank → top-k per prefix

Writes and reads are fully separated. The pipeline crunches logs every few minutes and ships a fresh, immutable index to the serving tier; the serving tier only ever reads. That separation is what lets reads be blindingly fast — they never contend with ranking computation.

What does this buy us, and what does it cost? Precomputing top-k per prefix turns every query into an O(1)-ish lookup — that’s the latency win. The cost is freshness lag (the index is as old as the last rebuild) and storage amplification (you store top-k for every prefix, not just whole queries). We pay memory and staleness to buy speed, which is exactly the trade typeahead wants.

6. Scaling and deep dives

Trie vs prefix index. A trie shares common prefixes, so it’s memory-efficient and naturally supports “extend the prefix by one char.” A flat prefix index is dead simple to shard and cache (it’s just key→value) but stores redundant lists. In practice large systems lean on the flat, precomputed, cached form because it maps perfectly onto a CDN/KV cache — and caching is where the real latency comes from. See Caching Strategies.

Caching hot prefixes. Prefix popularity follows a brutal power law: a few thousand prefixes (“a”, “f”, “new”, “you”) cover the vast majority of lookups. Cache those aggressively at the edge and in service memory, with a short TTL so trending terms still flow in.

Cache layers, fastest → slowest:
  CDN edge        (ms, geo-near user, covers head prefixes)
  service memory  (sub-ms, covers head + warm prefixes)
  serving store   (single-digit ms, covers the long tail)

Debouncing (client-side). Firing a request on every keystroke wastes the network and races responses out of order. The client debounces (~100–150 ms of quiet) and cancels stale in-flight requests, so a fast typer sends one lookup for “new york” instead of seven. This single client trick removes most of the load before it ever reaches your servers.

Ranking. Score = popularity with time decay (so last year’s hit fades), plus optional personalization and a freshness boost for trending terms. Crucially, ranking happens offline in the pipeline; the serving tier just returns the precomputed order.

Sharding & freshness. Shard the serving store by prefix range or hash so the corpus spreads across machines. Rebuild incrementally — a streaming layer can nudge counts for trending terms between full rebuilds so “breaking news” surfaces in minutes, not hours.

7. Key trade-offs

Decision                  Buys you                  Costs you
─────────────────────────────────────────────────────────────────────
Precompute top-k          O(1) reads, no            storage blow-up,
per prefix                request-time sort         freshness lag
Flat prefix index         trivial sharding +        redundant storage vs trie
                          CDN/KV caching
Trie                      memory sharing,           harder to shard/cache
                          incremental prefix
Edge caching head         most traffic never        stale head if TTL too long
prefixes                  hits your servers
Client debouncing         ~5–7× fewer requests      tiny added input latency
Offline ranking           fast serving tier         minutes-old rankings

The throughline: typeahead is read-path engineering. You move all the hard work (ranking, aggregation) offline and to write time, leaving the read path as little more than a cached lookup — because the only thing the user can feel is how fast that lookup returns.

Check your understanding

Why can typeahead tolerate stale and slightly-wrong results but not slow ones, and how does that shape every design choice?
Contrast a trie with a flat prefix index — why do large systems often prefer the flat form despite its storage redundancy?
What is client-side debouncing and roughly how much load does it remove before requests reach the server?
Why are single-character prefixes both the hottest and the safest to cache with a long TTL?
Where does ranking happen, and why is doing it offline essential to sub-10 ms serving?