Design a Rate Limiter
A rate limiter caps how many requests a client may make in a window (“100 requests/minute”). It’s a deceptively rich design problem: the single-machine version is a few lines, but doing it correctly across a fleet of servers forces you to confront shared state, latency, and accuracy trade-offs. (For the concept, see Rate Limiting; here we design one.)
1. Requirements
Section titled “1. Requirements”Functional
- Limit requests per client (by user ID, API key, or IP) to N per time window.
- Return
429 Too Many Requests(with aRetry-Afterheader) when exceeded. - Limits must apply across the whole fleet, not per server.
Non-functional
- Low latency — the limiter is on every request’s hot path, so it must add single-digit milliseconds at most.
- Highly available — if the limiter is down, fail open (allow) or closed (block)? A deliberate choice (see below).
- Memory-efficient at scale (millions of clients).
2. Estimation
Section titled “2. Estimation”Say 1M active clients and 10k requests/second globally. The limiter does at least one read+write of counter state per request → ~10k ops/s against the counter store, and must hold ~1M small counters. That comfortably fits an in-memory store like Redis — which points us at the architecture.
3. Algorithm choice
Section titled “3. Algorithm choice”The heart of the design. Each trades accuracy for simplicity/cost:
| Algorithm | Idea | Pro | Con |
|---|---|---|---|
| Fixed window | Count per clock-aligned window | Trivial, 1 counter | Burst at edges: 2× limit across a boundary |
| Sliding window log | Store timestamp of every request | Exact | Memory-heavy (one entry per request) |
| Sliding window counter | Weighted blend of current + previous window | Smooth, cheap | Slight approximation |
| Token bucket | Tokens refill at a steady rate; each request spends one | Allows controlled bursts | Two values to track (tokens, last-refill) |
| Leaky bucket | Requests queue and drain at a fixed rate | Smooths output | Adds queuing latency |
Token bucket is the common default: it permits short bursts (good UX) while enforcing a long-run
rate, and needs just (tokens, last_refill_time) per client.
tokens refill at R/sec, capacity = Brequest arrives → refill tokens based on elapsed time → if tokens ≥ 1: allow, tokens-- ; else: 4294. The distributed problem
Section titled “4. The distributed problem”A per-server counter is wrong: with 10 servers each allowing 100/min, a client gets 1,000/min. The limit must live in shared state.
request ──► any app server ──► [ Redis: client:123 → {tokens, ts} ] ──► allow / 429 (stateless) single source of truth, fleet-wideThe update must be atomic — read-modify-write across many servers races. Use Redis with an atomic
Lua script (or INCR + EXPIRE for window counters) so the check-and-decrement happens in one
indivisible operation.
5. Deeper trade-offs
Section titled “5. Deeper trade-offs”- Where to enforce: an API gateway is the natural home — limit once at the edge before requests fan into the system.
- Latency vs accuracy: a strictly-correct global counter means a network hop to Redis per request. For extreme throughput, allow a small local token cache per server (approximate, occasionally over-permits) and reconcile with the central store — trading perfect accuracy for speed.
- Fail open or closed? If Redis is unreachable, failing open (allow) preserves availability but removes protection; failing closed (block) protects the backend but causes an outage. Most user-facing APIs fail open for availability; abuse-sensitive endpoints fail closed.
The thread
Section titled “The thread”What does this buy us, and what does it cost? A rate limiter buys protection (against overload, abuse, runaway clients) and fairness — at the cost of latency on every request and a shared-state dependency. The central design tension is accuracy vs speed: perfectly global counting is a hop away on every call; cheap local approximations are fast but leak a little capacity. Pick the point that matches whether you’re protecting a fragile backend or just shaping traffic.
Check your understanding
Section titled “Check your understanding”- Why is a per-server counter incorrect, and where must the limit’s state live?
- Why is token bucket often preferred over fixed window?
- Describe the race condition in a naïve GET-then-SET limiter and how atomicity fixes it.
- Explain the accuracy-vs-latency trade-off of a per-server local token cache.
- When would you fail open vs fail closed if the counter store is unreachable?