Design a Rate Limiter

A rate limiter caps how many requests a client may make in a window (“100 requests/minute”). It’s a deceptively rich design problem: the single-machine version is a few lines, but doing it correctly across a fleet of servers forces you to confront shared state, latency, and accuracy trade-offs. (For the concept, see Rate Limiting; here we design one.)

1. Requirements

Functional

Limit requests per client (by user ID, API key, or IP) to N per time window.
Return 429 Too Many Requests (with a Retry-After header) when exceeded.
Limits must apply across the whole fleet, not per server.

Non-functional

Low latency — the limiter is on every request’s hot path, so it must add single-digit milliseconds at most.
Highly available — if the limiter is down, fail open (allow) or closed (block)? A deliberate choice (see below).
Memory-efficient at scale (millions of clients).

2. Estimation

Say 1M active clients and 10k requests/second globally. The limiter does at least one read+write of counter state per request → ~10k ops/s against the counter store, and must hold ~1M small counters. That comfortably fits an in-memory store like Redis — which points us at the architecture.

3. Algorithm choice

The heart of the design. Each trades accuracy for simplicity/cost:

Algorithm	Idea	Pro	Con
Fixed window	Count per clock-aligned window	Trivial, 1 counter	Burst at edges: 2× limit across a boundary
Sliding window log	Store timestamp of every request	Exact	Memory-heavy (one entry per request)
Sliding window counter	Weighted blend of current + previous window	Smooth, cheap	Slight approximation
Token bucket	Tokens refill at a steady rate; each request spends one	Allows controlled bursts	Two values to track (tokens, last-refill)
Leaky bucket	Requests queue and drain at a fixed rate	Smooths output	Adds queuing latency

Token bucket is the common default: it permits short bursts (good UX) while enforcing a long-run rate, and needs just (tokens, last_refill_time) per client.

tokens refill at R/sec, capacity = B
request arrives → refill tokens based on elapsed time → if tokens ≥ 1: allow, tokens-- ; else: 429

4. The distributed problem

A per-server counter is wrong: with 10 servers each allowing 100/min, a client gets 1,000/min. The limit must live in shared state.

   request ──► any app server ──► [ Redis: client:123 → {tokens, ts} ] ──► allow / 429
                 (stateless)          single source of truth, fleet-wide

The update must be atomic — read-modify-write across many servers races. Use Redis with an atomic Lua script (or INCR + EXPIRE for window counters) so the check-and-decrement happens in one indivisible operation.

5. Deeper trade-offs

Where to enforce: an API gateway is the natural home — limit once at the edge before requests fan into the system.
Latency vs accuracy: a strictly-correct global counter means a network hop to Redis per request. For extreme throughput, allow a small local token cache per server (approximate, occasionally over-permits) and reconcile with the central store — trading perfect accuracy for speed.
Fail open or closed? If Redis is unreachable, failing open (allow) preserves availability but removes protection; failing closed (block) protects the backend but causes an outage. Most user-facing APIs fail open for availability; abuse-sensitive endpoints fail closed.

The thread

What does this buy us, and what does it cost? A rate limiter buys protection (against overload, abuse, runaway clients) and fairness — at the cost of latency on every request and a shared-state dependency. The central design tension is accuracy vs speed: perfectly global counting is a hop away on every call; cheap local approximations are fast but leak a little capacity. Pick the point that matches whether you’re protecting a fragile backend or just shaping traffic.

Check your understanding

Why is a per-server counter incorrect, and where must the limit’s state live?
Why is token bucket often preferred over fixed window?
Describe the race condition in a naïve GET-then-SET limiter and how atomicity fixes it.
Explain the accuracy-vs-latency trade-off of a per-server local token cache.
When would you fail open vs fail closed if the counter store is unreachable?