Rate Limiting

A service can only do so much work per second. Push more at it and latency climbs, queues grow, memory swells, and eventually it falls over — taking everyone’s requests down, not just the excess. Rate limiting is the deliberate act of saying “no” to some requests so the service can keep saying “yes” to the rest. It’s a throttle: it caps the rate of accepted work to a level the system can actually sustain.

The motivations are twofold and worth separating, because they pull in slightly different directions. Protection: stop any source — a buggy client in a retry loop, a scraper, a denial-of-service attack, a sudden viral spike — from overwhelming the service. Fairness: ensure no single user can monopolize shared capacity at the expense of everyone else. The same mechanism serves both, but “protect the service” and “be fair to users” occasionally want different limits, as we’ll see.

The core algorithms

There are four classic algorithms. They differ in how they handle bursts — short spikes of traffic above the average rate — which turns out to be the whole game.

Token bucket — allow controlled bursts

Imagine a bucket that holds up to B tokens and refills at R tokens per second. Each request must take one token to proceed; if the bucket is empty, the request is rejected (or queued). Because the bucket can hold up to B tokens, a client that’s been quiet can spend a burst of up to B requests at once, then is limited to the steady refill rate R thereafter.

   refill at R tokens/sec
            │
            ▼
        ┌───────┐  bucket holds up to B tokens
        │ ● ● ● │  each request removes one token
        │ ● ●   │  empty bucket → request rejected
        └───────┘

This is the most widely used algorithm because the burst behavior matches reality: real users are bursty (a page load fires ten requests at once), and token bucket accommodates legitimate bursts while still enforcing a long-run average of R. Buys: flexibility for natural bursts. Costs: a client can still hit you with a full bucket’s burst, so everything downstream must tolerate B requests arriving at once.
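The bucket described above can be sketched in a few lines. This is a minimal, single-process illustration rather than a production limiter: the class name TokenBucket, the allow method, and the injectable clock parameter are assumptions made for this example, and the bucket is assumed to start full.

```python
import time


class TokenBucket:
    """Holds up to `capacity` (B) tokens, refilled at `rate` (R) tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock              # injectable so the example can be tested
        self.tokens = float(capacity)   # assumption: start full, so a quiet client may burst
        self.last = clock()

    def allow(self):
        """Take one token if available; return True if the request may proceed."""
        now = self.clock()
        # Credit tokens for the time elapsed since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Usage with a fake clock: B = 5, R = 2 tokens/sec.
now = [0.0]
bucket = TokenBucket(rate=2, capacity=5, clock=lambda: now[0])
burst = [bucket.allow() for _ in range(6)]   # five pass, the sixth is rejected
now[0] += 1.0                                # one second later, two tokens are back
later = [bucket.allow() for _ in range(3)]   # two pass, the third is rejected
```

Refilling lazily on each call, instead of running a timer, keeps the state down to two numbers per client.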

Leaky bucket — smooth the output

A leaky bucket is a queue that drains at a fixed rate R. Requests pour in and are processed at a steady drip regardless of how spiky the arrivals are; if the queue fills, new requests overflow and are dropped. Where token bucket allows bursts through, leaky bucket absorbs them and emits a perfectly smooth stream.

Buys: a constant, predictable output rate — ideal when the thing you’re protecting hates bursts (e.g., a downstream with a hard per-second ceiling). Costs: added latency (requests wait in the queue) and no burst friendliness — a legitimate spike gets smeared out or dropped.
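A rough sketch of the queue-based leaky bucket follows. The names LeakyBucket, offer, and drain are hypothetical, and the example assumes some worker calls drain regularly; a real implementation would more likely sleep until the next release time.

```python
import time
from collections import deque


class LeakyBucket:
    """Bounded queue that releases requests at a fixed `rate` per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.interval = 1.0 / rate      # seconds between two releases
        self.capacity = capacity        # queue size; arrivals beyond it overflow
        self.clock = clock
        self.queue = deque()
        self.next_release = clock()     # earliest time the next request may leave

    def offer(self, request):
        """Enqueue a request; return False if the queue is full (dropped)."""
        if len(self.queue) >= self.capacity:
            return False
        if not self.queue:
            # The bucket was idle: idle time earns no credit, so the next
            # release happens no earlier than now.
            self.next_release = max(self.next_release, self.clock())
        self.queue.append(request)
        return True

    def drain(self):
        """Return the requests whose release time has arrived, in arrival order."""
        now = self.clock()
        released = []
        while self.queue and self.next_release <= now:
            released.append(self.queue.popleft())
            self.next_release += self.interval
        return released


# Usage with a fake clock: a burst of five arrives at once, R = 2/sec, queue of 3.
now = [0.0]
bucket = LeakyBucket(rate=2, capacity=3, clock=lambda: now[0])
accepted = [bucket.offer(i) for i in range(5)]   # the last two overflow
out_t0 = bucket.drain()                          # one request leaves immediately
now[0] = 0.5
out_t05 = bucket.drain()                         # the next leaves half a second later
```

The burst of five arrivals becomes a steady one-per-half-second output, at the price of two dropped requests and added waiting time for the queued ones.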

Fixed window — simple, but bursty at the seams

Count requests per fixed time window: “max 100 per minute.” Reset the counter each minute. Dead simple and cheap. The flaw is the boundary burst: a client can send 100 requests at 11:00:59 and another 100 at 11:01:00. That is 200 requests in about one second, double the intended per-minute limit, because the counter reset between them.

   minute window      |  minute window
   ......... 100 reqs  | 100 reqs .........
            ^ 11:00:59 | 11:01:00 ^
            └─ 200 requests in ~1 second ─┘
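The boundary burst is easy to reproduce. The sketch below is a minimal fixed-window counter; the name FixedWindowLimiter and the fake clock are assumptions for illustration, and times are plain seconds rather than wall-clock timestamps.

```python
import time


class FixedWindowLimiter:
    """Allow at most `limit` requests per window of `window_seconds`."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.window_id = None   # index of the window currently being counted
        self.count = 0

    def allow(self):
        window_id = int(self.clock() // self.window)
        if window_id != self.window_id:
            # A new window has started: reset the counter.
            self.window_id = window_id
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False


# "Max 100 per minute", hammered one second before and right at the boundary.
now = [59.0]
limiter = FixedWindowLimiter(limit=100, window_seconds=60, clock=lambda: now[0])
before = sum(limiter.allow() for _ in range(150))   # accepted at second 59
now[0] = 60.0
after = sum(limiter.allow() for _ in range(150))    # accepted at second 60
total_in_one_second = before + after
```

Each window on its own respects the limit of 100, yet 200 requests are accepted within one second of each other.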

Sliding window — accurate, at a cost

Sliding-window approaches fix the boundary problem by considering a rolling time range rather than fixed buckets. Sliding window log keeps a timestamp for every request and counts those within the last 60 seconds. It is exact, but memory-hungry, because you store every request. Sliding window counter approximates this with just two counters, one for the current fixed window and one for the previous: it weights the previous window’s count by the fraction of that window still inside the rolling range and adds the current count, getting most of the accuracy with far less storage.
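Both variants can be sketched side by side. The class names below are hypothetical, and the counter uses one common form of the weighting, estimate = previous × (1 − elapsed fraction of the current window) + current, which assumes the previous window's requests were spread evenly.

```python
import time
from collections import deque


class SlidingWindowLog:
    """Exact: remembers the timestamp of every accepted request."""

    def __init__(self, limit, window, clock=time.monotonic):
        self.limit, self.window, self.clock = limit, window, clock
        self.timestamps = deque()

    def allow(self):
        now = self.clock()
        # Forget requests that have slid out of the rolling range.
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False


class SlidingWindowCounter:
    """Approximate: two counters instead of one timestamp per request."""

    def __init__(self, limit, window, clock=time.monotonic):
        self.limit, self.window, self.clock = limit, window, clock
        self.current_id, self.current, self.previous = None, 0, 0

    def allow(self):
        now = self.clock()
        window_id = int(now // self.window)
        if window_id != self.current_id:
            # Roll over; the old count only matters if it is the adjacent window.
            adjacent = self.current_id is not None and window_id == self.current_id + 1
            self.previous = self.current if adjacent else 0
            self.current, self.current_id = 0, window_id
        elapsed = (now % self.window) / self.window
        estimate = self.previous * (1 - elapsed) + self.current
        if estimate < self.limit:
            self.current += 1
            return True
        return False


# Replay the boundary burst: 100 requests at second 59, 100 more at second 60.
now = [59.0]
log = SlidingWindowLog(100, 60, clock=lambda: now[0])
counter = SlidingWindowCounter(100, 60, clock=lambda: now[0])
first = (sum(log.allow() for _ in range(100)), sum(counter.allow() for _ in range(100)))
now[0] = 60.0
second = (sum(log.allow() for _ in range(100)), sum(counter.allow() for _ in range(100)))
now[0] = 90.0
third = (sum(log.allow() for _ in range(100)), sum(counter.allow() for _ in range(100)))
```

Both limiters reject the second burst. At second 90 the log still rejects everything, since all 100 earlier requests are inside the last 60 seconds, while the counter assumes half of them have aged out and admits 50: that gap is the approximation being traded for lower memory.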

Algorithm	Bursts	Memory	Accuracy
Token bucket	Allows up to B	Low	Good
Leaky bucket	Smooths away	Low (queue)	Good
Fixed window	Boundary spikes	Lowest	Poor at edges
Sliding window	Controlled	Higher	Best

Where to enforce it

Rate limiting is most effective at the edge, before bad traffic consumes deep resources. The earlier you reject a request, the less it costs you. The natural home is the API gateway or reverse proxy — it sees all inbound traffic, can limit per-API-key/IP/user before requests ever reach your services, and centralizes the policy. You’ll often layer limits: a global limit at the edge (protect the whole system), per-tenant limits (fairness), and a per-service limit deeper in (protect one component).

Protection vs. fairness, and what to do at the limit

The two motivations can conflict. A strict fairness limit (each user gets 1/N of capacity) can leave the service underused when most users are idle, whereas protection alone would happily let one active user consume the spare capacity. A pure protection limit (cap total load only) lets one heavy user consume capacity right up to the global ceiling, starving everyone else. Real systems blend them: a generous per-user limit for fairness, plus a global limit for protection, and sometimes dynamic limits that tighten as the system approaches saturation.
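One way to picture the blend is a per-user token bucket checked together with a single global one. The sketch below is illustrative only: Bucket, LayeredLimiter, and the returned reason strings are invented names, and the sketch checks both buckets before spending a token from either so that a global rejection does not also cost the user a token.

```python
import time


class Bucket:
    """Compact token bucket, as in the token bucket section above."""

    def __init__(self, rate, capacity, clock):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.tokens, self.last = float(capacity), clock()

    def has_token(self):
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        return self.tokens >= 1

    def take(self):
        self.tokens -= 1


class LayeredLimiter:
    """Per-user limit for fairness plus a global limit for protection."""

    def __init__(self, user_rate, user_burst, global_rate, global_burst,
                 clock=time.monotonic):
        self.clock = clock
        self.user_rate, self.user_burst = user_rate, user_burst
        self.global_bucket = Bucket(global_rate, global_burst, clock)
        self.users = {}   # user id -> Bucket, created on first request

    def allow(self, user_id):
        user = self.users.get(user_id)
        if user is None:
            user = self.users[user_id] = Bucket(self.user_rate, self.user_burst, self.clock)
        if not user.has_token():
            return "user_limit"      # this user exceeded their fair share
        if not self.global_bucket.has_token():
            return "global_limit"    # the system as a whole is at its ceiling
        user.take()
        self.global_bucket.take()
        return "ok"


# Each user may burst 3 requests; the whole system tolerates a burst of 5.
now = [0.0]
limiter = LayeredLimiter(user_rate=1, user_burst=3, global_rate=1, global_burst=5,
                         clock=lambda: now[0])
alice = [limiter.allow("alice") for _ in range(4)]
bob = [limiter.allow("bob") for _ in range(3)]
```

Alice is stopped by her own limit after three requests, while Bob's third request is within his share but is stopped by the global ceiling. Returning the reason lets the caller report which limit was hit.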

When a request exceeds the limit, you have choices, each a trade-off. Reject it (return HTTP 429 “Too Many Requests”, ideally with a Retry-After header so well-behaved clients back off) — fast and protective, but the client loses the request. Throttle/queue it — kinder to the client, but adds latency and risks the queue itself becoming a resource sink. The honest signal is the 429 with Retry-After: it tells clients to slow down rather than retry-storm you, tying back to the backoff discipline from Timeouts, Retries & Backoff.
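As a rough illustration of the reject path, the helper below builds a 429 response whose Retry-After value is derived from a token bucket's state. The function names and the (status, headers, body) tuple shape are assumptions for this sketch, not any particular framework's API; Retry-After is given here as a whole number of seconds.

```python
import math
from http import HTTPStatus


def retry_after_seconds(tokens, rate):
    """Whole seconds until a bucket holding `tokens`, refilled at `rate`
    tokens per second, has at least one token again."""
    if tokens >= 1:
        return 0
    # Round up, and never advertise less than one second.
    return max(1, math.ceil((1 - tokens) / rate))


def rate_limited_response(tokens, rate):
    """Build a (status, headers, body) tuple for a rejected request."""
    wait = retry_after_seconds(tokens, rate)
    headers = {"Retry-After": str(wait), "Content-Type": "text/plain"}
    body = f"Rate limit exceeded. Retry in {wait} seconds.\n"
    return HTTPStatus.TOO_MANY_REQUESTS.value, headers, body


# An empty bucket that refills one token every four seconds.
status, headers, body = rate_limited_response(tokens=0.0, rate=0.25)
```

Deriving the wait from the limiter's own refill rate means a client that honors the header comes back at about the moment a token is available, rather than guessing.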

What does rate limiting buy us, and what does it cost? It buys survival under overload and protection against abusive or runaway clients: a service that throttles excess stays up for everyone. It costs you rejected legitimate requests at the margin, the operational burden of choosing and tuning limits, and, at scale, where a fleet of servers must enforce one limit together, a dependency on a shared counter that sits on the hot path of every request.

Check your understanding

Contrast token bucket and leaky bucket specifically in how they handle bursts. When would you prefer each?
Explain the fixed-window boundary problem with a concrete example, and how sliding window fixes it (and at what cost).
Why is the edge / API gateway the natural place to enforce rate limits? What does layering limits at multiple tiers buy you?
Why is rate limiting across a fleet of servers harder than on a single server, and what’s the standard fix and its downside?
Distinguish the “protection” and “fairness” motivations, and give a scenario where a limit serving one undermines the other.