Caching

Caching is one of the highest-leverage performance techniques in systems design, and it rests on a single observation: the same answers are requested over and over. If computing or fetching an answer is expensive, store it the first time and hand out the copy at almost no cost thereafter. A cache is just a fast store of answers you’ve already produced, placed close to where they’re needed.

The leverage is enormous — a read that took 50 ms against a database can take well under a millisecond from memory — but caching carries one unavoidable cost that shapes every decision around it: a cache holds a copy, and copies go stale. The entire discipline of caching is managing the gap between the copy and the truth.

What does a cache buy you, and what does it cost? It buys lower latency and higher throughput, and it relieves load on the slow thing behind it. It costs you staleness, memory, and a new source of bugs: serving data that is no longer correct.

Why it works: locality

Caches pay off because real access patterns are skewed, not uniform:

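Temporal locality: something accessed now is likely to be accessed again soon (a trending post, a logged-in user’s profile).
Popularity skew: a small fraction of items gets the vast majority of requests (the classic 80/20, often far steeper). Cache the hot few and you serve most traffic from cache. (Spatial locality in the strict sense, where items stored near a recently accessed one are likely to be accessed next, is a separate effect that matters most for CPU and disk caches.)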

If every item were equally and rarely accessed, a cache would help little. It’s the skew that makes caching a superpower.
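The effect of skew can be checked with a small simulation. The sketch below replays two synthetic request streams through the same small LRU cache: one uniform, one with a Zipf-like popularity curve. The catalogue size, cache size, request count, and the 1/rank weighting are all assumptions chosen for illustration, not measurements from a real system.

```python
import random
from collections import OrderedDict


def simulate_hit_ratio(keys, capacity):
    """Replay a sequence of keys through a small LRU cache; return the hit ratio."""
    cache = OrderedDict()
    hits = 0
    for key in keys:
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used key
    return hits / len(keys)


rng = random.Random(42)       # fixed seed so the run is repeatable
items = list(range(10_000))   # hypothetical catalogue of 10,000 items
requests = 50_000
capacity = 500                # the cache holds 5% of the catalogue

# Uniform stream: every item is equally likely.
uniform = rng.choices(items, k=requests)

# Skewed stream: weight 1/rank, a Zipf-like curve (an assumption).
weights = [1 / (rank + 1) for rank in items]
skewed = rng.choices(items, weights=weights, k=requests)

uniform_ratio = simulate_hit_ratio(uniform, capacity)
skewed_ratio = simulate_hit_ratio(skewed, capacity)
print(f"uniform: {uniform_ratio:.2f}  skewed: {skewed_ratio:.2f}")
```

With uniform traffic the hit ratio stays close to the fraction of the catalogue that fits in the cache; with skewed traffic the same cache serves a far larger share of requests.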

The one metric: hit ratio

The number that decides whether a cache is worth anything is the cache hit ratio — the fraction of requests served from the cache rather than the slow backend:

   hit ratio = hits / (hits + misses)

   effective latency ≈ hit_ratio × cache_latency
                     + (1 − hit_ratio) × backend_latency

A small ratio change matters more than it looks. Going from 90% to 99% hits cuts the miss rate by 10× — and misses are where all the slow, expensive work happens. The miss path is also where danger lives: when many misses hit the backend at once (a stampede), the cache that was protecting your database becomes the funnel that overloads it. That failure mode has its own page — Cache Stampede & Thundering Herd.
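To make the arithmetic concrete, here is a minimal sketch of the effective-latency formula. The 50 ms backend figure comes from the earlier example; the 0.5 ms cache latency is an assumed value standing in for "well under a millisecond".

```python
def effective_latency_ms(hit_ratio, cache_ms, backend_ms):
    """Average latency per request for a given hit ratio."""
    return hit_ratio * cache_ms + (1 - hit_ratio) * backend_ms


CACHE_MS = 0.5     # assumed in-memory read latency
BACKEND_MS = 50.0  # database read latency from the earlier example

for ratio in (0.50, 0.59, 0.90, 0.99):
    latency = effective_latency_ms(ratio, CACHE_MS, BACKEND_MS)
    print(f"{ratio:.0%} hits -> {latency:.2f} ms average")
```

Under these assumptions, moving from 90% to 99% takes the average from about 5.45 ms to about 1 ms, while moving from 50% to 59% only takes it from about 25 ms to about 21 ms. The same nine points of hit ratio remove 90% of the misses in the first case and 18% in the second.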

Eviction: a cache is finite

A cache is small on purpose (fast memory is expensive), so it must constantly decide what to throw out to make room. The eviction policy is that decision:

Policy	Evicts	Good when
LRU (Least Recently Used)	the item untouched longest	recent use predicts future use (usually)
LFU (Least Frequently Used)	the item with fewest hits	popularity is stable over time
TTL (Time To Live)	anything older than a set age	data has a natural freshness window
FIFO	the oldest inserted	simple; rarely optimal

LRU is the workhorse — it captures temporal locality cheaply. LFU resists a burst of one-off requests evicting your genuinely popular items, but adapts poorly when popularity shifts. TTL is special: it’s not really about space, it’s about correctness — it’s how you cap staleness by saying “this copy may be wrong after N seconds, so stop trusting it.” Most real caches combine TTL (for freshness) with LRU/LFU (for space).
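The TTL-plus-LRU combination can be sketched in a few lines. The class below is a hypothetical, single-threaded illustration (the name TTLLRUCache and its parameters are invented for this page), not the implementation used by any particular cache server: TTL decides when a copy stops being trusted, LRU decides what to drop when space runs out.

```python
import time
from collections import OrderedDict


class TTLLRUCache:
    """Hypothetical cache combining an LRU size bound with a per-entry TTL."""

    def __init__(self, capacity, ttl_seconds, clock=time.monotonic):
        self.capacity = capacity
        self.ttl = ttl_seconds
        self.clock = clock              # injectable so tests can fake time
        self._data = OrderedDict()      # key -> (value, expires_at)

    def get(self, key):
        """Return the cached value, or None on a miss (so None is not storable)."""
        entry = self._data.get(key)
        if entry is None:
            return None                 # never stored, or already evicted
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._data[key]         # too old to trust: treat as a miss
            return None
        self._data.move_to_end(key)     # record the use for LRU ordering
        return value

    def put(self, key, value):
        self._data[key] = (value, self.clock() + self.ttl)
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict the least recently used


cache = TTLLRUCache(capacity=2, ttl_seconds=30)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")      # touch "a", so "b" is now the least recently used
cache.put("c", 3)   # over capacity: "b" is evicted
```

Expired entries are removed lazily, on the next read of that key. A production cache would typically also sweep expired entries in the background and guard the structure with a lock.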

The cache hierarchy

Caching isn’t one box — it’s a series of them along the request path, each closer to the user and each catching what the one behind it would have served:

   Browser cache        ← on the user's device; zero network. Static assets, API responses.
        │ miss
   CDN edge cache       ← near the user (see CDNs page). Static + cacheable dynamic content.
        │ miss
   Reverse-proxy cache  ← at your front door (e.g. Varnish/nginx). Full-page or fragment cache.
        │ miss
   Application cache     ← in-memory near app servers (Redis/Memcached). Query results, sessions, objects.
        │ miss
   Database cache        ← the DB's own buffer pool / query cache.
        │ miss
   Disk / origin compute ← the slow source of truth. The thing every layer above is protecting.

The deeper a request falls, the more expensive it gets. Each layer’s job is to make sure as few requests as possible fall to the next one. A cache miss at the browser might still hit at the CDN; a CDN miss might still hit the app cache. The layers compound: a request reaches the origin only if it misses at every layer, so the overall miss ratio is the product of the per-layer miss ratios.
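The compounding effect of stacked layers can be worked through numerically. This sketch assumes each layer's hit ratio is measured on the traffic that actually reaches it and that layers behave independently; the three ratios are hypothetical.

```python
from math import prod


def overall_hit_ratio(layer_hit_ratios):
    """Fraction of requests served by some cache layer.

    Each ratio applies to the traffic reaching that layer, so the overall
    miss ratio is the product of the per-layer miss ratios.
    """
    overall_miss = prod(1 - h for h in layer_hit_ratios)
    return 1 - overall_miss


# Hypothetical per-layer hit ratios: browser, CDN, application cache.
layers = [0.30, 0.60, 0.80]
print(f"reaches origin: {1 - overall_hit_ratio(layers):.1%}")
```

With these numbers no single layer exceeds an 80% hit ratio, yet only about 5.6% of requests reach the origin (0.7 × 0.4 × 0.2).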

Staleness: the cost you can never fully escape

The moment you store a copy, you accept that the truth might move and your copy won’t know. Every caching strategy is a different answer to the question of how to deal with that: expire the copy after a TTL, update the cache when the source changes, or accept eventual convergence. These strategies (write-through, write-back, write-around, cache-aside) and their consistency implications are the subject of Caching Strategies, which builds directly on this page. Here, hold onto the principle:

There is no caching without staleness. Caching is the art of choosing how much staleness you can tolerate, for which data, and how you’ll bound it.

Some data tolerates seconds of staleness without anyone noticing (a view count). Some tolerates none (an account balance, a permission check) and either shouldn’t be cached or must be invalidated precisely. Knowing which is which — per field, not per system — is the skill.
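One way to make the per-field decision explicit is to write the staleness budget down as data. The sketch below is purely illustrative: the field names and TTL values are hypothetical, and it models only the "don't cache" option for zero-tolerance data, not precise invalidation.

```python
from typing import Optional

# Hypothetical staleness budgets in seconds. None means "do not cache".
STALENESS_BUDGET_S: dict[str, Optional[int]] = {
    "post.view_count": 30,      # a few seconds stale goes unnoticed
    "user.display_name": 300,
    "account.balance": None,    # must always reflect the source of truth
    "user.permissions": None,
}


def cache_ttl(field: str) -> Optional[int]:
    """Return the TTL for a field, or None if it must be read from the source.

    Unknown fields default to None: skipping the cache is the safe failure mode.
    """
    return STALENESS_BUDGET_S.get(field)


print(cache_ttl("post.view_count"))   # TTL in seconds
print(cache_ttl("account.balance"))   # None: bypass the cache
```

Defaulting unknown fields to "not cached" means a forgotten entry costs performance rather than correctness.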

Check your understanding

Why does caching depend on locality? What would a perfectly uniform access pattern do to a cache’s value?
Why does improving hit ratio from 90% to 99% matter so much more than from 50% to 59%?
Contrast LRU and LFU. Why is TTL eviction fundamentally about correctness rather than space?
Walk a request down the cache hierarchy. Why does per-user data belong in the app cache while a public image belongs at the CDN?
Restate the core cost of caching in one sentence. Give one piece of data you’d never cache and say why.