Finding Performance Bottlenecks

Every other page in this part is a cure. This one is the diagnosis — and a cure applied to the wrong organ does nothing but add complexity. The hardest, most valuable skill in scaling is not knowing the techniques; it’s knowing which bottleneck you actually have before you spend effort relieving it. Engineers are notoriously, confidently wrong about where their systems spend time. The discipline is simple to state and hard to follow: measure first.

“Premature optimization is the root of all evil.” — Donald Knuth

Your mental model of where time goes is built from the code you wrote, but performance is decided by the code that runs — and the gap between them is enormous. The slow part is rarely the clever algorithm you fretted over; it’s the innocent-looking line that turns out to issue 400 database queries, or a lock everyone is quietly waiting on, or a network call you forgot was a network call.

 where you THINK time goes        where time ACTUALLY goes
┌──────────────────────┐         ┌──────────────────────┐
│ business logic  ████ │         │ business logic  ▌    │
│ serialization   ██   │         │ DB / I/O wait   █████│
│ DB / I/O wait   █    │         │ lock contention ██   │
└──────────────────────┘         └──────────────────────┘

Optimizing the wrong thing is worse than doing nothing: you add complexity and risk for zero speedup, and the real bottleneck is still there. So the first rule is: profile and measure on the real system under realistic load before changing anything. What does this buy us, and what does it cost? Measurement costs time up front; it buys you the certainty that your effort lands on the resource that’s actually saturated.
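
As a minimal sketch of "measure first", the snippet below profiles a toy request handler with Python's standard cProfile module. The function names (business_logic, load_rows, handle_request) and the 50 ms sleep standing in for I/O wait are assumptions made for illustration; a real profile would be taken on the real system under realistic load.

```python
import cProfile
import pstats
import time


def business_logic(rows):
    # The part we tend to worry about: pure in-memory computation.
    return sum(r * r for r in rows)


def load_rows():
    # Stand-in for a database or network call; the sleep simulates I/O wait.
    time.sleep(0.05)
    return list(range(1000))


def handle_request():
    return business_logic(load_rows())


def profile_request():
    profiler = cProfile.Profile()
    profiler.enable()
    handle_request()
    profiler.disable()
    # get_stats_profile() (Python 3.9+) exposes per-function timings by name.
    return pstats.Stats(profiler).get_stats_profile().func_profiles


if __name__ == "__main__":
    timings = profile_request()
    for name in ("business_logic", "load_rows"):
        print(f"{name:15s} cumulative {timings[name].cumtime:.4f}s")
```

Even in this toy case the cumulative time lands on the I/O stand-in rather than the computation, which is the pattern the diagram above describes.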

Brendan Gregg’s USE method is a fast, systematic checklist for finding a resource bottleneck. For every resource (CPU, memory, disk, network, and logical resources like DB connection pools or thread pools), ask three questions:

  • Utilization — what fraction of the time is it busy? (CPU at 95%? disk at 100%?)
  • Saturation — how much queued, waiting work is piling up beyond what it can handle? (a deep run queue, requests waiting for a connection)
  • Errors — is it throwing errors? (failed allocations, dropped packets, connection timeouts)

for each resource:
  Utilization?  ── is it busy?
  Saturation?   ── is work queuing up behind it?
  Errors?       ── is it failing?

The power of USE is that it’s exhaustive and cheap: a short, fixed list that points you at the saturated resource fast, instead of guessing. A resource with rising saturation is your prime suspect, especially when its utilization is also high. (Often the most revealing signal is saturation: a CPU at 70% with a long run queue is in more trouble than a CPU at 99% with nothing waiting.)
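
The checklist can be sketched as a small function that asks the three questions of each resource. Everything here is illustrative: the ResourceSample type, the 0.9 busy threshold, and the sample numbers are assumptions, and in practice the values would come from whatever metrics system you already run.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    name: str
    utilization: float  # fraction of time busy, 0.0 to 1.0
    saturation: int     # queued work waiting (run queue length, pool waiters)
    errors: int         # error count in the sampling window


def use_findings(sample, busy_threshold=0.9):
    """Return the USE questions this resource answers 'yes' to."""
    findings = []
    if sample.utilization >= busy_threshold:
        findings.append("utilization")
    if sample.saturation > 0:
        findings.append("saturation")
    if sample.errors > 0:
        findings.append("errors")
    return findings


def rank_suspects(samples):
    """Keep flagged resources, ordered so queued work outranks raw busyness."""
    flagged = [s for s in samples if use_findings(s)]
    return sorted(flagged, key=lambda s: (s.saturation, s.utilization), reverse=True)


# Hypothetical readings for one host.
SAMPLES = [
    ResourceSample("cpu", utilization=0.70, saturation=12, errors=0),
    ResourceSample("disk", utilization=0.99, saturation=0, errors=0),
    ResourceSample("db_pool", utilization=0.40, saturation=0, errors=0),
    ResourceSample("network", utilization=0.20, saturation=0, errors=3),
]

if __name__ == "__main__":
    for s in rank_suspects(SAMPLES):
        print(s.name, use_findings(s))
```

Sorting on saturation first is a design choice that mirrors the point above: the CPU at 70% with work queued ranks ahead of the disk at 99% with nothing waiting.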

Once you’re measuring, the same offenders show up again and again. Knowing them sharpens where you look.

In typical request-serving systems, the database is the most common bottleneck. Missing indexes turn fast lookups into full table scans; a query that was fine at 10,000 rows crawls at 10,000,000. The first thing to check on a slow endpoint is usually the queries it issues and whether they use an index.
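
One way to check whether a query uses an index is to ask the database for its plan. The sketch below uses SQLite's EXPLAIN QUERY PLAN through Python's standard sqlite3 module; the posts table, its columns, and the index name idx_posts_author are made up for the example, and other databases expose the same idea through their own EXPLAIN syntax.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    # EXPLAIN QUERY PLAN returns rows whose last column is a readable plan step.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT)")
conn.executemany(
    "INSERT INTO posts (author_id, title) VALUES (?, ?)",
    [(i % 100, f"post {i}") for i in range(10_000)],
)

lookup = "SELECT title FROM posts WHERE author_id = ?"

# Without an index, SQLite has to walk the whole table.
plan_before = query_plan(conn, lookup, (42,))

conn.execute("CREATE INDEX idx_posts_author ON posts (author_id)")

# With the index, the plan becomes an index search.
plan_after = query_plan(conn, lookup, (42,))

if __name__ == "__main__":
    print("before:", plan_before)
    print("after: ", plan_after)
```

The exact wording of the plan text varies between SQLite versions, but the step reported before the index is a SCAN and the step after it is a SEARCH that names the index.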

The classic ORM trap: you fetch a list of N items, then issue one more query per item to load a related field — 1 query becomes N+1.

SELECT * FROM posts;                     -- 1 query, returns 100 posts
for each post:
    SELECT * FROM users WHERE id = ?;    -- +100 queries, one per post ✗

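The fix is to fetch in bulk, turning N+1 queries into one (a join) or two (a single WHERE id IN (...)). N+1 is nearly invisible in code review, because the per-item query hides behind an ordinary-looking attribute access inside a loop, and it is devastating under load. That is why the number of queries per request, not just the time per query, belongs in your monitoring.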
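
The pseudocode above can be made concrete with SQLite. This is a sketch under stated assumptions: the schema, the CountingConnection wrapper, and the function names are hypothetical, and the wrapper exists only to make the query count visible.

```python
import sqlite3


class CountingConnection:
    """Thin wrapper (hypothetical helper) that counts statements sent to the database."""

    def __init__(self, conn):
        self.conn = conn
        self.queries = 0

    def execute(self, sql, params=()):
        self.queries += 1
        return self.conn.execute(sql, params)


def make_db(n_posts=100):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", [(i, f"user{i}") for i in range(10)])
    conn.executemany(
        "INSERT INTO posts VALUES (?, ?, ?)",
        [(i, i % 10, f"post {i}") for i in range(n_posts)],
    )
    return CountingConnection(conn)


def authors_n_plus_one(db):
    posts = db.execute("SELECT id, user_id FROM posts").fetchall()
    result = {}
    for post_id, user_id in posts:
        # One extra round trip per post: this is the N in N+1.
        row = db.execute("SELECT name FROM users WHERE id = ?", (user_id,)).fetchone()
        result[post_id] = row[0]
    return result


def authors_bulk(db):
    posts = db.execute("SELECT id, user_id FROM posts").fetchall()
    user_ids = sorted({user_id for _, user_id in posts})
    placeholders = ",".join("?" * len(user_ids))
    # One query loads every needed user, however many posts there are.
    users = dict(
        db.execute(f"SELECT id, name FROM users WHERE id IN ({placeholders})", user_ids).fetchall()
    )
    return {post_id: users[user_id] for post_id, user_id in posts}


if __name__ == "__main__":
    slow, fast = make_db(), make_db()
    authors_n_plus_one(slow)
    authors_bulk(fast)
    print("N+1 queries: ", slow.queries)
    print("bulk queries:", fast.queries)
```

Both functions return the same mapping; only the number of statements differs, which is the figure worth tracking per request.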

When many requests need the same lock — a hot row, a global mutex, a coarse transaction — they serialize: they take turns instead of running in parallel. Adding more servers doesn’t help, because the bottleneck is the waiting, not the compute. The symptom is high saturation (lots of waiting) with low utilization (the resource isn’t even busy). The fix is finer-grained locking, shorter transactions, or removing the shared mutable state.
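
Finer-grained locking can be sketched as lock striping: instead of one mutex guarding everything, keys are spread across several locks so that updates to unrelated keys rarely wait on each other. The class names and the stripe count of 16 are assumptions for illustration. The sketch shows the structure only; it is not a benchmark, and in CPython the global interpreter lock limits how much pure-Python threads gain from it.

```python
import threading


class GlobalLockCounters:
    """Every update, for any key, waits on the same mutex."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = {}

    def increment(self, key):
        with self._lock:
            self._counts[key] = self._counts.get(key, 0) + 1

    def get(self, key):
        with self._lock:
            return self._counts.get(key, 0)


class StripedCounters:
    """Keys are spread over several locks, so unrelated keys rarely collide."""

    def __init__(self, stripes=16):
        self._locks = [threading.Lock() for _ in range(stripes)]
        self._counts = [{} for _ in range(stripes)]

    def _stripe(self, key):
        return hash(key) % len(self._locks)

    def increment(self, key):
        i = self._stripe(key)
        with self._locks[i]:
            self._counts[i][key] = self._counts[i].get(key, 0) + 1

    def get(self, key):
        i = self._stripe(key)
        with self._locks[i]:
            return self._counts[i].get(key, 0)


def hammer(counters, keys, per_thread=1000):
    """Run one thread per key, each incrementing only its own key."""
    def work(key):
        for _ in range(per_thread):
            counters.increment(key)

    threads = [threading.Thread(target=work, args=(k,)) for k in keys]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Both classes give the same answers; the difference is how many callers can be queued behind any single lock.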

A request that makes 30 sequential calls to other services or the database pays the network latency 30 times over, even if each call is fast. In a distributed system the network is often the real cost. The fixes: batch (one call for many items), parallelize independent calls, and move logic closer to the data so it doesn’t have to make the trip. Each round trip you remove is latency removed from every request.
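
The three strategies can be compared with a simulated network call. In this sketch the 5 ms LATENCY, the fetch_one and fetch_many functions, and the doubling "work" are all invented stand-ins; asyncio.sleep plays the role of the round trip.

```python
import asyncio
import time

LATENCY = 0.005  # simulated network round trip in seconds (an assumption)


async def fetch_one(item_id):
    await asyncio.sleep(LATENCY)  # one round trip
    return item_id * 2


async def fetch_many(item_ids):
    await asyncio.sleep(LATENCY)  # still one round trip, for the whole batch
    return [i * 2 for i in item_ids]


async def sequential(item_ids):
    # Each call waits for the previous one: latency is paid once per item.
    return [await fetch_one(i) for i in item_ids]


async def parallel(item_ids):
    # Independent calls overlap, so the waits happen at the same time.
    return list(await asyncio.gather(*(fetch_one(i) for i in item_ids)))


async def batched(item_ids):
    # One call carries all items.
    return await fetch_many(item_ids)


def timed(make_coro):
    start = time.perf_counter()
    result = asyncio.run(make_coro())
    return result, time.perf_counter() - start


if __name__ == "__main__":
    ids = list(range(30))
    for strategy in (sequential, parallel, batched):
        _, elapsed = timed(lambda: strategy(ids))
        print(f"{strategy.__name__:10s} {elapsed * 1000:.1f} ms")
```

With 30 items the sequential version pays the latency roughly 30 times, while the parallel and batched versions pay it about once. Parallelizing only applies when the calls do not depend on each other's results.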

┌─► 1. MEASURE (USE method, profiles, p99) ──┐
│   2. find the ONE binding constraint       │
│   3. relieve ONLY that constraint          │
└── 4. measure again (it moved) ─────────────┘

This is the whole discipline. Everything else in Part 3 — replicas, caches, sharding, statelessness — is a tool you reach for after step 2 tells you which one you need. What does this buy us, and what does it cost? The loop costs the patience to measure before acting; it buys you a system where every unit of complexity you add is paying for itself, because it was aimed at a constraint you proved was real.
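
The p99 in step 1 of the loop is a tail percentile, and it is worth computing rather than relying on the mean. The sketch below uses the nearest-rank definition of a percentile (other definitions interpolate between samples); the latency numbers are hypothetical.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in milliseconds: 98 fast requests and 2 slow ones.
latencies_ms = [10] * 98 + [500] * 2

if __name__ == "__main__":
    print("mean:", statistics.mean(latencies_ms))
    print("p50: ", percentile(latencies_ms, 50))
    print("p99: ", percentile(latencies_ms, 99))
```

The mean and the median both look healthy here while the p99 exposes the slow requests, so comparing the same percentile before and after a change is a reasonable way to confirm that step 3 actually relieved the constraint.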

  1. Why is optimizing the wrong thing worse than doing nothing at all?
  2. State the three questions of the USE method and what each reveals.
  3. Why can lock contention show high saturation but low utilization, and why doesn’t adding servers fix it?
  4. Describe an N+1 query and the fix. Why is it nearly invisible in code review?
  5. Why is finding bottlenecks a loop rather than a one-time step, and how do you know when to stop?