Circuit Breakers & Bulkheads

The previous page ended on a warning: when a dependency is down, retrying into it can be what keeps it down. There’s a deeper version of that insight. When a downstream service is failing, the correct thing to do is often to stop calling it entirely: no retries, not even a first attempt. Every call you make to a dead service wastes a thread waiting for a timeout, holds a connection, and adds load to something that’s already on fire. The healthiest response to a sick dependency is to leave it alone until it recovers.

That’s the circuit breaker. Its companion, the bulkhead, attacks the same cascade from a different angle: instead of stopping bad calls, it contains them so they can only damage a corner of your system. Both are named after physical safety devices, and the metaphors are exact.

The circuit breaker: fail fast, give the patient room to heal

An electrical circuit breaker trips and cuts the current when it detects a fault, protecting the wiring from melting. A software circuit breaker does the same for service calls: it sits between you and a dependency, watches the success/failure rate, and trips when failures cross a threshold. For a cooldown period after that, calls to the dependency fail instantly without being attempted.

┌─────────┐                             ┌───────────┐
│ CLOSED  │  failures exceed threshold  │   OPEN    │
│ calls   │ ──────────────────────────► │ calls     │
│ pass    │                             │ fail      │
│ through │                             │ instantly │
└─────────┘                             └───────────┘
     ▲                                     │     ▲
     │                    cooldown elapses │     │ trial fails
     │                                     ▼     │
     │                                  ┌───────────┐
     │                                  │ HALF-OPEN │
     └──── trial call succeeds ─────────│ let ONE   │
                                        │ trial     │
                                        │ through   │
                                        └───────────┘

  • Closed — normal operation. Calls flow through; the breaker just counts outcomes. If the failure rate stays low, it does nothing. (Confusingly, “closed” means working — it follows the electrical metaphor: a closed circuit conducts.)
  • Open — tripped. The dependency is judged unhealthy, so every call fails immediately without touching the network. No threads wait on timeouts; no load is added to the struggling service. A cooldown timer runs.
  • Half-open — after the cooldown, the breaker cautiously lets one (or a few) trial calls through. If they succeed, the dependency has recovered → back to closed. If they fail, it’s still sick → back to open for another cooldown.
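The three states and their transitions can be sketched as a small state machine. This is a minimal, single-threaded illustration under stated assumptions, not a production implementation: the names `CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, and `cooldown_seconds` are hypothetical, and the breaker trips on consecutive failures rather than on a failure rate to keep the sketch short.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half-open"


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures that trip the breaker
        self.cooldown_seconds = cooldown_seconds    # how long to stay open before a trial call
        self.clock = clock                          # injectable so tests can control time
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.cooldown_seconds:
                # Still cooling down: fail fast, never touch the dependency.
                raise CircuitOpenError("dependency judged unhealthy; failing fast")
            self.state = State.HALF_OPEN  # cooldown over: let one trial call through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # Any success (including a half-open trial) closes the breaker.
        self.failures = 0
        self.state = State.CLOSED

    def _on_failure(self):
        self.failures += 1
        # A failed trial reopens immediately; otherwise wait for the threshold.
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()
```

A caller would wrap each dependency call, for example `breaker.call(fetch_profile, user_id)` with a hypothetical `fetch_profile`, and treat `CircuitOpenError` as the cue to return a fast error or a fallback. A concurrent version would also need a lock around the state changes and a way to limit how many trial calls run at once.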

Two big wins. First, it breaks the cascade: a dead dependency stops consuming your threads and connections, so your service stays healthy and responsive (it can return a fast error or a fallback) instead of hanging and dragging its callers down. Second, it lets the sick service recover by removing the firehose of doomed requests — the open state is a deliberate gift of breathing room.

What does a circuit breaker buy us, and what does it cost? It buys fast failure and cascade containment — you stop pouring effort into a hole. It costs you availability during false trips: a breaker that trips too eagerly will cut off a dependency that was only briefly slow, turning a minor degradation into a hard “service unavailable.” Tuning the threshold, the window, and the cooldown is a real trade-off, and an over-sensitive breaker is its own kind of outage.
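To make the tuning knobs concrete, here is a hedged sketch of a trip decision based on a sliding window of recent outcomes. The class `FailureRateWindow` and its parameters `window_size`, `rate_threshold`, and `min_calls` are assumptions chosen for illustration. The minimum-call guard is one common way to reduce false trips: a couple of unlucky calls on a quiet dependency should not open the breaker.

```python
from collections import deque


class FailureRateWindow:
    """Tracks the last `window_size` call outcomes and decides whether to trip."""

    def __init__(self, window_size=20, rate_threshold=0.5, min_calls=10):
        self.outcomes = deque(maxlen=window_size)  # True = failure, False = success
        self.rate_threshold = rate_threshold       # trip at or above this failure rate
        self.min_calls = min_calls                 # ignore the rate until enough samples exist

    def record(self, failed):
        self.outcomes.append(bool(failed))

    def failure_rate(self):
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_trip(self):
        # Too few samples: do not trust the rate yet.
        if len(self.outcomes) < self.min_calls:
            return False
        return self.failure_rate() >= self.rate_threshold
```

A smaller window or a lower threshold reacts faster but trips on brief slowness; a larger window or a higher threshold tolerates blips but keeps sending calls into a failing dependency for longer.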

The bulkhead: isolate so one leak can’t sink the ship

A ship’s hull is divided into watertight bulkhead compartments. Puncture one and that compartment floods — but the others stay dry and the ship stays afloat. Without bulkheads, a single breach floods the whole hull. The software pattern is identical: partition your resources so that exhausting one pool can’t starve the others.

The classic failure this prevents: your service calls dependencies A, B, and C, all sharing one thread pool of 100 threads. Dependency C gets slow. Calls to C hang and pile up, and within seconds all 100 threads are stuck waiting on C. Now calls to the perfectly healthy A and B can’t get a thread either. C’s problem has taken down A and B through shared resource exhaustion — a cascade through a shared pool.
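The cascade can be reproduced in a few lines. This sketch assumes a shared `ThreadPoolExecutor` scaled down to 4 workers in place of the pool of 100; `call_a` and `call_c` are hypothetical stand-ins for calls to dependencies A and C, and an `Event` simulates C hanging.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError

unblock_c = threading.Event()


def call_c():
    unblock_c.wait(timeout=10)  # stands in for a hung call to dependency C
    return "C"


def call_a():
    return "A"  # healthy and instant


shared = ThreadPoolExecutor(max_workers=4)  # small stand-in for the pool of 100
c_futures = [shared.submit(call_c) for _ in range(4)]  # C occupies every worker
a_future = shared.submit(call_a)  # queued behind C: no worker is free

try:
    a_future.result(timeout=0.2)
    a_starved = False
except TimeoutError:
    a_starved = True  # A is healthy, yet its call could not even start

unblock_c.set()  # C "recovers" and the workers free up
a_result = a_future.result(timeout=5)
shared.shutdown()
```

The call to A does nothing wrong and still times out, because every worker is parked on C.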

The bulkhead fix: give each dependency its own isolated pool.

WITHOUT BULKHEADS            WITH BULKHEADS
┌─────────────────────┐      ┌──────┐ ┌──────┐ ┌──────┐
│ shared pool (100)   │      │ A:30 │ │ B:30 │ │ C:40 │
│ C floods all 100    │      │  ok  │ │  ok  │ │FLOOD │
│ → A, B starve too   │      └──────┘ └──────┘ └──────┘
└─────────────────────┘      C drowns alone; A & B serve on

Now when C floods, it can consume at most its own 40 threads. A and B keep their pools and keep serving. The blast radius of C’s failure is contained to C.
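A minimal sketch of that isolation, assuming one `ThreadPoolExecutor` per dependency with sizes scaled down from the 30/30/40 split. The `pools` dictionary, the `submit` helper, and the stand-in functions `call_a` and `call_c` are hypothetical names introduced for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# One executor per dependency; sizes scaled down from the 30/30/40 split.
pools = {
    "A": ThreadPoolExecutor(max_workers=3, thread_name_prefix="dep-A"),
    "B": ThreadPoolExecutor(max_workers=3, thread_name_prefix="dep-B"),
    "C": ThreadPoolExecutor(max_workers=4, thread_name_prefix="dep-C"),
}


def submit(dependency, func, *args):
    """Run func on the pool reserved for this dependency."""
    return pools[dependency].submit(func, *args)


unblock_c = threading.Event()


def call_c():
    unblock_c.wait(timeout=10)  # stands in for a hung call to dependency C
    return "C"


def call_a():
    return "A"  # healthy and instant


c_futures = [submit("C", call_c) for _ in range(10)]  # floods only C's pool
a_result = submit("A", call_a).result(timeout=1)      # A still has free workers

unblock_c.set()  # let the simulated C calls finish
for pool in pools.values():
    pool.shutdown()
```

Note that each executor here still has an unbounded queue, so excess calls to C wait rather than fail. A fuller design would likely bound that queue or reject calls once the compartment is full.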

What does this buy us, and what does it cost? It buys fault isolation — one bad dependency can no longer starve the rest. It costs you utilization and flexibility: partitioned pools can’t share spare capacity, so under normal load you may have idle threads in A’s pool while B’s pool is momentarily saturated, where a shared pool would have lent them over. You trade some efficiency for the guarantee that a local failure stays local. Bulkheads also exist at coarser grains — separate connection pools, separate processes, separate clusters, even separate accounts — each a stronger (and more expensive) wall.
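The utilization cost shows up in simple arithmetic. The sketch below uses the 30/30/40 pool sizes from the diagram and a made-up demand snapshot in which B is briefly busy while A is mostly idle; the `served` helper and the demand figures are assumptions for illustration only.

```python
def served(demand, capacity):
    """Requests served and rejected when each pool is capped independently."""
    ok = {name: min(demand[name], capacity[name]) for name in demand}
    rejected = {name: demand[name] - ok[name] for name in demand}
    return ok, rejected


demand = {"A": 10, "B": 45, "C": 20}       # hypothetical concurrent demand
partitioned = {"A": 30, "B": 30, "C": 40}  # pool sizes from the diagram

ok, rejected = served(demand, partitioned)

shared_total = 100
shared_rejected = max(0, sum(demand.values()) - shared_total)

print(rejected)         # {'A': 0, 'B': 15, 'C': 0}
print(shared_rejected)  # 0
```

Total demand is 75, well under 100, so a shared pool serves everything. The partitioned layout turns away 15 of B's calls while 20 of A's threads and 20 of C's sit idle.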

Circuit breakers and bulkheads are complementary, not competing. The bulkhead ensures a failing dependency can only consume its slice of resources — it caps the blast radius spatially. The circuit breaker ensures that once that slice is clearly failing, you stop spending it on doomed calls and let the dependency recover — it caps the damage temporally. Together with timeouts and bounded retries, they form the standard toolkit for keeping a single sick component from becoming a system-wide outage. The recurring thread holds: each wall you build buys isolation and survival, and charges you in efficiency, idle capacity, and the risk of a misconfigured threshold cutting off something that was actually fine.

  1. Why is “stop calling a failing dependency entirely” often healthier than retrying it? What two problems does that solve at once?
  2. Walk through the three circuit-breaker states and the transitions between them. Why is the half-open state necessary rather than going straight from open to closed?
  3. What is the cost of a circuit breaker tuned too aggressively, and how does it become its own outage?
  4. Describe the shared-thread-pool cascade a bulkhead prevents, step by step.
  5. Bulkheads improve isolation but hurt one specific metric. Which one, and why is that the price of containment?