Circuit Breakers & Bulkheads

The previous page ended on a warning: when a dependency is down, retrying into it can be what keeps it down. There’s a deeper version of that insight. When a downstream service is failing, the correct thing to do is often to stop calling it entirely: no retries, not even a first attempt. Every call you make to a dead service ties up a thread until its timeout fires, holds a connection, and adds load to something that’s already on fire. The healthiest response to a sick dependency is to leave it alone until it recovers.

That’s the circuit breaker. Its companion, the bulkhead, attacks the same cascade from a different angle: instead of stopping bad calls, it contains them so they can only damage a corner of your system. Both are named after physical safety devices, and the metaphors carry over closely.

The circuit breaker: fail fast, give the patient room to heal

An electrical circuit breaker trips and cuts the current when it detects a fault, protecting the wiring from melting. A software circuit breaker does the same for service calls: it sits between you and a dependency, watches the success/failure rate, and when failures cross a threshold it trips — for a while, calls to that dependency fail instantly without even being attempted.

The three states

                failures exceed threshold
        ┌────────────────────────────────────────┐
        │                                        ▼
   ┌─────────┐                              ┌──────────┐
   │ CLOSED  │                              │   OPEN   │
   │ calls   │                              │ calls    │
   │ pass    │                              │ fail     │
   │ through │                              │ instantly│
   └─────────┘                              └──────────┘
        ▲                                      │     ▲
        │                       cooldown timer │     │ trial call
        │                              expires │     │ fails
        │                                      ▼     │
        │                                  ┌────────────┐
        └────── trial call succeeds ───────│ HALF-OPEN  │
                                           │ let ONE    │
                                           │ trial call │
                                           │ through    │
                                           └────────────┘

Closed — normal operation. Calls flow through; the breaker just counts outcomes. If the failure rate stays low, it does nothing. (Confusingly, “closed” means working — it follows the electrical metaphor: a closed circuit conducts.)
Open — tripped. The dependency is judged unhealthy, so every call fails immediately without touching the network. No threads wait on timeouts; no load is added to the struggling service. A cooldown timer runs.
Half-open — after the cooldown, the breaker cautiously lets one (or a few) trial calls through. If they succeed, the dependency has recovered → back to closed. If they fail, it’s still sick → back to open for another cooldown.
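
The three states above can be sketched as a small state machine. This is a minimal illustration, not a production implementation: the names (CircuitBreaker, CircuitOpenError, failure_threshold, cooldown_seconds, clock) are assumptions made for the example, it counts consecutive failures rather than a failure rate, and it is not thread-safe, so "one trial call" only holds for single-threaded use.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half-open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected without being attempted."""


class CircuitBreaker:
    # Hypothetical parameters: trip after N consecutive failures,
    # then stay open for cooldown_seconds before allowing a trial call.
    def __init__(self, failure_threshold=3, cooldown_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock  # injectable so the cooldown can be tested without sleeping
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.cooldown_seconds:
                # Open: fail instantly, never touching the dependency.
                raise CircuitOpenError("dependency judged unhealthy; failing fast")
            # Cooldown elapsed: let this call through as the trial.
            self.state = State.HALF_OPEN
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # A success in any state (including the half-open trial) closes the breaker.
        self.failures = 0
        self.state = State.CLOSED

    def _on_failure(self):
        self.failures += 1
        # A failed trial reopens immediately; otherwise trip at the threshold.
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()
```

A caller would wrap each outbound request, for example breaker.call(fetch_profile, user_id) where fetch_profile is whatever function talks to the dependency, and treat CircuitOpenError as the cue to return a fast error or a fallback.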

Why this is so valuable

Two big wins. First, it breaks the cascade: a dead dependency stops consuming your threads and connections, so your service stays healthy and responsive (it can return a fast error or a fallback) instead of hanging and dragging its callers down. Second, it lets the sick service recover by removing the firehose of doomed requests — the open state is a deliberate gift of breathing room.

What does a circuit breaker buy us, and what does it cost? It buys fast failure and cascade containment — you stop pouring effort into a hole. It costs you availability during false trips: a breaker that trips too eagerly will cut off a dependency that was only briefly slow, turning a minor degradation into a hard “service unavailable.” Tuning the threshold, the window, and the cooldown is a real trade-off, and an over-sensitive breaker is its own kind of outage.
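
To make the tuning trade-off concrete, here is a rough sketch of a failure-rate window. The class name and parameters are hypothetical; min_calls in particular is an extra knob assumed for this example (do not trip until enough calls have been observed), and the cooldown is left out to keep the focus on threshold and window.

```python
from collections import deque


class FailureRateWindow:
    def __init__(self, window_size=20, failure_rate_threshold=0.5, min_calls=10):
        # Keep only the most recent window_size outcomes; True means the call failed.
        self.outcomes = deque(maxlen=window_size)
        self.failure_rate_threshold = failure_rate_threshold
        self.min_calls = min_calls

    def record(self, failed):
        self.outcomes.append(bool(failed))

    def failure_rate(self):
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_trip(self):
        # Too few samples: one unlucky call should not decide the dependency's fate.
        if len(self.outcomes) < self.min_calls:
            return False
        return self.failure_rate() >= self.failure_rate_threshold


# An over-eager configuration versus a more tolerant one, fed the same brief blip:
eager = FailureRateWindow(window_size=2, failure_rate_threshold=0.5, min_calls=1)
tolerant = FailureRateWindow(window_size=20, failure_rate_threshold=0.5, min_calls=10)

blip = [False] * 8 + [True] * 2  # eight successes, then two failures
for failed in blip:
    eager.record(failed)
    tolerant.record(failed)

print(eager.should_trip())     # True: a 2-call window sees only the two failures
print(tolerant.should_trip())  # False: 2 failures out of 10 is a 20% failure rate
```

The eager breaker converts two slow calls into a hard cutoff, which is the false trip described above. The tolerant one rides out the blip but will take longer to react to a real outage, so neither setting is free.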

The bulkhead: isolate so one leak can’t sink the ship

A ship’s hull is divided into watertight bulkhead compartments. Puncture one and that compartment floods — but the others stay dry and the ship stays afloat. Without bulkheads, a single breach floods the whole hull. The software pattern is identical: partition your resources so that exhausting one pool can’t starve the others.

The classic failure this prevents: your service calls dependencies A, B, and C, all sharing one thread pool of 100 threads. Dependency C gets slow. Calls to C hang and pile up, and within seconds all 100 threads are stuck waiting on C. Now calls to the perfectly healthy A and B can’t get a thread either. C’s problem has taken down A and B through shared resource exhaustion — a cascade through a shared pool.

The bulkhead fix: give each dependency its own isolated pool.

WITHOUT BULKHEADS                  WITH BULKHEADS
┌─────────────────────┐           ┌──────┐ ┌──────┐ ┌──────┐
│  shared pool (100)  │           │ A:30 │ │ B:30 │ │ C:40 │
│  C floods all 100   │           │  ok  │ │  ok  │ │FLOOD │
│  → A, B starve too  │           └──────┘ └──────┘ └──────┘
└─────────────────────┘            C drowns alone; A & B serve on

Now when C floods, it can consume at most its own 40 threads. A and B keep their pools and keep serving. The blast radius of C’s failure is contained to C.
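
One way to sketch this isolation in code is shown below. Note the substitution: instead of three separate thread pools, this example caps concurrent calls per dependency with a semaphore, which gives the same containment with less machinery. The names Bulkhead, BulkheadFullError, and call_dependency are made up for the illustration; the limits of 30, 30, and 40 come from the diagram.

```python
import threading
from contextlib import contextmanager


class BulkheadFullError(Exception):
    """Raised when a dependency's compartment has no free slots."""


class Bulkhead:
    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def slot(self):
        # Non-blocking acquire: reject at once rather than queue behind a slow dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} bulkhead is full")
        try:
            yield
        finally:
            self._slots.release()


# One compartment per dependency, sized as in the diagram.
bulkheads = {
    "A": Bulkhead("A", 30),
    "B": Bulkhead("B", 30),
    "C": Bulkhead("C", 40),
}


def call_dependency(name, func, *args, **kwargs):
    # The call runs only if its own compartment has room; other compartments are untouched.
    with bulkheads[name].slot():
        return func(*args, **kwargs)
```

If 40 calls to C are hanging, the 41st raises BulkheadFullError immediately, while calls to A and B still find free slots in their own compartments.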

What bulkheads cost

What does this buy us, and what does it cost? It buys fault isolation — one bad dependency can no longer starve the rest. It costs you utilization and flexibility: partitioned pools can’t share spare capacity, so under normal load you may have idle threads in A’s pool while B’s pool is momentarily saturated, where a shared pool would have lent them over. You trade some efficiency for the guarantee that a local failure stays local. Bulkheads also exist at coarser grains — separate connection pools, separate processes, separate clusters, even separate accounts — each a stronger (and more expensive) wall.
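
The utilization cost can be seen with a little arithmetic. The demand figures below are invented for illustration; the pool sizes are the 30/30/40 split used earlier.

```python
limits = {"A": 30, "B": 30, "C": 40}  # partitioned pool sizes
demand = {"A": 10, "B": 35, "C": 20}  # hypothetical concurrent calls at one instant


def served_partitioned(demand, limits):
    # Each dependency can use at most its own pool.
    return {dep: min(demand[dep], limits[dep]) for dep in limits}


def served_shared(demand, total_capacity):
    # A shared pool serves everything as long as total demand fits.
    return min(sum(demand.values()), total_capacity)


partitioned = served_partitioned(demand, limits)
idle = {dep: limits[dep] - partitioned[dep] for dep in limits}

print(sum(partitioned.values()))                    # 60: B is capped at 30, 5 calls must wait or fail
print(served_shared(demand, sum(limits.values())))  # 65: the shared pool absorbs B's burst
print(idle)                                         # {'A': 20, 'B': 0, 'C': 20}
```

Forty threads sit idle in A's and C's pools while five of B's calls go unserved. That gap is the price paid for knowing that a flood in one pool cannot drain the others.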

How they work together

Circuit breakers and bulkheads are complementary, not competing. The bulkhead ensures a failing dependency can only consume its slice of resources — it caps the blast radius spatially. The circuit breaker ensures that once that slice is clearly failing, you stop spending it on doomed calls and let the dependency recover — it caps the damage temporally. Together with timeouts and bounded retries, they form the standard toolkit for keeping a single sick component from becoming a system-wide outage. The recurring thread holds: each wall you build buys isolation and survival, and charges you in efficiency, idle capacity, and the risk of a misconfigured threshold cutting off something that was actually fine.
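
As a rough sketch of how the two compose around a single dependency, the wrapper below checks the breaker first (the temporal cap) and then claims a bulkhead slot (the spatial cap). Everything here is an assumption for illustration: the names GuardedDependency and Rejected, the consecutive-failure trip rule, and a simplified half-open phase that may admit more than one trial call under concurrency. The wrapped function is expected to enforce its own timeout.

```python
import threading
import time


class Rejected(Exception):
    """Call was refused locally; the dependency was never contacted."""


class GuardedDependency:
    def __init__(self, max_concurrent, failure_threshold, cooldown_seconds,
                 clock=time.monotonic):
        self._slots = threading.BoundedSemaphore(max_concurrent)  # bulkhead
        self._lock = threading.Lock()  # protects the breaker fields below
        self._failure_threshold = failure_threshold
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the breaker is closed

    def _breaker_allows(self):
        with self._lock:
            if self._opened_at is None:
                return True
            # Once the cooldown has elapsed, calls pass through as trials.
            return self._clock() - self._opened_at >= self._cooldown

    def _record(self, ok):
        with self._lock:
            if ok:
                self._failures = 0
                self._opened_at = None  # success (including a trial) closes the breaker
                return
            self._failures += 1
            # A failed trial restarts the cooldown; otherwise trip at the threshold.
            if self._opened_at is not None or self._failures >= self._failure_threshold:
                self._opened_at = self._clock()

    def call(self, func, *args, **kwargs):
        if not self._breaker_allows():
            raise Rejected("breaker open")
        if not self._slots.acquire(blocking=False):
            raise Rejected("bulkhead full")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record(ok=False)
            raise
        else:
            self._record(ok=True)
            return result
        finally:
            self._slots.release()
```

Both rejections happen before any network work, so a caller can catch Rejected and serve a fallback without spending a thread on the sick dependency.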

Check your understanding

Why is “stop calling a failing dependency entirely” often healthier than retrying it? What two problems does that solve at once?
Walk through the three circuit-breaker states and the transitions between them. Why is the half-open state necessary rather than going straight from open to closed?
What is the cost of a circuit breaker tuned too aggressively, and how does it become its own outage?
Describe the shared-thread-pool cascade a bulkhead prevents, step by step.
Bulkheads improve isolation but hurt one specific metric. Which one, and why is that the price of containment?