Availability, SLAs & the Nines
“We offer four nines of availability.” It sounds precise and reassuring, and it is one of the most misunderstood claims in the industry. This page makes it concrete: what a nine buys you in real downtime, the three acronyms people mix up (SLA, SLO, SLI), and — most importantly — the simple arithmetic that tells you whether adding components makes you more or less reliable.
Availability, defined
Availability is the fraction of time a system is working and able to serve requests:
availability = uptime / (uptime + downtime)
We almost always express it as a percentage, and we count the nines. Engineers obsess over an extra nine because each one cuts the allowed downtime by a factor of ten, and each is commonly reckoned to be about ten times harder to achieve than the last.
| Availability | “Nines” | Downtime / year | Downtime / day |
|---|---|---|---|
| 99% | two | ~3.65 days | ~14 min |
| 99.9% | three | ~8.8 hours | ~1.4 min |
| 99.99% | four | ~52 minutes | ~8.6 sec |
| 99.999% | five | ~5.3 minutes | ~0.86 sec |
The jump from two nines to five nines is the difference between “a coffee break of downtime per day” and “you blinked and missed it.” Five nines is extraordinarily expensive: it usually rules out any process that needs a human to wake up and respond, because a single page-and-respond cycle can easily use up the entire budget of about 5 minutes per year.
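The table's figures can be reproduced with a few lines of arithmetic. The following is a minimal Python sketch, assuming a 365-day year (leap years ignored); the function name `downtime_budget` and its dictionary keys are illustrative choices, not part of any library.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 (assumption: no leap year)
SECONDS_PER_DAY = 24 * 60 * 60    # 86,400


def downtime_budget(availability: float) -> dict:
    """Allowed downtime for an availability given as a fraction (0.999 = 99.9%)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    unavailable = 1.0 - availability  # fraction of time the system may be down
    return {
        "minutes_per_year": unavailable * MINUTES_PER_YEAR,
        "seconds_per_day": unavailable * SECONDS_PER_DAY,
    }


if __name__ == "__main__":
    levels = [("two", 0.99), ("three", 0.999), ("four", 0.9999), ("five", 0.99999)]
    for label, a in levels:
        budget = downtime_budget(a)
        print(
            f"{label:>5} nines: "
            f"{budget['minutes_per_year']:9.1f} min/year, "
            f"{budget['seconds_per_day']:8.2f} s/day"
        )
```

Working in fractions rather than percentages keeps the arithmetic simple: the downtime budget is just the unavailable fraction times the length of the period.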
SLA vs SLO vs SLI — three different things
These three are constantly conflated. They form a chain from measurement to promise:
SLI → what you MEASURE (an actual number: "99.97% of requests succeeded")
SLO → what you TARGET (an internal goal: "we aim for 99.95%")
SLA → what you PROMISE (a contract with penalties: "99.9%, or you get a refund")
- SLI: Service Level Indicator. A real, measured quantity: success rate, p99 latency, uptime. This is reality.
- SLO: Service Level Objective. Your internal target for an SLI. This is your goal, usually set tighter than what you promise customers, so you have room to miss it without breaking a contract.
- SLA: Service Level Agreement. A contractual promise to customers, with consequences (refunds, credits) if you break it. This is legal/business.
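To make the chain from measurement to promise concrete, here is a small Python sketch that computes a success-rate SLI from request counts and checks it against an SLO and an SLA. The thresholds reuse the example numbers above (99.95% and 99.9%); the function names `success_rate_sli` and `classify` and the request counts are hypothetical, chosen only for illustration.

```python
def success_rate_sli(successful: int, total: int) -> float:
    """SLI: the measured fraction of requests that succeeded."""
    if total <= 0:
        raise ValueError("total must be positive")
    return successful / total


def classify(sli: float, slo: float, sla: float) -> str:
    """Compare a measured SLI against the internal target and the contract."""
    if slo < sla:
        raise ValueError("the SLO is expected to be at least as strict as the SLA")
    if sli >= slo:
        return "meeting SLO"
    if sli >= sla:
        return "missed SLO, SLA still intact"
    return "SLA breached"


if __name__ == "__main__":
    SLO = 0.9995  # internal goal (example value from the text)
    SLA = 0.999   # contractual promise (example value from the text)
    sli = success_rate_sli(successful=999_700, total=1_000_000)  # made-up counts
    print(f"SLI = {sli:.2%}: {classify(sli, SLO, SLA)}")
```

The middle outcome is the reason the SLO is set tighter than the SLA: a missed internal target is a warning, not yet a broken contract.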
The math that actually matters: series vs parallel
Here is the part that changes how you design. When you compose components, their availabilities combine, and the rule depends on whether they’re in series or in parallel.
Components in series → multiply (it gets worse)
If your request must pass through several components and all of them must work, the availabilities multiply:
A_total = A_1 × A_2 × A_3 × ...
web server (99.9%) → app (99.9%) → database (99.9%)
0.999 × 0.999 × 0.999 ≈ 0.997 → only 99.7% (~26 hours/year down!)
Three “three-nines” components in a row give you less than three nines together. This is brutal and non-obvious: every dependency you add on the critical path drags your availability down. A chain is exactly as weak as the product of its links. This is why minimizing hard dependencies, and degrading gracefully when a non-critical one is down, is a core reliability skill.
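The series rule is a one-line product. The sketch below, in Python, reproduces the three-component example; `series_availability` is an illustrative name, and the 365-day year is an assumption.

```python
import math

HOURS_PER_YEAR = 365 * 24  # 8,760 (assumption: no leap year)


def series_availability(*components: float) -> float:
    """All components must work, so their availabilities multiply."""
    return math.prod(components)


if __name__ == "__main__":
    # web server -> app -> database, each at 99.9%
    chain = series_availability(0.999, 0.999, 0.999)
    downtime_hours = (1.0 - chain) * HOURS_PER_YEAR
    print(f"combined availability: {chain:.4%}")
    print(f"yearly downtime: {downtime_hours:.1f} hours")
```

Adding a fourth argument shows the trend directly: every extra component on the critical path multiplies in another factor below 1.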
Components in parallel/redundant → it gets better
If you have redundant components and the system works as long as at least one works, you compute the probability that they all fail and subtract from 1:
A_total = 1 − (1 − A_1) × (1 − A_2) × ...
two database replicas, each 99% (1% fail chance each):
1 − (0.01 × 0.01) = 1 − 0.0001 = 0.9999 → 99.99%
Two unreliable (99%) replicas, run in parallel, become a four-nines pair, because both have to be down at the same instant for you to be down. The formula assumes the replicas fail independently of each other. Redundancy multiplies failure probabilities, which are small, so the product is tiny. This is the mathematical heart of every redundancy strategy in this textbook; the mechanics of doing it for real are covered in Redundancy & Failover.
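The parallel rule can be sketched the same way. This Python example assumes the replicas fail independently, which is the assumption built into the formula; `parallel_availability` is an illustrative name.

```python
import math


def parallel_availability(*replicas: float) -> float:
    """One working replica is enough, so the system is down only if all fail.

    Assumes failures are independent; correlated failures make the
    real figure worse than this estimate.
    """
    all_fail = math.prod(1.0 - a for a in replicas)  # product of failure chances
    return 1.0 - all_fail


if __name__ == "__main__":
    pair = parallel_availability(0.99, 0.99)
    trio = parallel_availability(0.99, 0.99, 0.99)
    print(f"two replicas at 99%:   {pair:.4%}")
    print(f"three replicas at 99%: {trio:.4%}")
```

Each extra independent replica multiplies in another small failure probability, which is why the combined number climbs so quickly.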
| SERIES (all must work) | PARALLEL (one is enough) |
|---|---|
| multiply availabilities | multiply the failure chances |
| → number goes DOWN | → number goes UP |
| add deps = more fragile | add redundancy = more robust |
The thread
What do the nines buy us, and what do they cost? Each additional nine buys real, measurable uptime, and costs roughly 10× more in redundancy, automation, testing, and operational discipline than the nine before it. The series/parallel math is the lever: you gain availability by adding redundant copies (parallel) and you lose it with every dependency on the critical path (series). Good reliability engineering is mostly the discipline of keeping the critical path short and the redundant paths truly independent, and being honest that five nines is a price most systems neither need nor can afford. Next, the deepest trade-off of all: what redundancy costs you in consistency when the network splits, in The CAP Theorem.
Check your understanding
- A year is ~526,000 minutes. Without the table, compute roughly how much yearly downtime “three nines” (99.9%) allows.
- Distinguish SLI, SLO, and SLA. Why is a sensible SLO set stricter than the SLA, and what is the gap called?
- You chain four components, each at 99.9%, all on the critical path. Is the result better or worse than 99.9%, and roughly what? State the rule.
- You put two 99% components in parallel (one is enough). What’s the combined availability, and what formula did you use?
- Name two reasons the parallel/redundancy math can lie in practice, and connect them to “what does it buy us, what does it cost?”