Availability, SLAs & the Nines

“We offer four nines of availability.” It sounds precise and reassuring, and it is one of the most misunderstood claims in the industry. This page makes it concrete: what a nine buys you in real downtime, the three acronyms people mix up (SLA, SLO, SLI), and — most importantly — the simple arithmetic that tells you whether adding components makes you more or less reliable.

Availability, defined

Availability is the fraction of time a system is working and able to serve requests:

                 uptime
   availability = ------------------ = uptime / (uptime + downtime)
                  uptime + downtime

We almost always express it as a percentage, and we count the nines. The reason engineers obsess over an extra nine is that each one is ten times harder than the last — and the difference in allowed downtime is enormous.

Availability	”Nines”	Downtime / year	Downtime / day
99%	two	~3.65 days	~14 min
99.9%	three	~8.8 hours	~1.4 min
99.99%	four	~52 minutes	~8.6 sec
99.999%	five	~5.3 minutes	~0.86 sec

The jump from three nines to five nines is the difference between “a coffee break of downtime per day” and “you blinked and missed it.” Five nines is extraordinarily expensive — it usually rules out any process that needs a human to wake up and respond, because humans are slower than 5 minutes/year.

SLA vs SLO vs SLI — three different things

These three are constantly conflated. They form a chain from measurement to promise:

   SLI  →  what you MEASURE      (an actual number: "99.97% of requests succeeded")
   SLO  →  what you TARGET       (an internal goal:  "we aim for 99.95%")
   SLA  →  what you PROMISE      (a contract w/ penalties: "99.9%, or you get a refund")

SLI — Service Level Indicator. A real, measured quantity: success rate, p99 latency, uptime. This is reality.
SLO — Service Level Objective. Your internal target for an SLI. This is your goal, usually set tighter than what you promise customers, so you have room to miss it without breaking a contract.
SLA — Service Level Agreement. A contractual promise to customers, with consequences (refunds, credits) if you break it. This is legal/business.

The math that actually matters: series vs parallel

Here is the part that changes how you design. When you compose components, their availabilities combine — and the rule depends on whether they’re in series or in parallel.

Components in series → multiply (it gets worse)

If your request must pass through several components and all of them must work, the availabilities multiply:

   A_total = A_1 × A_2 × A_3 × ...

   web server (99.9%) → app (99.9%) → database (99.9%)
   0.999 × 0.999 × 0.999 ≈ 0.997  →  only 99.7%  (~26 hours/year down!)

Three “three-nines” components in a row give you less than three nines together. This is brutal and non-obvious: every dependency you add on the critical path drags your availability down. A chain is exactly as weak as the product of its links. This is why minimizing hard dependencies — and degrading gracefully when a non-critical one is down — is a core reliability skill.

Components in parallel/redundant → it gets better

If you have redundant components and the system works as long as at least one works, you compute the probability that they all fail and subtract from 1:

   A_total = 1 − (1 − A_1) × (1 − A_2) × ...

   two database replicas, each 99% (1% fail chance each):
   1 − (0.01 × 0.01) = 1 − 0.0001 = 0.9999  →  99.99%

Two unreliable (99%) replicas, run in parallel, become a four-nines pair — because both have to be down at the same instant for you to be down. Redundancy multiplies failure probabilities, which are small, so the product is tiny. This is the mathematical heart of every redundancy strategy in this textbook; the mechanics of doing it for real are covered in Redundancy & Failover.

   SERIES (all must work)        PARALLEL (one is enough)
   --------------------          ------------------------
   multiply availabilities       multiply the failure chances
   → number goes DOWN            → number goes UP
   add deps = more fragile       add redundancy = more robust

The thread

What do the nines buy us, and what do they cost? Each additional nine buys real, measurable uptime — and costs roughly 10× more in redundancy, automation, testing, and operational discipline than the nine before it. The series/parallel math is the lever: you gain availability by adding redundant copies (parallel) and you lose it with every dependency on the critical path (series). Good reliability engineering is mostly the discipline of keeping the critical path short and the redundant paths truly independent — and being honest that five nines is a price most systems neither need nor can afford. Next, the deepest trade-off of all: what redundancy costs you in consistency when the network splits, in The CAP Theorem.

Check your understanding

A year is ~526,000 minutes. Without the table, compute roughly how much yearly downtime “three nines” (99.9%) allows.
Distinguish SLI, SLO, and SLA. Why is a sensible SLO set stricter than the SLA, and what is the gap called?
You chain four components, each at 99.9%, all on the critical path. Is the result better or worse than 99.9%, and roughly what? State the rule.
You put two 99% components in parallel (one is enough). What’s the combined availability, and what formula did you use?
Name two reasons the parallel/redundancy math can lie in practice, and connect them to “what does it buy us, what does it cost?”