Disaster Recovery (RPO/RTO)

Redundancy (covered earlier) handles a server or a rack failing. Disaster recovery (DR) handles the catastrophe: an entire datacenter or cloud region goes dark — fire, flood, a fat-fingered region-wide deletion, a provider outage. The question stops being “is it up?” and becomes “how much data can we afford to lose, and how fast must we be back?” Those two numbers — RPO and RTO — turn a vague fear into an engineering budget.

The two numbers that define a DR plan

                             disaster strikes
                                     │
                                     ▼
...──[ last good backup ]──────────[ X ]──────────[ service restored ]──...
              └──────── RPO ─────────┴──────── RTO ────────┘
              how much DATA you lose   how much TIME you're down

RPO (Recovery Point Objective): the maximum acceptable data loss, measured in time. “RPO = 5 minutes” means that after a disaster you may lose up to the last 5 minutes of writes. The RPO you can actually achieve is bounded by how often you back up or replicate.
RTO (Recovery Time Objective): the maximum acceptable downtime. “RTO = 1 hour” means you must be serving again within an hour. The RTO you can actually achieve is bounded by how fast your failover or restore runs.
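Both definitions reduce to timestamp arithmetic. The sketch below is illustrative only: the function names `data_loss_window` and `downtime` and the incident timestamps are assumptions made up for this example, not part of any tool.

```python
from datetime import datetime, timedelta

def data_loss_window(last_good_copy: datetime, disaster: datetime) -> timedelta:
    """Writes made in this window are lost; compare it against the RPO."""
    return disaster - last_good_copy

def downtime(disaster: datetime, restored: datetime) -> timedelta:
    """Time the service was unavailable; compare it against the RTO."""
    return restored - disaster

# Hypothetical incident timeline (all values invented for illustration).
last_backup = datetime(2024, 1, 1, 11, 57)
disaster = datetime(2024, 1, 1, 12, 0)
restored = datetime(2024, 1, 1, 12, 45)

rpo = timedelta(minutes=5)   # the "RPO = 5 minutes" example from the text
rto = timedelta(hours=1)     # the "RTO = 1 hour" example from the text

loss = data_loss_window(last_backup, disaster)
down = downtime(disaster, restored)

print(f"data lost: {loss}, within RPO: {loss <= rpo}")
print(f"downtime:  {down}, within RTO: {down <= rto}")
```

With periodic backups, the worst-case loss window is roughly the backup interval itself, so a 5-minute RPO needs a fresh copy at least every 5 minutes.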

The DR strategies, cheapest to costliest

Each strategy is a point on the cost-vs-RPO/RTO curve. “What does this buy us, and what does it cost?” is literally the design question here.

Strategy	How it works	Typical RPO / RTO	Relative cost
Backup & restore	Periodic backups to another region; infrastructure is provisioned after the disaster	Hours / hours	$
Pilot light	Core data replicated continuously; minimal infrastructure sits idle and scales up on failover	Minutes / tens of minutes	$$
Warm standby	A scaled-down copy of the system always running in another region	Seconds to minutes / minutes	$$$
Multi-site active-active	Full system live in 2+ regions serving traffic simultaneously	~0 / ~0	$$$$

The jump from warm standby to active-active is where costs explode — you’re now running (and paying for) your entire stack twice and solving cross-region data consistency (see Replication and CAP & PACELC).
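Choosing a row from the table can be framed as "the cheapest strategy whose RPO and RTO both fit the targets". The following is a minimal sketch under stated assumptions: the `Strategy` class, the `cheapest_meeting` helper, and especially the concrete durations are hypothetical stand-ins for the table's rough ranges, not measured values.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional

@dataclass(frozen=True)
class Strategy:
    name: str
    rpo: timedelta   # assumed worst-case data loss for this tier
    rto: timedelta   # assumed worst-case downtime for this tier
    cost_rank: int   # 1 = cheapest ($), 4 = costliest ($$$$)

# Illustrative numbers only; real values depend on the system and must be measured.
STRATEGIES = [
    Strategy("backup & restore", timedelta(hours=24), timedelta(hours=8), 1),
    Strategy("pilot light", timedelta(minutes=5), timedelta(minutes=30), 2),
    Strategy("warm standby", timedelta(minutes=1), timedelta(minutes=10), 3),
    Strategy("active-active", timedelta(0), timedelta(0), 4),  # "~0" treated as 0
]

def cheapest_meeting(rpo_target: timedelta, rto_target: timedelta) -> Optional[Strategy]:
    """Return the lowest-cost strategy whose RPO and RTO both fit the targets."""
    candidates = [s for s in STRATEGIES
                  if s.rpo <= rpo_target and s.rto <= rto_target]
    return min(candidates, key=lambda s: s.cost_rank, default=None)

choice = cheapest_meeting(timedelta(minutes=5), timedelta(hours=1))
print(choice.name if choice else "no strategy meets the targets")
```

Tightening either target can only move the answer down the table, which is the cost curve in executable form.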

Why backups are necessary but not sufficient

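Replication protects against hardware failure but faithfully replicates mistakes: a bad migration or a DELETE with no WHERE clause propagates to every replica within moments. Backups, point-in-time copies you can roll back to, are your only defense against logical corruption and human error. A backup only counts once a restore from it has succeeded: an untested backup may be corrupt, incomplete, or too slow to restore within your RTO, so rehearse restores on a schedule.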
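The difference between a replica and a backup can be shown with a toy in-memory model. This is only a sketch: plain dictionaries stand in for databases, and `replicate` and `take_backup` are hypothetical helpers, not any real database's API.

```python
import copy

primary = {"alice": 100, "bob": 250}
replica = {}
backups = []  # list of (label, snapshot) point-in-time copies

def replicate():
    """Mirror the primary's current state onto the replica, mistakes included."""
    replica.clear()
    replica.update(primary)

def take_backup(label):
    """Store an independent point-in-time copy of the primary."""
    backups.append((label, copy.deepcopy(primary)))

replicate()
take_backup("before-migration")

# The bad migration: the equivalent of a DELETE with no WHERE clause.
primary.clear()
replicate()  # the replica faithfully applies the same deletion

replica_after_mistake = dict(replica)  # empty: nothing left to fail over to

# Recovery: roll back to the last good snapshot, then re-sync the replica.
label, snapshot = backups[-1]
primary.update(copy.deepcopy(snapshot))
replicate()

print(f"replica after mistake: {replica_after_mistake}")
print(f"primary after restore from '{label}': {primary}")
```

The final restore step doubles as a restore test: it is the only proof in this model that the snapshot was usable.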

Picking your targets

Don’t default to “zero.” Derive RPO/RTO from the cost of downtime and data loss versus the cost of the DR tier:

A payments ledger:      RPO ≈ 0 (losing a transaction is unacceptable) → active-active / sync replication
An analytics dashboard: RPO = 24 h, RTO = 24 h (a day-old rebuild is fine) → nightly backup & restore

Different parts of one system often warrant different tiers — protect the ledger fiercely, let the recommendation cache rebuild lazily.
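One way to make "cost of downtime and data loss versus cost of the DR tier" concrete is to add each tier's yearly price to the expected yearly loss it leaves behind, then take the minimum. Every figure below is invented for illustration, and `TIERS`, `total_yearly_cost`, and `best_tier` are hypothetical names; treat this as a sketch of the reasoning, not a pricing model.

```python
# name: (yearly tier cost, downtime hours per disaster, hours of data lost per disaster)
# All values are assumptions chosen only to illustrate the trade-off.
TIERS = {
    "backup & restore": (2_000, 24.0, 24.0),
    "warm standby": (60_000, 0.2, 0.02),
    "active-active": (150_000, 0.0, 0.0),
}

def total_yearly_cost(tier, disasters_per_year, cost_per_down_hour, cost_per_lost_hour):
    """Tier price plus the expected yearly loss that the tier does not prevent."""
    tier_cost, down_hours, lost_hours = TIERS[tier]
    per_disaster = down_hours * cost_per_down_hour + lost_hours * cost_per_lost_hour
    return tier_cost + disasters_per_year * per_disaster

def best_tier(disasters_per_year, cost_per_down_hour, cost_per_lost_hour):
    """Pick the tier with the lowest total yearly cost."""
    return min(
        TIERS,
        key=lambda t: total_yearly_cost(
            t, disasters_per_year, cost_per_down_hour, cost_per_lost_hour
        ),
    )

# Hypothetical inputs: one regional disaster per decade for both systems.
ledger = best_tier(0.1, cost_per_down_hour=500_000, cost_per_lost_hour=50_000_000)
dashboard = best_tier(0.1, cost_per_down_hour=200, cost_per_lost_hour=50)

print(f"payments ledger     -> {ledger}")
print(f"analytics dashboard -> {dashboard}")
```

Under these assumed numbers the ledger justifies the most expensive tier while the dashboard does not, which is why one system often mixes tiers.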

The thread

What does this buy us, and what does it cost? Every step toward zero RPO/RTO buys resilience against ever-larger disasters and costs real money and complexity (duplicate infrastructure, cross-region consistency). DR planning is the discipline of pricing the disaster honestly and buying exactly as much protection as each piece of data deserves — no more, no less.

Check your understanding

Define RPO and RTO precisely, and say which one each of “backup frequency” and “failover speed” determines.
Why does replication not protect you from a bad migration, and what does?
Order the four DR strategies by cost and explain what jumps so sharply at active-active.
Why is an untested backup dangerous, and what’s the fix?
Give two systems that justify very different RPO targets, and explain why.