Disaster Recovery (RPO/RTO)
Redundancy (covered earlier) handles a server or a rack failing. Disaster recovery (DR) handles the catastrophe: an entire datacenter or cloud region goes dark — fire, flood, a fat-fingered region-wide deletion, a provider outage. The question stops being “is it up?” and becomes “how much data can we afford to lose, and how fast must we be back?” Those two numbers — RPO and RTO — turn a vague fear into an engineering budget.
The two numbers that define a DR plan
                             disaster strikes
                                     │
                                     ▼
    ...──[ last good backup ]──────[ X ]──────[ outage ]──────[ service restored ]──...
                  └─────── RPO ──────┘ └───────────── RTO ─────────────┘
                  how much DATA you lose   how much TIME you're down

- RPO (Recovery Point Objective): the maximum acceptable data loss, measured in time. “RPO = 5 minutes” means after a disaster you may lose up to the last 5 minutes of writes. RPO is set by your backup/replication frequency.
- RTO (Recovery Time Objective): the maximum acceptable downtime. “RTO = 1 hour” means you must be serving again within an hour. RTO is set by how fast your failover/restore runs.
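To make the two definitions concrete, here is a minimal sketch that measures both windows for a single incident and checks them against the targets. The function names and the sample timestamps are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, disaster: datetime) -> timedelta:
    """Writes made after the last good backup are lost: this is what RPO bounds."""
    return disaster - last_backup


def downtime(disaster: datetime, restored: datetime) -> timedelta:
    """Time from the disaster until service is back: this is what RTO bounds."""
    return restored - disaster


def meets_objectives(last_backup, disaster, restored, rpo, rto) -> bool:
    """True only if both the data loss and the downtime stay within target."""
    return (data_loss_window(last_backup, disaster) <= rpo
            and downtime(disaster, restored) <= rto)


# Hypothetical incident: the disaster hits 4 minutes after the last backup,
# and service is restored 46 minutes later.
last_backup = datetime(2024, 1, 1, 12, 0)
disaster = datetime(2024, 1, 1, 12, 4)
restored = datetime(2024, 1, 1, 12, 50)

print(data_loss_window(last_backup, disaster))  # 0:04:00
print(downtime(disaster, restored))             # 0:46:00
print(meets_objectives(last_backup, disaster, restored,
                       rpo=timedelta(minutes=5), rto=timedelta(hours=1)))  # True
```

Note that the worst case for data loss is a disaster just before the next backup would have run, which is why the backup or replication interval effectively sets the RPO you can promise.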
The DR strategies, cheapest to costliest
Each strategy is a point on the cost-vs-RPO/RTO curve. “What does this buy us, and what does it cost?” is literally the design question here.
| Strategy | How it works | RPO / RTO | Cost |
|---|---|---|---|
| Backup & restore | Periodic backups to another region; spin up infra after disaster | Hours / hours | $ |
| Pilot light | Core data replicated continuously; minimal infra idle, scale up on failover | Minutes / tens of min | $$ |
| Warm standby | A scaled-down copy of the system always running in another region | Seconds-min / minutes | $$$ |
| Multi-site active-active | Full system live in 2+ regions serving traffic simultaneously | ~0 / ~0 | $$$$ |
The jump from warm standby to active-active is where costs explode — you’re now running (and paying for) your entire stack twice and solving cross-region data consistency (see Replication and CAP & PACELC).
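One way to read the table is as a lookup: given RPO/RTO targets, pick the cheapest tier that satisfies both. The sketch below assumes concrete numbers for each tier purely for illustration (the table only gives orders of magnitude, and "~0" is modeled as exactly zero); the `Strategy` class and `cheapest_strategy` function are hypothetical names.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    rpo: timedelta   # assumed worst-case data loss for this tier
    rto: timedelta   # assumed worst-case downtime for this tier
    cost_rank: int   # 1 = "$" ... 4 = "$$$$", as in the table


# Illustrative values only; real numbers depend on your system and provider.
STRATEGIES = [
    Strategy("backup & restore", timedelta(hours=12), timedelta(hours=12), 1),
    Strategy("pilot light", timedelta(minutes=10), timedelta(minutes=30), 2),
    Strategy("warm standby", timedelta(minutes=1), timedelta(minutes=5), 3),
    Strategy("multi-site active-active", timedelta(0), timedelta(0), 4),
]


def cheapest_strategy(rpo_target: timedelta, rto_target: timedelta) -> Optional[Strategy]:
    """Return the lowest-cost tier whose assumed RPO and RTO both fit the targets."""
    candidates = [s for s in STRATEGIES
                  if s.rpo <= rpo_target and s.rto <= rto_target]
    return min(candidates, key=lambda s: s.cost_rank, default=None)


print(cheapest_strategy(timedelta(hours=24), timedelta(hours=24)).name)   # backup & restore
print(cheapest_strategy(timedelta(minutes=15), timedelta(hours=1)).name)  # pilot light
print(cheapest_strategy(timedelta(0), timedelta(minutes=1)).name)         # multi-site active-active
```

Tightening either target only ever moves the answer down the table, toward the more expensive tiers.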
Why backups are necessary but not sufficient
Replication protects against hardware failure but faithfully replicates mistakes: a bad migration or a DELETE with no WHERE clause propagates to every replica instantly. Backups, point-in-time copies you can roll back to, are your only defense against logical corruption and human error.
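The difference can be shown with a toy model. In this sketch a "database" is just a Python dict, replication means applying every write to all nodes, and a backup is a frozen copy taken before the mistake. All names and data are hypothetical, and real replication and backup tooling is far more involved.

```python
import copy


def apply_everywhere(nodes, op):
    """Model replication: every node applies the same write, good or bad."""
    for node in nodes:
        op(node)


primary = {"alice": 100, "bob": 250}
replica = copy.deepcopy(primary)

# Point-in-time backup: an independent copy that later writes cannot touch.
backup = copy.deepcopy(primary)

# The mistake: the equivalent of a DELETE with no WHERE clause.
apply_everywhere([primary, replica], lambda db: db.clear())

print(primary, replica)  # {} {}  (the replica faithfully copied the mistake)

# Recovery: roll back to the point-in-time copy.
restored = copy.deepcopy(backup)
print(restored)  # {'alice': 100, 'bob': 250}
```

The replica is no help here because it did exactly what it was designed to do; only the copy that was isolated from later writes still holds the data.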
Picking your targets
Don’t default to “zero.” Derive RPO/RTO from the cost of downtime and data loss versus the cost of the DR tier:
- A payments ledger: RPO ≈ 0 (losing a transaction is unacceptable) → active-active / sync replication
- An analytics dashboard: RPO = 24h, RTO = 1 day (a day-old rebuild is fine) → nightly backup & restore

Different parts of one system often warrant different tiers: protect the ledger fiercely, let the recommendation cache rebuild lazily.
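The trade-off above can be sketched as a simple expected-cost comparison: the yearly price of a tier plus the expected yearly loss if a disaster strikes under that tier. Every figure below (tier prices, disaster frequency, loss rates) is invented for illustration, and the linear cost model is an assumption, not a standard formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    annual_cost: float  # what the DR tier costs per year
    rpo_hours: float    # assumed data-loss window if disaster strikes
    rto_hours: float    # assumed downtime if disaster strikes


def expected_annual_cost(tier, disasters_per_year,
                         loss_per_hour_of_data, loss_per_hour_down):
    """Tier price plus the expected yearly disaster loss under that tier."""
    loss_per_disaster = (tier.rpo_hours * loss_per_hour_of_data
                         + tier.rto_hours * loss_per_hour_down)
    return tier.annual_cost + disasters_per_year * loss_per_disaster


def best_tier(tiers, **workload):
    """Pick the tier with the lowest expected annual cost for this workload."""
    return min(tiers, key=lambda t: expected_annual_cost(t, **workload))


# Made-up tiers and workloads; substitute your own estimates.
TIERS = [
    Tier("backup & restore", 5_000, rpo_hours=24, rto_hours=24),
    Tier("pilot light", 30_000, rpo_hours=10 / 60, rto_hours=0.5),
    Tier("warm standby", 80_000, rpo_hours=0.02, rto_hours=0.1),
    Tier("active-active", 400_000, rpo_hours=0, rto_hours=0),
]

ledger = dict(disasters_per_year=0.1,
              loss_per_hour_of_data=200_000_000, loss_per_hour_down=500_000)
dashboard = dict(disasters_per_year=0.1,
                 loss_per_hour_of_data=100, loss_per_hour_down=200)

print(best_tier(TIERS, **ledger).name)     # active-active
print(best_tier(TIERS, **dashboard).name)  # backup & restore
```

With these assumed numbers the ledger's data is so expensive to lose that even the costliest tier pays for itself, while the dashboard never justifies more than nightly backups. Changing the estimates changes the answer, which is the point of deriving targets instead of defaulting to zero.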
The thread
What does this buy us, and what does it cost? Every step toward zero RPO/RTO buys resilience against ever-larger disasters and costs real money and complexity (duplicate infrastructure, cross-region consistency). DR planning is the discipline of pricing the disaster honestly and buying exactly as much protection as each piece of data deserves: no more, no less.
Check your understanding
- Define RPO and RTO precisely, and say which one each of “backup frequency” and “failover speed” determines.
- Why does replication not protect you from a bad migration, and what does?
- Order the four DR strategies by cost and explain what jumps so sharply at active-active.
- Why is an untested backup dangerous, and what’s the fix?
- Give two systems that justify very different RPO targets, and explain why.