Disaster Recovery (RPO/RTO)

Redundancy (covered earlier) handles a server or a rack failing. Disaster recovery (DR) handles the catastrophe: an entire datacenter or cloud region goes dark — fire, flood, a fat-fingered region-wide deletion, a provider outage. The question stops being “is it up?” and becomes “how much data can we afford to lose, and how fast must we be back?” Those two numbers — RPO and RTO — turn a vague fear into an engineering budget.

                                 disaster strikes
                                         │
                                         ▼
...──[ last good backup ]──────────────[ X ]──────[ outage ]──────[ service restored ]──...
               └────────── RPO ──────────┘└────────────── RTO ─────────────┘
                 how much DATA you lose       how much TIME you're down

  • RPO — Recovery Point Objective: the maximum acceptable data loss, measured in time. “RPO = 5 minutes” means after a disaster you may lose up to the last 5 minutes of writes. RPO is set by your backup/replication frequency.
  • RTO — Recovery Time Objective: the maximum acceptable downtime. “RTO = 1 hour” means you must be serving again within an hour. RTO is set by how fast your failover/restore runs.
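
To make the two definitions concrete, here is a minimal sketch that measures a single incident against its objectives. The names `Objectives` and `assess_incident` and all timestamps are hypothetical, chosen only for illustration: the data-loss window runs from the last good copy to the disaster, and the downtime runs from the disaster to restoration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Objectives:
    rpo: timedelta  # maximum acceptable data loss, measured in time
    rto: timedelta  # maximum acceptable downtime


def assess_incident(last_good_copy: datetime,
                    disaster_at: datetime,
                    restored_at: datetime,
                    objectives: Objectives) -> dict:
    """Compare what an incident actually cost against the stated objectives."""
    data_loss_window = disaster_at - last_good_copy  # writes in this window are gone
    downtime = restored_at - disaster_at             # time spent not serving
    return {
        "data_loss_window": data_loss_window,
        "downtime": downtime,
        "rpo_met": data_loss_window <= objectives.rpo,
        "rto_met": downtime <= objectives.rto,
    }


# Hypothetical incident: last replicated write at 12:00, disaster at 12:04,
# service restored at 12:50, judged against RPO = 5 minutes and RTO = 1 hour.
report = assess_incident(
    last_good_copy=datetime(2024, 1, 1, 12, 0),
    disaster_at=datetime(2024, 1, 1, 12, 4),
    restored_at=datetime(2024, 1, 1, 12, 50),
    objectives=Objectives(rpo=timedelta(minutes=5), rto=timedelta(hours=1)),
)
print(report["rpo_met"], report["rto_met"])  # True True
```

In this made-up incident 4 minutes of writes are lost and the service is down for 46 minutes, so both objectives hold. Stretch the gap between copies past 5 minutes and the RPO check fails even if the restore itself is fast.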

Each strategy is a point on the cost-vs-RPO/RTO curve. “What does this buy us, and what does it cost?” is literally the design question here.

| Strategy | How it works | RPO / RTO | Cost |
|---|---|---|---|
| Backup & restore | Periodic backups to another region; spin up infra after disaster | Hours / hours | $ |
| Pilot light | Core data replicated continuously; minimal infra idle, scale up on failover | Minutes / tens of minutes | $$ |
| Warm standby | A scaled-down copy of the system always running in another region | Seconds to minutes / minutes | $$$ |
| Multi-site active-active | Full system live in 2+ regions serving traffic simultaneously | ~0 / ~0 | $$$$ |
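
One way to read the four strategies is as a lookup: given RPO and RTO targets, pick the cheapest one that satisfies both. The sketch below does exactly that. The concrete durations are illustrative assumptions standing in for “hours”, “minutes” and “~0”, not measured figures, and `Strategy` and `cheapest_strategy` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    rpo: timedelta   # assumed worst-case data loss for this strategy
    rto: timedelta   # assumed worst-case downtime for this strategy
    cost_rank: int   # 1 = cheapest ($) ... 4 = most expensive ($$$$)


# Illustrative numbers only: each duration is an assumption standing in for
# the qualitative ranges ("hours", "minutes", "~0"); "~0" is modelled as 0.
STRATEGIES = [
    Strategy("backup & restore", timedelta(hours=24), timedelta(hours=24), 1),
    Strategy("pilot light", timedelta(minutes=10), timedelta(minutes=30), 2),
    Strategy("warm standby", timedelta(minutes=1), timedelta(minutes=5), 3),
    Strategy("multi-site active-active", timedelta(0), timedelta(0), 4),
]


def cheapest_strategy(rpo_target: timedelta,
                      rto_target: timedelta) -> Optional[Strategy]:
    """Return the lowest-cost strategy whose RPO and RTO both meet the targets."""
    candidates = [s for s in STRATEGIES
                  if s.rpo <= rpo_target and s.rto <= rto_target]
    return min(candidates, key=lambda s: s.cost_rank, default=None)


# A day-old rebuild is acceptable, so the cheapest strategy qualifies.
print(cheapest_strategy(timedelta(hours=24), timedelta(days=1)).name)
# Losing anything is unacceptable, so only the most expensive one qualifies.
print(cheapest_strategy(timedelta(0), timedelta(minutes=1)).name)
```

Under these assumed numbers the first call prints `backup & restore` and the second prints `multi-site active-active`. Tightening either target can only move the answer toward the expensive end, never the cheap one.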

The jump from warm standby to active-active is where costs explode — you’re now running (and paying for) your entire stack twice and solving cross-region data consistency (see Replication and CAP & PACELC).

Why backups are necessary but not sufficient

Replication protects against hardware failure but faithfully replicates mistakes: a bad migration or a DELETE with no WHERE clause propagates to every replica instantly. Backups — point-in-time copies you can roll back to — are your only defense against logical corruption and human error.
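
The difference is easy to see in a toy model. The sketch below is a deliberately simplified in-memory store (the class `ReplicatedStore` and its methods are hypothetical, not any real database API) in which every write, including a destructive one, is mirrored to all replicas, while a snapshot taken earlier is unaffected.

```python
import copy


class ReplicatedStore:
    """Toy primary with synchronous replicas: every write is mirrored immediately."""

    def __init__(self, replica_count: int = 2):
        self.primary: dict = {}
        self.replicas: list = [{} for _ in range(replica_count)]

    def put(self, key, value):
        self.primary[key] = value
        for replica in self.replicas:
            replica[key] = value

    def delete_all(self):
        # The "DELETE with no WHERE clause": replicas apply it as
        # faithfully as they apply any other write.
        self.primary.clear()
        for replica in self.replicas:
            replica.clear()

    def snapshot(self) -> dict:
        # Point-in-time copy, independent of later writes.
        return copy.deepcopy(self.primary)

    def restore(self, snap: dict):
        self.primary = copy.deepcopy(snap)
        self.replicas = [copy.deepcopy(snap) for _ in self.replicas]


store = ReplicatedStore()
store.put("order:1", "paid")
store.put("order:2", "shipped")
backup = store.snapshot()      # backup taken before the mistake
store.put("order:3", "new")    # written after the backup, so not in it
store.delete_all()             # the mistake reaches every replica

rows_left_on_replicas = [len(r) for r in store.replicas]
store.restore(backup)
print(rows_left_on_replicas, sorted(store.primary))
# [0, 0] ['order:1', 'order:2']
```

The replicas offer nothing to recover from, and the restore brings back only what existed at snapshot time. `order:3` is gone, which is the RPO showing up in practice.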

Don’t default to “zero.” Derive RPO/RTO from the cost of downtime and data loss versus the cost of the DR tier:

  • A payments ledger: RPO ≈ 0 (losing a transaction is unacceptable) → active-active / sync replication
  • An analytics dashboard: RPO = 24h, RTO = 1 day (a day-old rebuild is fine) → nightly backup & restore

Different parts of one system often warrant different tiers — protect the ledger fiercely, let the recommendation cache rebuild lazily.
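
A rough way to do that derivation is to compare, per component, the yearly price of each tier plus the expected loss if a disaster hits while on that tier. The sketch below is a back-of-the-envelope model under stated assumptions: every figure (tier prices, hourly loss rates, disaster frequency) is hypothetical and chosen only to illustrate the shape of the trade-off, and the names `Tier`, `Component`, `expected_annual_cost` and `best_tier` are made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    annual_cost: float  # hypothetical yearly spend on this DR tier
    rpo_hours: float    # assumed data-loss window if disaster strikes
    rto_hours: float    # assumed downtime if disaster strikes


@dataclass(frozen=True)
class Component:
    name: str
    cost_per_hour_down: float       # hypothetical cost of one hour of downtime
    cost_per_hour_data_lost: float  # hypothetical cost of losing one hour of writes


def expected_annual_cost(tier: Tier, comp: Component,
                         disasters_per_year: float) -> float:
    """Tier price plus the expected yearly loss from disasters on that tier."""
    loss_per_disaster = (tier.rto_hours * comp.cost_per_hour_down
                         + tier.rpo_hours * comp.cost_per_hour_data_lost)
    return tier.annual_cost + disasters_per_year * loss_per_disaster


def best_tier(tiers, comp: Component, disasters_per_year: float) -> Tier:
    return min(tiers, key=lambda t: expected_annual_cost(t, comp, disasters_per_year))


# All numbers below are invented for illustration.
TIERS = [
    Tier("backup & restore", 20_000, rpo_hours=24, rto_hours=24),
    Tier("pilot light", 60_000, rpo_hours=0.2, rto_hours=0.5),
    Tier("warm standby", 200_000, rpo_hours=0.05, rto_hours=0.25),
    Tier("multi-site active-active", 600_000, rpo_hours=0, rto_hours=0),
]
ledger = Component("payments ledger", 500_000, 100_000_000)
cache = Component("recommendation cache", 200, 0)  # rebuildable, so lost data costs nothing

DISASTERS_PER_YEAR = 0.1  # assumed: one regional disaster per decade
print(best_tier(TIERS, ledger, DISASTERS_PER_YEAR).name)
print(best_tier(TIERS, cache, DISASTERS_PER_YEAR).name)
```

With these invented inputs the ledger lands on `multi-site active-active` and the cache on `backup & restore`. The specific numbers matter less than the habit: write down what an hour of downtime and an hour of lost data cost for each component, and let that decide the tier.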

What does this buy us, and what does it cost? Every step toward zero RPO/RTO buys resilience against ever-larger disasters and costs real money and complexity (duplicate infrastructure, cross-region consistency). DR planning is the discipline of pricing the disaster honestly and buying exactly as much protection as each piece of data deserves — no more, no less.

  1. Define RPO and RTO precisely, and say which one each of “backup frequency” and “failover speed” determines.
  2. Why does replication not protect you from a bad migration, and what does?
  3. Order the four DR strategies by cost and explain what jumps so sharply at active-active.
  4. Why is an untested backup dangerous, and what’s the fix?
  5. Give two systems that justify very different RPO targets, and explain why.