Alerting & On-Call

Telemetry is useless if no one looks at it. Alerting is the bridge from “the system knows something is wrong” to “a human is doing something about it.” But an alert has a cost most teams underestimate: it spends a human’s attention, and worse, their trust. An alert that fires for nothing trains people to ignore alerts — including the one that mattered. This page is about firing the pager only when it should, and making sure that when it does, the person who wakes up can actually fix the thing.

The cardinal rule: alert on symptoms, not causes

This is the single most important idea in operational alerting. A cause is something internal an engineer worries about — “CPU is at 95%,” “the cache hit rate dropped,” “a replica fell behind.” A symptom is something a user actually feels — “checkout is returning errors,” “page loads take 8 seconds,” “no orders have completed in 5 minutes.”

Alert on symptoms. Here’s why the distinction is not pedantic:

  • A cause without a symptom is not an emergency. CPU at 95% with users perfectly happy is not worth waking someone. If you alert on it, you’ve spent a human’s night on a non-problem — and taught them the pager lies.
  • A symptom catches causes you never predicted. You cannot enumerate every way checkout can break, but you can detect that checkout is broken. Symptom alerts catch the unknown-unknowns that cause-based alerts miss entirely.

CAUSE-based alert               SYMPTOM-based alert
"CPU > 90%"                     "checkout error rate > 2%"
"cache hit rate < 80%"          "p99 latency > 1s"
"disk 85% full"                 "successful orders/min == 0"
─────────────────────────────   ─────────────────────────────
may fire with no user impact    fires exactly when users hurt
misses unpredicted failures     catches the unknown-unknowns
→ belongs on a DASHBOARD        → belongs on the PAGER

The RED metrics — rate, errors, duration — are symptom metrics by construction, which is why they make the best paging signals. Causes still belong on dashboards: once a symptom pages you, the cause dashboards are how you diagnose it. The rule is about what wakes someone up, not about what you measure.
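As a minimal sketch of what paging on symptoms could look like, the Python below checks one window of request statistics against the three symptom thresholds from the comparison above. The `WindowStats` shape and the `paging_symptoms` name are hypothetical, and the window is assumed to be one minute of checkout traffic. Note that nothing about CPU, cache, or disk appears in the inputs.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Request stats for one evaluation window (hypothetical shape)."""
    total_requests: int
    failed_requests: int
    p99_latency_s: float


def paging_symptoms(stats: WindowStats) -> list:
    """Return the user-facing symptoms that breach their thresholds."""
    firing = []

    # Symptom 1: nothing succeeded in the window at all.
    if stats.total_requests - stats.failed_requests == 0:
        firing.append("successful orders/min == 0")

    # Symptom 2: error rate above 2% (skip the ratio when there is no traffic).
    if stats.total_requests > 0:
        error_rate = stats.failed_requests / stats.total_requests
        if error_rate > 0.02:
            firing.append("checkout error rate > 2%")

    # Symptom 3: tail latency above one second.
    if stats.p99_latency_s > 1.0:
        firing.append("p99 latency > 1s")

    return firing


healthy = WindowStats(total_requests=1000, failed_requests=5, p99_latency_s=0.3)
broken = WindowStats(total_requests=1000, failed_requests=50, p99_latency_s=0.3)
print(paging_symptoms(healthy))  # []
print(paging_symptoms(broken))   # ['checkout error rate > 2%']
```

The thresholds here are copied from the table purely for illustration; in practice they would be derived from the service's SLO rather than hard-coded.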

How bad does a symptom have to be before it’s worth a 3 a.m. page? “Any errors at all” is too sensitive — every real system has a steady trickle of failures. The principled answer comes from SLOs and error budgets, which connect back to availability and the nines.

  • An SLO (Service Level Objective) is a target: “99.9% of checkout requests succeed over 30 days.”
  • The error budget is the complement: the amount of failure the SLO allows. 99.9% over 30 days permits roughly 43 minutes of downtime (0.1% of the 43,200 minutes in 30 days), or its equivalent in errors, per month. That budget is a resource you are allowed to spend.
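The arithmetic behind that figure can be checked in a few lines. This is a small sketch; the function name `error_budget_minutes` is made up for illustration.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full downtime an SLO permits over the window."""
    window_minutes = window_days * 24 * 60  # 30 days -> 43,200 minutes
    return (1.0 - slo) * window_minutes


print(f"{error_budget_minutes(0.999):.1f} minutes")   # 43.2 minutes
print(f"{error_budget_minutes(0.9999):.2f} minutes")  # 4.32 minutes
```

Each extra nine cuts the budget by a factor of ten, which is why the choice of SLO drives how sensitive the pager has to be.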

This reframes alerting entirely. You don’t page on “an error happened”; you page on the rate at which you’re burning the budget. A slow burn (at the current rate the budget lasts a few more days) is a ticket. A fast burn (at the current rate the budget is gone within the hour) is a page. Burn-rate alerting makes the pager proportional to actual danger:

budget remaining ████████████░░░░░░░░ 60%
slow burn → "you'll run out in 5 days" → ticket, investigate tomorrow
fast burn → "you'll run out in 45 min" → PAGE, act now
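One way to turn that picture into a decision is sketched below. Burn rate is taken here as the observed error rate divided by the error rate the SLO allows, so a burn rate of 1 spends exactly one budget per window. The function names and the cutoffs (page if the budget would be gone within 2 hours, ticket if within 7 days) are illustrative assumptions, not values prescribed by this page.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'sustainable' the budget is being spent."""
    return error_rate / (1.0 - slo)


def hours_until_exhausted(error_rate: float, slo: float,
                          budget_remaining: float = 1.0,
                          window_days: int = 30) -> float:
    """Hours until the remaining budget is gone at the current error rate."""
    rate = burn_rate(error_rate, slo)
    if rate <= 0:
        return float("inf")  # no errors, no burn
    return budget_remaining * window_days * 24 / rate


def classify(hours_left: float,
             page_within_hours: float = 2.0,        # assumed cutoff
             ticket_within_hours: float = 7 * 24.0  # assumed cutoff
             ) -> str:
    """Map time-to-exhaustion onto page / ticket / ok."""
    if hours_left <= page_within_hours:
        return "page"
    if hours_left <= ticket_within_hours:
        return "ticket"
    return "ok"


slo = 0.999
for observed in (0.5, 0.01, 0.0005):
    hours = hours_until_exhausted(observed, slo)
    print(f"error rate {observed:.2%}: {classify(hours)}")
```

With a 99.9% SLO, a 50% error rate empties a full budget in under two hours and pages, a 1% error rate lasts about three days and becomes a ticket, and a 0.05% error rate is within budget.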

The error budget also resolves the eternal dev-vs-ops fight: if budget remains, ship features fast; if the budget is blown, freeze risky deploys and spend effort on reliability. The number turns a values argument into a data-driven decision.

Alert fatigue: the failure mode that kills alerting

The deadliest problem in on-call is not too few alerts — it’s too many. Alert fatigue is what happens when the pager cries wolf:

  • A noisy alert fires 40 times a week, 39 of them false. The on-call learns to dismiss it reflexively — and dismisses the 40th, the real one, too.
  • A flapping alert (toggling on/off on a borderline threshold) pages repeatedly for the same issue.
  • One root cause fires twenty downstream alerts at once — an alert storm — burying the actual signal.

The cost is paid in human reliability: a fatigued on-call misses real incidents, burns out, and quits. Fighting fatigue is continuous work: delete alerts that never lead to action, raise thresholds that fire too eagerly, add hysteresis so alerts don’t flap, group and deduplicate related alerts into one notification, and route by severity so a ticket never buzzes a phone at 3 a.m. A good rule: every page should be actionable, novel, and urgent. If it fails any one of those three, it shouldn’t be a page.
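Hysteresis is the least obvious of those tactics, so here is a minimal sketch of it. The alert fires when the value rises above one threshold but clears only when it falls below a lower one, so a value hovering around the firing threshold produces a single notification. The class name and threshold values are hypothetical.

```python
class HysteresisAlert:
    """Fires above `fire_above`, clears only below `clear_below`."""

    def __init__(self, fire_above: float, clear_below: float):
        if clear_below > fire_above:
            raise ValueError("clear_below must not exceed fire_above")
        self.fire_above = fire_above
        self.clear_below = clear_below
        self.firing = False

    def update(self, value: float) -> bool:
        """Feed one sample; return True only when a new notification is due."""
        if not self.firing and value > self.fire_above:
            self.firing = True
            return True  # transition into firing: notify once
        if self.firing and value < self.clear_below:
            self.firing = False  # recovered well below the firing threshold
        return False


alert = HysteresisAlert(fire_above=0.02, clear_below=0.01)
samples = [0.019, 0.021, 0.019, 0.021, 0.005, 0.030]
notifications = [alert.update(v) for v in samples]
print(notifications)  # [False, True, False, False, False, True]
```

A plain `value > 0.02` check would have notified three times on the same samples; the gap between the two thresholds absorbs the borderline wobble.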

It’s 3 a.m., the pager fires, and the responder is half-asleep and not the person who wrote this service. What turns that page into a fix is a runbook — a short, specific document linked directly from the alert that says: what this alert means, how to confirm it’s real, the first diagnostic steps, known causes and their fixes, and how to escalate.
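One way to keep runbooks from quietly going missing is to lint them. The sketch below assumes a runbook is stored as a dictionary with one key per section listed above; the key names and the `missing_sections` helper are invented for illustration and do not correspond to any particular tool.

```python
# Hypothetical section keys, mirroring the five items a runbook should cover.
REQUIRED_SECTIONS = (
    "meaning",         # what this alert means
    "how_to_confirm",  # how to confirm it's real
    "first_steps",     # the first diagnostic steps
    "known_causes",    # known causes and their fixes
    "escalation",      # how to escalate
)


def missing_sections(runbook: dict) -> list:
    """Return the required sections that are absent or blank."""
    return [
        section
        for section in REQUIRED_SECTIONS
        if not str(runbook.get(section, "")).strip()
    ]


draft = {
    "meaning": "Checkout error rate is above the SLO burn threshold.",
    "how_to_confirm": "Check the checkout error-rate dashboard.",
    "first_steps": "",
}
print(missing_sections(draft))  # ['first_steps', 'known_causes', 'escalation']
```

A check like this could run wherever alert definitions are reviewed, so that an alert cannot page until its runbook is complete.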

A good runbook is the difference between a 5-minute mitigation and a 2-hour fumble. It encodes the hard-won knowledge of past incidents so the next responder doesn’t rediscover it under pressure. The discipline that closes the loop is the blameless postmortem: after every incident, write down what happened, why (without blaming people — blame the system and the process), and what concrete change prevents a recurrence. Those changes flow back into better alerts, better runbooks, and often a safer deployment strategy — which is the next page.

  1. State the symptom-vs-cause rule. Why does a symptom alert catch failures you never predicted, and why does a cause without a symptom not deserve a page?
  2. Given a 99.9% monthly SLO, roughly how much downtime is the error budget? How does burn rate decide whether something is a page or a ticket?
  3. How does an error budget settle the argument between shipping features fast and protecting reliability?
  4. What is alert fatigue, what does it actually cost, and name three concrete tactics to reduce it.
  5. What does a runbook contain, and why does linking one from every page matter most at 3 a.m.?