Logging

A log is the oldest debugging tool that still works: a timestamped record of something that happened. “User 42 logged in.” “Payment failed: card declined.” “Cache miss for key cart:991.” Logs are the highest-fidelity pillar of observability — each line can carry as much context as you care to write — and for exactly that reason they are the most expensive to store and the easiest to turn into noise. This page builds logging up from the single print statement to a system that can reconstruct one request’s entire journey across a dozen machines.

A lone developer debugging a script writes print("got here"). It works because there is one process, one machine, one human reading one terminal. Production breaks every one of those assumptions: hundreds of processes, across many machines, with no human watching any single terminal. The print statement that saved you locally is invisible at scale. Logging is the engineering of that print statement into something you can find, filter, and read later, across the fleet, when you don’t yet know what you’re looking for.

Structured logging: the single biggest upgrade

The instinct is to log human prose:

2026-06-22 14:03:11 Payment failed for user 42 on order 991 after 1200ms

A human reads that fine. A machine — the thing that actually has to search billions of these — cannot. To find “all failed payments over 1000ms for user 42,” it must parse free text with fragile regexes. Structured logging emits machine-readable key–value records instead, usually JSON:

{
  "ts": "2026-06-22T14:03:11Z",
  "level": "error",
  "msg": "payment failed",
  "user_id": 42,
  "order_id": 991,
  "duration_ms": 1200,
  "reason": "card_declined",
  "trace_id": "a1b2c3..."
}

Now “failed payments over 1000ms for user 42” is a precise query over fields, not a guess over prose. What does this buy us? Queryable, aggregatable, filterable logs. What does it cost? A little discipline at every log site and slightly larger lines on disk — a trade almost always worth making.
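As a rough sketch of what this looks like in practice, the snippet below uses Python's standard logging module to emit one JSON object per line and then runs the query from the text over those lines. The names JsonFormatter, make_logger, and slow_failures are illustrative assumptions, not part of any particular logging library, and a real log store would run the query server-side rather than in application code.

```python
import json
import logging
from datetime import datetime, timezone

# Attribute names present on every LogRecord; anything else on a record
# was supplied by the caller through `extra=`.
_STANDARD_ATTRS = set(logging.LogRecord("", 0, "", 0, "", (), None).__dict__)
_STANDARD_ATTRS |= {"message", "asctime"}


class JsonFormatter(logging.Formatter):
    """Render each record as a single-line JSON object (hypothetical helper)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
        }
        # Copy caller-supplied fields (user_id, order_id, ...) as top-level keys.
        for key, value in record.__dict__.items():
            if key not in _STANDARD_ATTRS:
                payload[key] = value
        return json.dumps(payload, default=str)


def make_logger(stream, name: str = "payments") -> logging.Logger:
    """Build a logger that writes JSON lines to `stream`."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def slow_failures(lines, user_id: int, min_ms: int):
    """Failed payments slower than `min_ms` for one user, as a field query."""
    records = (json.loads(line) for line in lines if line.strip())
    return [
        r for r in records
        if r.get("level") == "error"
        and r.get("user_id") == user_id
        and r.get("duration_ms", 0) > min_ms
    ]
```

A call such as log.error("payment failed", extra={"user_id": 42, "order_id": 991, "duration_ms": 1200}) produces a record that slow_failures(lines, 42, 1000) can match by comparing fields, with no regular expression involved.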

Not every event deserves the same attention. Levels let you emit detail liberally but read selectively:

Level   Meaning                                               Typical use
ERROR   something failed and a user/operation was affected    exceptions, failed payments
WARN    something unexpected, but handled                     retry succeeded, fell back to cache
INFO    normal, noteworthy events                             request served, job completed
DEBUG   fine-grained detail for diagnosis                     variable values, branch taken
TRACE   firehose                                              every loop iteration

In production you typically run at INFO, keeping the firehose closed. When you’re chasing a bug, you turn the dial up to DEBUG — ideally without redeploying, via dynamic config. The level is your primary control over the eternal tension between enough detail to debug and few enough lines to afford.
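One way to turn the dial without redeploying is sketched below with Python's standard logging module. It assumes some dynamic config source hands the process a mapping of logger names to level names; that mapping shape and the function name apply_log_levels are assumptions made for illustration.

```python
import logging
from typing import Mapping


def apply_log_levels(levels: Mapping[str, str]) -> None:
    """Apply {logger_name: level_name} pairs from a hypothetical config source.

    Intended to be called whenever the config changes, so that levels can be
    raised or lowered while the process keeps running.
    """
    for logger_name, level_name in levels.items():
        # Logger.setLevel accepts level names such as "DEBUG" and raises
        # ValueError for names it does not recognise.
        logging.getLogger(logger_name).setLevel(level_name.upper())
```

With the payments logger at INFO, a logger.debug(...) call is discarded before any formatting happens. After apply_log_levels({"payments": "debug"}) the same call is emitted, and applying "info" again closes the firehose.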

Centralization: logs you can’t read are worthless

A log file on a server that just crashed — or that autoscaling just terminated — is gone. And even healthy, logs scattered across 200 machines are unsearchable by hand. So production logging ships every line off the box, immediately, to a central store:

app server ─┐
app server ─┤    ship       ┌────────────────┐  index &
app server ─┼─────────────► │  log pipeline  │ ─────────► searchable store
app server ─┤  (agent /     │ (buffer/parse) │            (query, dashboards,
worker     ─┘   sidecar)    └────────────────┘             retention, alerts)

A small agent on each host (or a sidecar) tails the logs, buffers them, and forwards them to a pipeline that parses, enriches, and indexes them into a searchable backend. The buffer matters: when the central store hiccups, you want logs queued locally, not dropped. This is the same producer/consumer decoupling you saw with message queues — the agent is a producer, the store is a consumer, and the buffer absorbs the mismatch.
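The buffering behaviour can be sketched in a few lines. The class below is a toy model of the agent's local buffer, not the design of any real log shipper: BufferedShipper, its parameters, and the choice to signal an outage with ConnectionError are all assumptions. It keeps lines in a bounded in-memory queue and removes them only after the send callback succeeds.

```python
from collections import deque
from itertools import islice
from typing import Callable, Deque, List


class BufferedShipper:
    """Toy log agent: queue lines locally, ship in batches, keep them on failure."""

    def __init__(self, send: Callable[[List[str]], None],
                 max_lines: int = 10_000, batch_size: int = 500) -> None:
        self._send = send
        # Bounded buffer: once full, the oldest lines are discarded so the
        # agent cannot exhaust memory during a long outage.
        self._buffer: Deque[str] = deque(maxlen=max_lines)
        self._batch_size = batch_size

    def append(self, line: str) -> None:
        self._buffer.append(line)

    def pending(self) -> int:
        return len(self._buffer)

    def flush(self) -> int:
        """Ship as much as possible; return the number of lines shipped."""
        shipped = 0
        while self._buffer:
            batch = list(islice(self._buffer, self._batch_size))
            try:
                self._send(batch)
            except ConnectionError:
                # Store is unavailable: leave the batch queued and stop,
                # so the next flush retries from the same position.
                break
            for _ in batch:
                self._buffer.popleft()
            shipped += len(batch)
        return shipped
```

The bound is a deliberate trade-off: a hiccup in the store costs nothing, while a long outage eventually costs the oldest lines rather than the host's memory. Real agents often spill to disk instead, which this sketch does not attempt.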

Correlation IDs: stitching one request together

Here is the problem centralization creates: now every machine’s logs are interleaved in one giant stream. A single user request touched the gateway, three services, and a database — and their log lines are scattered among millions of others. How do you pull back just this request’s story?

The answer is a correlation ID (a.k.a. request ID or trace ID): a unique token generated at the edge — typically at the reverse proxy or API gateway — and propagated through every downstream call, usually as an HTTP header. Every service stamps it onto every log line it emits for that request.

edge generates trace_id = a1b2c3
   │  header: X-Request-Id: a1b2c3
gateway ──► service A ──► service B ──► database
 (logs)       (logs)        (logs)       (logs)
          all carrying trace_id = a1b2c3

Now one query, trace_id = "a1b2c3", reassembles the entire request from logs written on five different machines, sorted by timestamp (accurate only to within the clock skew between hosts). This is the logging counterpart to a full distributed trace; the correlation ID is the thread both pillars pull on. Without it, centralized logs are a haystack; with it, every request is a labeled needle.
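Inside a single service, the stamping and propagation might look like the sketch below, which uses Python's standard contextvars and logging modules. The helper names begin_request, outgoing_headers, and TraceIdFilter are assumptions for illustration. The headers argument is assumed to be a plain dict keyed exactly by X-Request-Id, whereas real HTTP frameworks treat header names case-insensitively.

```python
import contextvars
import logging
import uuid

# Holds the ID of the request currently being handled; "-" means "none".
_trace_id: contextvars.ContextVar[str] = contextvars.ContextVar("trace_id", default="-")


def begin_request(headers: dict) -> str:
    """Adopt the caller's ID, or mint one if this service is the edge."""
    trace_id = headers.get("X-Request-Id") or uuid.uuid4().hex
    _trace_id.set(trace_id)
    return trace_id


def outgoing_headers() -> dict:
    """Headers to attach to every downstream call for the current request."""
    return {"X-Request-Id": _trace_id.get()}


class TraceIdFilter(logging.Filter):
    """Stamp the current trace ID onto every record that passes through."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = _trace_id.get()
        return True  # never drops a record, only annotates it
```

Attaching TraceIdFilter to a handler and including %(trace_id)s in its format string means individual log calls never mention the ID. Each one picks it up from the request context, which is what keeps the stamping consistent across a codebase.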

Logging’s defining trade-off is brutal at scale. A busy service can emit more bytes of logs than it processes in actual data. The costs compound:

  • Storage and indexing. Indexed, searchable log storage is far pricier than raw object storage. Retention is a direct dial on the bill: at a steady ingest rate, keeping 90 days instead of 7 means holding roughly 13x as much data.
  • Noise. When everything is logged at INFO, the one line that mattered is buried under a million that didn’t. Excessive logging actively reduces observability.
  • Performance. Synchronous logging on the hot path adds latency; high-volume logging can saturate disk or network I/O.

The mitigations are sampling (keep all errors, but only 1% of routine success lines), tiered retention (hot/searchable for days, cold/archived for months), and ruthless level discipline. The goal is never “log everything” — it’s “log what you’ll actually query,” because a log you never read is pure cost.
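The sampling mitigation can be sketched as a filter for Python's standard logging module. SamplingFilter and its parameters are illustrative assumptions. This version keeps everything at WARNING and above rather than only errors, which is a slightly more conservative choice than the one described above.

```python
import logging
import random
from typing import Optional


class SamplingFilter(logging.Filter):
    """Keep every WARNING-or-higher record; keep routine ones with probability `rate`."""

    def __init__(self, rate: float, rng: Optional[random.Random] = None) -> None:
        super().__init__()
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rate must be between 0.0 and 1.0")
        self.rate = rate
        # An injectable RNG makes the filter reproducible in tests.
        self._rng = rng or random.Random()

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True  # failures and warnings are never sampled away
        # random() returns a float in [0.0, 1.0), so rate=0.0 keeps no
        # routine lines and rate=1.0 keeps all of them.
        return self._rng.random() < self.rate
```

Attached with handler.addFilter(SamplingFilter(0.01)), this keeps about one routine line in a hundred. Because each line is sampled independently, a surviving INFO line may have lost its neighbours from the same request, which matters if you rely on reading a request's routine lines together.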

  1. Why does free-text logging break down at scale, and what specifically does structured (key–value) logging make possible?
  2. What is a log level for, and how does it let you reconcile “enough detail to debug” with “few enough lines to afford”?
  3. Why must logs be shipped off the host immediately, and what role does the local buffer play?
  4. Explain how a correlation ID lets you reconstruct one request’s story from logs written on many machines. Where is it generated, and how does it travel?
  5. Give two concrete reasons over-logging reduces observability, and name two mitigations.