Part 2 · Data: Storage & Retrieval

Code is liquid; data is rock. You can rewrite a service over a weekend, swap a framework, even change languages — and as long as the new code speaks the same protocols, nobody outside notices. Data is not like that. The schema you pick, the database engine you choose, the way you spread records across machines: these decisions calcify. They get baked into every query, every migration, every backup, every downstream consumer. Changing them later means moving live data while the system keeps serving traffic — surgery on a running patient.

That is why this part of the book exists as its own pillar. The question it keeps asking is: what does a storage decision buy us, and what does it cost? The buy is usually performance, scale, or a guarantee you can rely on. The cost is almost always flexibility you’ll wish you had later. These trade-offs will never be cheaper to get right than they are at the start.

Data outlives code

Three forces make storage choices uniquely sticky:

Volume. Once you have a terabyte of production data, “let’s just re-import it differently” is a multi-day operation with downtime risk, not a refactor.
Coupling. Every reader and writer encodes assumptions about your schema. A relational table with twelve consumers has twelve places that break when you rename a column.
Correctness. Code bugs corrupt a request; data bugs corrupt the record of truth. A bad write can outlive the developer who shipped it, silently poisoning reports and decisions for years.
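
The coupling point can be seen in miniature with a small sketch. It assumes an in-memory SQLite database and a made-up `users` table; `consumer_report` is a hypothetical stand-in for any downstream reader that hard-codes a column name. The rename is done by rebuilding the table so the sketch runs on older SQLite builds too.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("INSERT INTO users (email) VALUES ('a@example.com')")
conn.commit()


def consumer_report(conn):
    # Hypothetical downstream consumer: it bakes the column name into its query.
    return [row[0] for row in conn.execute("SELECT email FROM users")]


before = consumer_report(conn)  # works against the original schema

# The "small" schema change: rename email -> email_address by rebuilding the table.
conn.executescript(
    """
    CREATE TABLE users_new (id INTEGER PRIMARY KEY, email_address TEXT);
    INSERT INTO users_new (id, email_address) SELECT id, email FROM users;
    DROP TABLE users;
    ALTER TABLE users_new RENAME TO users;
    """
)

# The data survived the migration, but the consumer's assumption did not.
try:
    consumer_report(conn)
    broke = False
    error_message = ""
except sqlite3.OperationalError as exc:
    broke = True
    error_message = str(exc)

print(before, broke, error_message)
```

One consumer breaking is easy to fix. Twelve consumers owned by different teams are what turn a rename into a coordinated migration.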

The questions this part answers

Every storage system, from a single Postgres box to a globe-spanning datastore, is really a stack of answers to five recurring questions. This part walks them in dependency order, spread across six pages.

   DATA MODEL      How do I shape records?         → SQL vs NoSQL, Data Modeling
        │
   INDEXING        How do I find them fast?        → B-Tree vs LSM
        │
   REPLICATION     How do I survive a dead node?   → leaders, followers, lag
        │
   PARTITIONING    How do I outgrow one machine?   → range/hash, shard keys
        │
   TRANSACTIONS    How do I keep it correct?       → ACID, isolation, 2PC
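
As a small preview of the PARTITIONING row, here is a minimal sketch of the two routing strategies it names. The shard count, the range boundaries, and the function names are all assumptions made up for illustration; real systems add rebalancing, replication, and much more.

```python
import bisect
import hashlib

NUM_SHARDS = 4  # hypothetical cluster size


def hash_shard(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Hash partitioning: spread keys evenly, at the cost of losing key order."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Range partitioning: shard i holds keys below RANGE_BOUNDS[i];
# the last shard holds everything from the final boundary upward.
RANGE_BOUNDS = ["g", "n", "t"]  # hypothetical split points


def range_shard(key: str) -> int:
    """Range partitioning: neighbouring keys land on the same shard."""
    return bisect.bisect_right(RANGE_BOUNDS, key.lower())


for name in ["alice", "bob", "mallory", "zoe"]:
    print(name, "range ->", range_shard(name), "hash ->", hash_shard(name))
```

Range routing keeps adjacent keys together, which helps range scans but can concentrate load on one shard. Hash routing spreads load but scatters keys that sort next to each other.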

Roadmap

SQL vs NoSQL — the first fork. Relational guarantees (schema, joins, ACID) versus the flexibility and scale of non-relational stores. The real axis isn’t “SQL good, NoSQL bad” — it’s your access patterns and consistency needs.
Indexing (B-Tree vs LSM) — why an index makes reads fast and writes slower, and the two great families of index structure that power nearly every database alive.
Replication — copying data to many machines for availability and read throughput, and the consistency bill that copying always sends you.
Partitioning & Sharding — splitting one dataset across many machines once it no longer fits (or no longer keeps up) on one.
Transactions & ACID — the guarantees that let you treat a pile of writes as one indivisible, correct unit, and why those guarantees get expensive across machines.
Data Modeling — the craft that ties it together: normalization versus denormalization, and modeling for how you’ll read the data, not just how it’s shaped.
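
To make the indexing item in this roadmap a little more concrete, the sketch below asks SQLite how it plans to run the same query before and after an index exists. The `orders` table, its columns, and the index name are hypothetical; the exact plan wording varies between SQLite versions, so the code only prints it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)


def query_plan(conn):
    # EXPLAIN QUERY PLAN describes the strategy without running the query;
    # the human-readable detail is the last column of each row.
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = ?", (42,)
    ).fetchall()
    return " | ".join(row[-1] for row in rows)


plan_without_index = query_plan(conn)  # expect a full table scan

conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_with_index = query_plan(conn)  # expect a search using the new index

print("before:", plan_without_index)
print("after: ", plan_with_index)
```

The read gets cheaper because the engine can jump straight to the matching rows. The price is that every later insert or update to `orders` must also maintain `idx_orders_customer`.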

How this connects to the rest of the book

Data sits between the building blocks and the hard distributed-systems realities. If you want the broad survey of engines before going deep, start with the Databases field guide. When replication and partitioning force you to confront what “the truth” even means across machines, you’ll lean on Consistency Models and Consensus (Raft/Paxos). And when a single shard gets hammered, Hot Partitions & the Celebrity Problem waits at the end of the road.

The thread running through all six pages: distribution buys you scale and survival, and charges you in consistency, complexity, and operational pain. Every page is a different invoice for that same purchase.

Check your understanding

Why is a storage decision typically harder to reverse than a code decision? Name the three forces.
What is the “reversibility test,” and which storage decisions tend to fail it?
Put the five questions (data model, indexing, replication, partitioning, transactions) in dependency order and explain why that order makes sense.
In one sentence, what does distribution buy you across this whole part, and what does it cost?
Which page would you reach for first if your single database server is running out of disk, and why?