Databases: A Field Guide

A database is where your system’s truth lives — the durable record that survives restarts, crashes, and deploys. But “database” is not one thing. Over decades, different data shapes and access patterns produced different families of database, each making a deliberate trade to be excellent at something and mediocre at the rest. Choosing the wrong family is one of the most expensive mistakes in system design, because data is sticky: by the time you know it’s wrong, terabytes have piled up in the wrong shape.

This page is a map of the families and the question that selects each one: what is the shape of my data, and how will I query it? The deeper relational-vs-everything-else debate lives in SQL vs NoSQL; here we build the field guide.

What does choosing the right database buy us, and what does it cost? The right family buys natural modeling, fast queries on your actual access patterns, and the right consistency guarantees. The cost of every choice is the things that family gives up — there is no database that is best at everything, and any vendor who claims otherwise is selling you future pain.

The first split: relational vs the rest

For decades the relational database (SQL) was the only serious answer, and it’s still the right default for most systems. Everything else, the “NoSQL” families, emerged to handle a specific case where the relational model strained: raw key-lookup speed, a flexible schema, extreme write scale, a graph-shaped query, or a firehose of timestamped data. So read the rest of this page as “here is the relational default, and here are the five shapes that justify reaching past it.”

The six families

   RELATIONAL    rows & columns, joins, ACID           →  the default; structured, related data
   KEY-VALUE     key → opaque blob                     →  fastest possible lookups by key
   DOCUMENT      key → rich JSON-like document         →  flexible schema, nested objects
   WIDE-COLUMN   row key → sparse columns, huge scale  →  massive write/read at scale
   GRAPH         nodes & edges                         →  relationships ARE the query
   TIME-SERIES   timestamp → measurements              →  append-only metrics/events over time

Relational (PostgreSQL, MySQL)

Data lives in tables of rows and columns with a fixed schema; tables relate via foreign keys and are combined at query time with joins. Relational databases give you ACID transactions (atomic, consistent, isolated, durable) and a powerful, declarative query language (SQL). They are the right choice whenever your data is structured and related, you need correctness guarantees (money, inventory, anything that must never be half-updated), and your access patterns aren’t all known in advance — SQL lets you ask new questions later. The classic cost is that scaling writes horizontally is hard, and a rigid schema can chafe against fast-changing data.
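
A minimal sketch of the two properties described above, an atomic transaction and a join written after the fact, using Python's built-in sqlite3 module as a stand-in for a relational server. The accounts and transfers tables, their columns, and the transfer helper are assumptions made up for illustration.

```python
import sqlite3

# In-memory database; the schema below is hypothetical, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
    CREATE TABLE accounts (
        id      INTEGER PRIMARY KEY,
        owner   TEXT NOT NULL,
        balance INTEGER NOT NULL CHECK (balance >= 0)
    );
    CREATE TABLE transfers (
        id      INTEGER PRIMARY KEY,
        from_id INTEGER NOT NULL REFERENCES accounts(id),
        to_id   INTEGER NOT NULL REFERENCES accounts(id),
        amount  INTEGER NOT NULL
    );
    INSERT INTO accounts (id, owner, balance) VALUES (1, 'alice', 100), (2, 'bob', 20);
""")


def transfer(conn, from_id, to_id, amount):
    """Move money atomically: all three statements commit, or none do."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                         (amount, to_id))
            # Violates the CHECK constraint if the sender cannot cover the amount.
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                         (amount, from_id))
            conn.execute("INSERT INTO transfers (from_id, to_id, amount) VALUES (?, ?, ?)",
                         (from_id, to_id, amount))
        return True
    except sqlite3.IntegrityError:
        return False


first_ok = transfer(conn, 1, 2, 30)    # alice -> bob, succeeds
second_ok = transfer(conn, 2, 1, 500)  # bob cannot cover it; the credit is rolled back
balances = dict(conn.execute("SELECT owner, balance FROM accounts"))

# A question nobody planned for when the schema was written: total sent per owner.
sent = conn.execute("""
    SELECT a.owner, SUM(t.amount)
    FROM transfers AS t
    JOIN accounts AS a ON a.id = t.from_id
    GROUP BY a.owner
""").fetchall()
```

The failed transfer credits the receiver before the debit is rejected, so the unchanged balances afterwards show the rollback rather than a half-applied update.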

Key-value (Redis, DynamoDB, Memcached)

The simplest model: a dictionary. You store a value under a key and get it back by that key — nothing else. Because the model is so constrained, key-value stores can be blisteringly fast and trivially partitioned by key. They’re ideal for caches, session stores, feature flags, and any “look it up by ID” workload. The cost is the flip side of the speed: you generally can’t query by value or do joins — if you don’t have the key, the store can’t help you.
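
To make the "trivially partitioned by key" point concrete, here is a toy sketch in which plain Python dicts stand in for storage nodes. The PartitionedKV class and its method names are hypothetical, and real stores use more careful schemes (such as consistent hashing) so that adding a node does not reshuffle every key.

```python
import hashlib


class PartitionedKV:
    """A toy key-value store split across N in-process 'nodes' (plain dicts)."""

    def __init__(self, num_partitions=4):
        self.partitions = [{} for _ in range(num_partitions)]

    def _partition_for(self, key):
        # A stable hash of the key alone decides which partition owns it.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.partitions)

    def put(self, key, value):
        self.partitions[self._partition_for(key)][key] = value

    def get(self, key, default=None):
        # One hash, one partition, one dictionary lookup.
        return self.partitions[self._partition_for(key)].get(key, default)

    def find_by_value(self, value):
        # No index on values exists, so the only option is a scan of every partition.
        return [k for part in self.partitions for k, v in part.items() if v == value]


store = PartitionedKV(num_partitions=4)
store.put("session:42", {"user": "ana"})
store.put("flag:dark-mode", True)
session = store.get("session:42")
missing = store.get("session:999")
```

The contrast between get and find_by_value is the trade in miniature: lookups by key touch one partition, while anything keyed on the value has to visit all of them.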

Document (MongoDB, Couchbase)

A document store keeps self-contained, nested documents (typically JSON) under a key, and — unlike pure key-value — it can index and query the fields inside the document. The schema is flexible: different documents in the same collection can have different fields, which suits evolving data and aggregates that are naturally read together (a whole order with its line items in one document). The cost is that data spanning many documents (joins across entities) is awkward, and the flexible schema pushes the discipline of consistency onto your application code.
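
The difference from pure key-value can be sketched in a few lines: documents are still stored under a key, but the store also maintains indexes on fields inside them. The Collection class, the dotted-path syntax, and the sample orders are assumptions for illustration, not any particular product's API; the sketch is insert-only and does not handle updates.

```python
from collections import defaultdict


def values_at(doc, path):
    """Yield every value at a dotted path, fanning out over nested lists."""
    current = [doc]
    for part in path.split("."):
        next_level = []
        for node in current:
            items = node if isinstance(node, list) else [node]
            for item in items:
                if isinstance(item, dict) and part in item:
                    next_level.append(item[part])
        current = next_level
    for node in current:
        if isinstance(node, list):
            yield from node
        else:
            yield node


class Collection:
    """Documents stored by id, plus secondary indexes on chosen inner fields."""

    def __init__(self, indexed_paths):
        self.docs = {}
        self.indexes = {path: defaultdict(set) for path in indexed_paths}

    def insert(self, doc_id, doc):
        self.docs[doc_id] = doc
        for path, index in self.indexes.items():
            for value in values_at(doc, path):
                index[value].add(doc_id)

    def find(self, path, value):
        ids = self.indexes[path].get(value, set())
        return [self.docs[doc_id] for doc_id in sorted(ids)]


orders = Collection(indexed_paths=["customer.city", "items.sku"])
orders.insert("o1", {
    "customer": {"name": "Ana", "city": "Oslo"},
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
})
# A second document with a different shape: no customer name, an extra field.
orders.insert("o2", {
    "customer": {"city": "Lima"},
    "items": [{"sku": "B7", "qty": 5}],
    "gift_note": "Happy birthday",
})
with_b7 = orders.find("items.sku", "B7")
in_oslo = orders.find("customer.city", "Oslo")
```

Each order is read back as one unit with its line items, and nothing stops the second document from carrying a field the first one lacks; keeping those shapes compatible is left to the application.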

Wide-column (Cassandra, HBase, Bigtable)

Despite looking table-like, this family is really a giant, sparse, distributed map: a row key maps to a set of columns, and rows can have wildly different columns. It’s built for enormous scale and high write throughput across many machines, with tunable consistency. The defining constraint: you must design your tables around your queries up front, because there are no flexible joins or ad-hoc queries — the data layout is the query plan. Reach for it when you have write-heavy, massive-scale data with known access patterns (event logs, time-stamped activity at billions of rows).
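
A rough sketch of "the data layout is the query plan", assuming the one query we need is "a user's activity within a time range". The table is a map from row key (user id) to cells kept sorted by timestamp, so that query becomes a single contiguous slice. The ActivityByUser class and its column names are hypothetical and model only the layout, not distribution or tunable consistency.

```python
import bisect
from collections import defaultdict


class ActivityByUser:
    """row key (user_id) -> cells sorted by timestamp; columns vary per cell."""

    def __init__(self):
        # Two parallel lists per row: sorted timestamps and their column maps.
        self.rows = defaultdict(lambda: ([], []))

    def write(self, user_id, ts, **columns):
        timestamps, cells = self.rows[user_id]
        i = bisect.bisect_right(timestamps, ts)
        timestamps.insert(i, ts)
        cells.insert(i, columns)

    def read_range(self, user_id, start_ts, end_ts):
        """Return cells with start_ts <= ts < end_ts for one row key."""
        row = self.rows.get(user_id)
        if row is None:
            return []
        timestamps, cells = row
        lo = bisect.bisect_left(timestamps, start_ts)
        hi = bisect.bisect_left(timestamps, end_ts)
        return list(zip(timestamps[lo:hi], cells[lo:hi]))


table = ActivityByUser()
table.write("u1", 100, action="login")
table.write("u1", 300, action="purchase", amount_cents=1999)
table.write("u1", 200, action="view", page="/pricing")  # arrives out of order
table.write("u2", 150, action="login")
recent = table.read_range("u1", 100, 300)
```

The query this layout cannot serve is just as instructive: "all purchases across all users" has no row key to start from, so it would need a second table designed around that question.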

Graph (Neo4j, Amazon Neptune)

Here the relationships are first-class: data is nodes connected by edges, and queries traverse those connections (“friends of friends who like X,” “the shortest path of approvals”). In a relational database, such queries become deep, slow, recursive joins; a graph database makes traversal the native, fast operation. Use it when the connections are the point: social graphs, recommendation networks, fraud rings, dependency graphs. The cost is that it’s a specialized tool, weaker for plain tabular reporting and harder to scale horizontally than simpler models, because a densely connected graph is difficult to partition across machines.
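
The two example traversals can be sketched over an in-memory adjacency map, which is roughly the access pattern a graph database makes native: each hop follows stored edges from a node instead of joining a table against itself. The sample graph and both function names are made up for illustration.

```python
from collections import deque

# Undirected friendship graph as an adjacency map (hypothetical sample data).
graph = {
    "ana": {"bo", "cy"},
    "bo":  {"ana", "dee"},
    "cy":  {"ana", "dee", "eli"},
    "dee": {"bo", "cy", "fay"},
    "eli": {"cy"},
    "fay": {"dee"},
}


def friends_of_friends(graph, person):
    """People exactly two hops away: friends of friends who are not already friends."""
    direct = graph.get(person, set())
    two_hops = set()
    for friend in direct:
        two_hops |= graph.get(friend, set())
    return two_hops - direct - {person}


def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path as a list, or None."""
    if start == goal:
        return [start]
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor in parents:
                continue
            parents[neighbor] = node
            if neighbor == goal:
                # Walk the parent links back to the start, then reverse.
                path = [neighbor]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


fof = friends_of_friends(graph, "ana")
path = shortest_path(graph, "ana", "fay")
```

In SQL the same friends-of-friends question is a self-join per hop (or a recursive query), and the cost grows with each additional hop; here each hop is a lookup of a node's neighbors.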

Time-series (InfluxDB, TimescaleDB, Prometheus)

Optimized for data that is append-only and timestamped: metrics, sensor readings, events. These databases exploit the fact that writes arrive roughly in time order and old data is queried in ranges and rolled up — so they compress aggressively and make “average CPU per minute over the last 7 days” cheap. Use them for monitoring, IoT, and analytics over time. The cost is narrowness: they’re great at time-range queries and poor at general-purpose relational work.
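
Two of the ideas above fit in a short sketch: rolling raw points up into per-minute averages, and why in-order timestamps compress well. Both functions are simplified illustrations with hypothetical names; real engines use more elaborate encodings and typically maintain roll-ups incrementally.

```python
from collections import defaultdict


def rollup_avg(points, bucket_seconds=60):
    """Average (timestamp, value) points into fixed-width time buckets.

    Timestamps are assumed to be integer seconds; each bucket is keyed by
    its start time.
    """
    sums = defaultdict(lambda: [0.0, 0])  # bucket start -> [total, count]
    for ts, value in points:
        bucket = ts - ts % bucket_seconds
        acc = sums[bucket]
        acc[0] += value
        acc[1] += 1
    return {bucket: total / count for bucket, (total, count) in sorted(sums.items())}


def delta_encode(timestamps):
    """Store the first timestamp, then only the gap to each following one."""
    if not timestamps:
        return []
    return [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]


def delta_decode(deltas):
    """Invert delta_encode by accumulating the gaps."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


cpu = [(0, 10.0), (30, 20.0), (60, 40.0), (119, 60.0), (120, 5.0)]
per_minute = rollup_avg(cpu)

encoded = delta_encode([1000, 1010, 1020, 1030])
```

Regularly spaced samples turn into a run of small, repeated gaps after delta encoding, which is the kind of input general-purpose compression handles very well.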

Choosing: a decision sketch

   Is the data highly structured, AND do you need strong correctness (transactions)?
        └─ RELATIONAL (start here; the default)

   Pure lookup by key, need raw speed (cache/session)?
        └─ KEY-VALUE

   Flexible, nested objects you read as a unit, schema in flux?
        └─ DOCUMENT

   Write-heavy at massive scale, queries known in advance?
        └─ WIDE-COLUMN

   The relationships themselves are what you query?
        └─ GRAPH

   A firehose of timestamped measurements?
        └─ TIME-SERIES
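
The sketch above can also be read as an ordered checklist that falls back to the relational default when nothing else matches. The Workload fields and the suggest_family function below are hypothetical names chosen for illustration, and a real decision weighs more than six booleans.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical yes/no answers to the questions in the decision sketch."""
    needs_transactions: bool = False
    lookup_by_key_only: bool = False
    nested_flexible_documents: bool = False
    massive_writes_known_queries: bool = False
    relationship_traversal: bool = False
    timestamped_firehose: bool = False


def suggest_family(w):
    # Questions are checked in the same order as the sketch.
    if w.needs_transactions:
        return "RELATIONAL"
    if w.lookup_by_key_only:
        return "KEY-VALUE"
    if w.nested_flexible_documents:
        return "DOCUMENT"
    if w.massive_writes_known_queries:
        return "WIDE-COLUMN"
    if w.relationship_traversal:
        return "GRAPH"
    if w.timestamped_firehose:
        return "TIME-SERIES"
    # No specialized shape identified: stay with the default.
    return "RELATIONAL"


default_choice = suggest_family(Workload())
graph_choice = suggest_family(Workload(relationship_traversal=True))
```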

Check your understanding

Why is choosing the wrong database family so much more expensive than most other early mistakes?
What is the single defining capability of a document store that a pure key-value store lacks?
Why must you design wide-column tables around your queries up front, and what does that buy you?
Give a query that is slow in a relational database but native and fast in a graph database, and explain why.
Why is “relational by default, specialize with evidence” good advice, and what is the hidden cost of polyglot persistence?