Skip to content

Indexing (B-Tree vs LSM)

Without an index, finding a row means scanning the whole table — checking every record until you find the one you want. That’s O(n), and on a billion rows it’s a coffee break. An index is a second, specially-organized copy of some of your data that turns that scan into a fast lookup. It’s the single biggest lever you have over database read performance.

But there is no free lunch, and indexing states the central trade-off of this whole part with unusual clarity: an index buys faster reads and charges you in slower writes and extra storage. Every write must now update the table and every index on it. The art of indexing is spending that write tax only where the read speedup is worth it.

Why an index speeds reads (and slows writes)

Section titled “Why an index speeds reads (and slows writes)”

Think of a book’s index. To find every mention of “Paxos,” you don’t read all 400 pages — you flip to the back, find “Paxos → p. 211, 230,” and jump. The index is a sorted structure that maps a key (the value you search by) to a location (where the full row lives).

Without index: SELECT * WHERE email = 'x' → scan all N rows (slow, O(N))
With index: email index → row location → one lookup (fast, O(log N))

The cost shows up on the other side. Every INSERT, UPDATE, or DELETE must keep that sorted structure correct, which means extra work and extra disk. Index a column you never search by and you’ve bought pure overhead. This is why “add an index” is the first tuning move and “you have too many indexes” is a real performance bug.

Family 1: the B-tree (read-optimized, update-in-place)

Section titled “Family 1: the B-tree (read-optimized, update-in-place)”

The B-tree (specifically the B+tree) has been the default index structure since the 1970s and powers Postgres, MySQL/InnoDB, SQL Server, and most relational databases. It keeps keys sorted in a shallow, wide tree of fixed-size pages (typically 4–16 KB), so any key is reachable in a handful of disk hops.

┌───────────────────────┐
│ [ 30 | 60 ] │ root (1 page)
└────┬───────┬───────┬───┘
┌─────────────┘ │ └─────────────┐
┌─────▼─────┐ ┌─────▼─────┐ ┌─────▼─────┐
│ [10 | 20] │ │ [40 | 50] │ │ [70 | 80] │ internal
└──┬──┬───┬─┘ └──┬──┬───┬─┘ └──┬──┬───┬─┘
▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼
┌──────────────────────────────────────────────────────┐
│ leaf pages: sorted keys → row pointers, linked left→right │
└──────────────────────────────────────────────────────┘

To find key 50, you start at the root (50 is between 30 and 60), descend to the right internal node, then to the leaf. Three page reads, billion rows or not — the tree is balanced and only a few levels deep. The leaves are linked in order, so range scans (“all keys between 40 and 70”) are cheap: find the start, walk sideways.

The defining trait: B-trees update in place. Changing a key overwrites the page where it lives. That makes reads predictable and ranges fast, but it means every write seeks to a specific spot on disk and rewrites a whole page — random I/O, which spinning disks hate and even SSDs prefer to avoid.

Family 2: the LSM-tree (write-optimized, append + compact)

Section titled “Family 2: the LSM-tree (write-optimized, append + compact)”

The Log-Structured Merge-tree flips the priority to make writes cheap. It’s behind Cassandra, RocksDB, LevelDB, HBase, and the storage engines under many modern key-value stores.

The trick: never seek to update in place; only ever append.

write → [ MemTable ] in-memory, sorted (fast)
│ when full, flush to disk as an immutable file
[ SSTable L0 ] [ SSTable L0 ] ... sorted files, never modified
│ background COMPACTION merges + sorts them
[ SSTable L1 (bigger, merged) ] → [ L2 ] → ...

Writes go to an in-memory sorted table (the memtable) plus an append-only log for durability — both sequential, both blazing fast. When the memtable fills, it’s flushed to disk as an immutable sorted file (an SSTable). Over time you accumulate many SSTables, so a background process called compaction merges and re-sorts them into fewer, larger files, discarding overwritten and deleted keys along the way.

The cost lands on reads. A key might live in the memtable, or any of several SSTables, so a read may have to check multiple files. LSM engines fight this with Bloom filters (a probabilistic “is this key definitely not here?” check that skips most files) and by keeping files sorted, but reads still do more work than a B-tree’s tidy descent.

The honest way to compare these families is through three “amplification” costs:

B-treeLSM-tree
Read amplification (extra reads per lookup)Low — one path down the treeHigher — may check several SSTables
Write amplification (extra writes per write)Page rewrites; moderateCompaction rewrites data many times
Space amplification (extra disk per byte)Some empty page spaceStale copies until compaction reclaims them

Covering indexes: skipping the table entirely

Section titled “Covering indexes: skipping the table entirely”

A normal index gets you the location of the row, then a second read fetches the row. A covering index includes the actual columns you need, so the query is answered from the index alone — the table is never touched.

Query: SELECT name FROM users WHERE email = ?
Plain index on (email): email → row location → read table → name (2 hops)
Covering index (email, name): email → name (right there in the index) (1 hop)

This buys a real read speedup for hot queries — at the cost of a fatter index (more storage, more write tax, because the included columns must be kept in sync). Like every indexing decision, it’s the same bargain in miniature: pay on write to win on read.

An index is a deliberate redundancy: a second copy of your data, organized for fast retrieval, kept in sync on every write. B-trees organize it for reading (sorted, update-in-place, great ranges); LSM-trees organize it for writing (append-only, compacted in the background). Covering indexes push the bargain further by answering queries without touching the table at all. In every case you are spending writes and storage to buy reads — so index the columns you search by, not the ones you merely store.

  1. Why does adding an index speed up reads but slow down writes? What exactly happens on each write?
  2. Walk through how a B-tree finds a key, and explain why range scans are cheap on it.
  3. What is the core trick of an LSM-tree, and what is compaction for?
  4. Define read, write, and space amplification. For each index family, which one is the weak spot?
  5. What is a covering index, and how does it restate the read-vs-write trade-off?