Partitioning & Sharding

Replication copies your whole dataset onto many machines. That solves availability and read scale, but it does nothing for two problems: a dataset bigger than one machine’s disk, and a write volume bigger than one machine’s CPU. For those you need partitioning (also called sharding): cut the dataset into pieces and put each piece on a different node.

What does partitioning buy us, and what does it cost? It buys near-unlimited horizontal scale — double the nodes, roughly double the capacity. It costs you the simplicity of a single machine: now you must decide which node holds each record, queries that span partitions get harder, and one unlucky partition can become a bottleneck that all your horizontal scale can’t fix.

Replication and partitioning are combined, not alternatives: each partition is itself replicated for survival.

   one big table        →   split into shards   →   each shard replicated
   ┌───────────────┐        ┌────┐ ┌────┐ ┌────┐    ┌────┐+copies  ...
   │  users 1..9M  │   →    │1-3M│ │3-6M│ │6-9M│ →  │1-3M│
   └───────────────┘        └────┘ └────┘ └────┘    └────┘
        (one node)            (three nodes)         (survives node loss)

The shard key: the most important decision

Every partitioned system hangs on one choice: the shard key (partition key) — the column whose value decides which node a record lives on. A good shard key spreads both data and traffic evenly, and matches your most common access pattern so reads usually hit a single shard. A bad shard key creates lopsided nodes and cross-shard queries that touch everything. There are two main ways to map a key to a partition.

Range partitioning

Assign contiguous ranges of the key to each partition: users A–F here, G–M there, N–Z over there; or timestamps Jan–Mar, Apr–Jun, and so on.

   key = last_name
   ┌──────────┐ ┌──────────┐ ┌──────────┐
   │  A – F   │ │  G – M   │ │  N – Z   │
   └──────────┘ └──────────┘ └──────────┘

Buys: efficient range scans. “All orders from last week” reads one or two adjacent partitions, because related keys sit together. Keys stay sorted within a partition.
Costs: hotspots. If you range-partition by timestamp, today’s partition takes every new write while the rest sit idle — a moving hot node. Sequential keys (auto-increment IDs) have the same disease: all new data piles onto the last partition.

Hash partitioning

Run the key through a hash function and assign by the hash value (often hash(key) mod N, or a slice of the hash range). The hash scatters even sequential keys uniformly.

   partition = hash(key) mod N
   user_42  → hash → 0x9f.. → partition 1
   user_43  → hash → 0x2a.. → partition 0   (neighbors land far apart — that's the point)

Buys: even distribution. Hashing destroys the ordering that causes hotspots, so writes and data spread uniformly even for sequential or skewed keys.
Costs: you lose efficient range scans — adjacent keys now live on random partitions, so “all orders last week” must query every partition (a scatter-gather).

Rebalancing without a stampede

When you add nodes (or one fills up), data must rebalance to the new layout. The naive hash(key) mod N has a brutal flaw: change N from 4 to 5 and almost every key remaps to a different node — a near-total reshuffle that saturates the network and tanks the cache.

The standard fixes:

Consistent hashing places nodes and keys on a ring so adding a node moves only the keys in one arc, not everything. Read that page for the full mechanism.
Fixed partition count — create many more partitions than nodes up front (say 1024 partitions on 4 nodes). Adding a node just reassigns whole partitions, never re-splits keys. This is how Elasticsearch and others keep rebalancing cheap and predictable.

Either way, the goal is the same: move as little data as possible when the cluster changes shape.

Hotspots: when one shard ruins the party

Even with hash partitioning, a single key can be hammered far more than others — a celebrity’s profile, a viral post, a flash-sale product. All requests for that key hit one partition, and no amount of total cluster capacity helps, because a single key can’t be split across nodes by a shard key alone.

   Normal:    requests spread evenly across shards   → all nodes ~equal
   Hot key:   90% of reads hit ONE key on ONE shard  → that node melts, others idle

This is the celebrity problem, and partitioning alone cannot solve it. The escapes — caching the hot key, adding a random suffix to split it, or replicating it specially — are the subject of Hot Partitions & the Celebrity Problem. The lesson here: a perfectly balanced partitioning scheme can still be wrecked by skewed access, and that skew often comes from the real world, not your code.

The thread

Partitioning is how data systems escape the limits of a single machine, scaling storage and write throughput by spreading records across nodes. The price is a web of new decisions: the shard key governs everything; range partitioning gives you cheap scans but courts hotspots; hash partitioning spreads load but kills ranges; and the cluster must rebalance without melting the network when it grows. Get the shard key right and you scale almost linearly; get it wrong — or get a celebrity key — and one overloaded shard turns all that horizontal capacity into wasted iron.

Check your understanding

Why doesn’t replication solve the problems partitioning solves? How are the two combined?
What makes a good shard key? Give one property that matters for data and one for traffic.
Contrast range and hash partitioning by what each buys and what each costs.
Why is naive hash(key) mod N bad for rebalancing, and what two techniques fix it?
Why can’t a shard key alone solve the celebrity/hot-key problem, even with perfect balancing?