Design a Chat System
A chat system (WhatsApp, Messenger, Slack) delivers messages between users in near-real-time, shows who’s online, and never silently loses a message. Unlike the read-heavy systems we’ve designed so far, chat is defined by stateful, long-lived connections and a hard delivery guarantee: a sent message must reach the recipient (eventually, even if they’re offline) and arrive in a sensible order. The central tension is that the web’s request/response model is pull-based, but chat needs push — and push means holding millions of open connections. We’ll use the framework.
1. Requirements
Section titled “1. Requirements”Functional
- 1:1 messaging and group chat.
- Real-time delivery to online recipients; reliable delivery to offline ones (store-and-forward).
- Presence (online/offline/last-seen) and typing indicators.
- Delivery receipts (sent / delivered / read).
- Message history / persistence.
Non-functional
- Low latency — delivery in well under a second feels “instant.”
- Durability — a sent message must not vanish; at-least-once delivery with dedup.
- Ordering — messages within a conversation appear in a consistent order.
- High availability and horizontal scale to hundreds of millions of concurrent connections.
2. Back-of-envelope estimation
Section titled “2. Back-of-envelope estimation”Assume 500M daily active users, 50M concurrent at peak, 40 messages/user/day.
Messages: 500M × 40 / 86,400 ≈ 230,000 messages/sec (~700,000 peak)Concurrent conns:50M open WebSocket connections at onceConns per box: ~65,000 sockets/server (file-descriptor & memory bound)Servers needed: 50M / 65,000 ≈ ~770 connection serversStorage: 100 bytes/msg × 500M × 40 × 365 ≈ ~730 TB/yearThe headline number isn’t QPS — it’s 50M simultaneous open connections. That single constraint forces a tier of dedicated connection servers and a way to route a message to whichever box holds the recipient’s socket. See real-time: polling, WebSockets, SSE.
3. Transport: why WebSockets
Section titled “3. Transport: why WebSockets”To push a message to an online user, the server needs a channel it can write to at any moment.
Short polling client asks "anything new?" every few seconds → wasteful, laggyLong polling request held open until data or timeout → better, still HTTP overheadSSE server→client stream, one direction → great for feeds, not chat (no upstream)WebSocket full-duplex, persistent TCP → chat's native transport ✓Chat is bidirectional and latency-sensitive, so the answer is WebSocket: one persistent, full-duplex TCP connection per online client. Buys: instant push in both directions with minimal per-message overhead. Costs: the server is now stateful — it holds an open socket per user — which breaks the easy horizontal scaling of stateless app tiers.
4. API / protocol sketch
Section titled “4. API / protocol sketch”WebSocket frames (over a persistent connection): → send_message { conversation_id, client_msg_id, content } ← message { message_id, sender_id, content, seq, ts } ← ack { client_msg_id, message_id, status: "sent" } ← receipt { message_id, status: "delivered" | "read" } ← presence { user_id, status: "online" | "offline" | last_seen }
REST (for non-real-time bits): GET /conversations/{id}/messages?before={seq}&limit=50 ← history / sync POST /conversations ← create groupThe client_msg_id is the idempotency key: the client generates it, so a retried send doesn’t create
a duplicate — the server dedups on it. This is how you get at-least-once + dedup = effectively
exactly-once delivery.
5. Data model
Section titled “5. Data model”messages conversations conversation_members message_id (PK) conv_id (PK) conv_id (PK part) conv_id (partition key) type (1:1 | group) user_id (PK part) seq (per-conv counter) created_at last_read_seq sender_id member_count content created_atMessages are partitioned by conv_id so an entire conversation lives on one shard,
sorted by a per-conversation sequence number seq. That seq is the ordering backbone: assign it
from a per-conversation counter at write time and every client sorts by it, giving a single consistent
order without relying on wall-clock timestamps (which skew across servers). Use a write-optimized store
(Cassandra / HBase) — chat is write-heavy and queried by (conv_id, seq).
6. High-level design
Section titled “6. High-level design” Clients ══WebSocket══► ┌──────────────────┐ │ Connection servers│ (hold the open sockets; stateful) │ (~770 boxes) │ └────────┬─────────┘ │ who holds user X's socket? ▼ ┌─────────────┐ ┌──────────────────┐ ┌──────────────┐ │ Presence svc│◄────────│ Routing layer │───────►│ Message queue│ │ (Redis) │ │ + session store │ │ (durable) │ └─────────────┘ └────────┬─────────┘ └──────┬───────┘ │ ▼ ▼ ┌──────────────┐ deliver to recipient's │ Message store│ connection server ◄────────│ (Cassandra) │ └──────────────┘The flow for sending a message:
- Alice’s client sends a frame over her WebSocket to her connection server.
- The server persists the message (assigning
seq) and writes it to the message store — durability first, so it survives even if delivery fails. - It looks up which connection server holds Bob’s socket via a session registry (a Redis map of
user_id → server). This is the routing problem the connection tier creates. - It forwards the message — often via an internal message queue or pub/sub — to Bob’s connection server, which pushes it down Bob’s socket.
- Bob’s client sends a
deliveredreceipt back up the same path; Alice gets the checkmark.
Deep dives
Section titled “Deep dives”- Offline users: if no socket is registered, skip delivery and rely on sync-on-reconnect, plus a push notification (APNs/FCM) via a notification system.
- Presence at scale: clients send periodic heartbeats; the connection server updates a Redis TTL key. Absence of a heartbeat expires the key → offline. Broadcasting every presence change to every contact is expensive, so fan out presence only to users currently viewing that contact, and batch.
- Group chat: for small groups, fan out the message to each member’s connection server (push). For large groups (Slack channels, broadcast lists), the fan-out is the celebrity problem again — pull on read or fan out via a dedicated group-message pipeline rather than N direct pushes.
- Ordering: the per-conversation
seqguarantees a total order within a conversation; cross- conversation ordering doesn’t matter to users. - Connection rebalancing: when a connection server dies, its ~65,000 clients reconnect (a thundering herd). A load balancer with connection draining and client backoff/jitter spreads the reconnect storm. Protect the reconnect endpoint with rate limiting.
Key trade-offs
Section titled “Key trade-offs”- Stateful connections vs stateless simplicity: WebSockets buy instant push but cost you the routing layer, the session registry, and harder failover.
- Persist-then-deliver vs deliver-then-persist: durability-first adds a write to the hot path but guarantees no message is lost — the right call for chat.
- Push vs pull for groups: push is instant but explodes for huge groups; pull bounds the work at the cost of read-time merging.
- At-least-once + dedup vs exactly-once: true exactly-once is impractical across a network; the
client_msg_ididempotency key gives the same user-visible result far more cheaply.
Check your understanding
Section titled “Check your understanding”- Why are WebSockets the right transport for chat, and what new problem does holding stateful connections create?
- What is the routing problem, and how does a session registry solve “which server holds user X’s socket?”
- Why persist a message before attempting to deliver it? How does store-and-forward serve offline users?
- How does a per-conversation sequence number give consistent ordering without trusting wall clocks?
- Why does large-group chat reduce to the celebrity / fan-out problem, and what are the options?