Skip to content

Design a Chat System

A chat system (WhatsApp, Messenger, Slack) delivers messages between users in near-real-time, shows who’s online, and never silently loses a message. Unlike the read-heavy systems we’ve designed so far, chat is defined by stateful, long-lived connections and a hard delivery guarantee: a sent message must reach the recipient (eventually, even if they’re offline) and arrive in a sensible order. The central tension is that the web’s request/response model is pull-based, but chat needs push — and push means holding millions of open connections. We’ll use the framework.

Functional

  • 1:1 messaging and group chat.
  • Real-time delivery to online recipients; reliable delivery to offline ones (store-and-forward).
  • Presence (online/offline/last-seen) and typing indicators.
  • Delivery receipts (sent / delivered / read).
  • Message history / persistence.

Non-functional

  • Low latency — delivery in well under a second feels “instant.”
  • Durability — a sent message must not vanish; at-least-once delivery with dedup.
  • Ordering — messages within a conversation appear in a consistent order.
  • High availability and horizontal scale to hundreds of millions of concurrent connections.

Assume 500M daily active users, 50M concurrent at peak, 40 messages/user/day.

Messages: 500M × 40 / 86,400 ≈ 230,000 messages/sec (~700,000 peak)
Concurrent conns:50M open WebSocket connections at once
Conns per box: ~65,000 sockets/server (file-descriptor & memory bound)
Servers needed: 50M / 65,000 ≈ ~770 connection servers
Storage: 100 bytes/msg × 500M × 40 × 365 ≈ ~730 TB/year

The headline number isn’t QPS — it’s 50M simultaneous open connections. That single constraint forces a tier of dedicated connection servers and a way to route a message to whichever box holds the recipient’s socket. See real-time: polling, WebSockets, SSE.

To push a message to an online user, the server needs a channel it can write to at any moment.

Short polling client asks "anything new?" every few seconds → wasteful, laggy
Long polling request held open until data or timeout → better, still HTTP overhead
SSE server→client stream, one direction → great for feeds, not chat (no upstream)
WebSocket full-duplex, persistent TCP → chat's native transport ✓

Chat is bidirectional and latency-sensitive, so the answer is WebSocket: one persistent, full-duplex TCP connection per online client. Buys: instant push in both directions with minimal per-message overhead. Costs: the server is now stateful — it holds an open socket per user — which breaks the easy horizontal scaling of stateless app tiers.

WebSocket frames (over a persistent connection):
→ send_message { conversation_id, client_msg_id, content }
← message { message_id, sender_id, content, seq, ts }
← ack { client_msg_id, message_id, status: "sent" }
← receipt { message_id, status: "delivered" | "read" }
← presence { user_id, status: "online" | "offline" | last_seen }
REST (for non-real-time bits):
GET /conversations/{id}/messages?before={seq}&limit=50 ← history / sync
POST /conversations ← create group

The client_msg_id is the idempotency key: the client generates it, so a retried send doesn’t create a duplicate — the server dedups on it. This is how you get at-least-once + dedup = effectively exactly-once delivery.

messages conversations conversation_members
message_id (PK) conv_id (PK) conv_id (PK part)
conv_id (partition key) type (1:1 | group) user_id (PK part)
seq (per-conv counter) created_at last_read_seq
sender_id member_count
content
created_at

Messages are partitioned by conv_id so an entire conversation lives on one shard, sorted by a per-conversation sequence number seq. That seq is the ordering backbone: assign it from a per-conversation counter at write time and every client sorts by it, giving a single consistent order without relying on wall-clock timestamps (which skew across servers). Use a write-optimized store (Cassandra / HBase) — chat is write-heavy and queried by (conv_id, seq).

Clients ══WebSocket══► ┌──────────────────┐
│ Connection servers│ (hold the open sockets; stateful)
│ (~770 boxes) │
└────────┬─────────┘
│ who holds user X's socket?
┌─────────────┐ ┌──────────────────┐ ┌──────────────┐
│ Presence svc│◄────────│ Routing layer │───────►│ Message queue│
│ (Redis) │ │ + session store │ │ (durable) │
└─────────────┘ └────────┬─────────┘ └──────┬───────┘
│ ▼
▼ ┌──────────────┐
deliver to recipient's │ Message store│
connection server ◄────────│ (Cassandra) │
└──────────────┘

The flow for sending a message:

  1. Alice’s client sends a frame over her WebSocket to her connection server.
  2. The server persists the message (assigning seq) and writes it to the message store — durability first, so it survives even if delivery fails.
  3. It looks up which connection server holds Bob’s socket via a session registry (a Redis map of user_id → server). This is the routing problem the connection tier creates.
  4. It forwards the message — often via an internal message queue or pub/sub — to Bob’s connection server, which pushes it down Bob’s socket.
  5. Bob’s client sends a delivered receipt back up the same path; Alice gets the checkmark.
  • Offline users: if no socket is registered, skip delivery and rely on sync-on-reconnect, plus a push notification (APNs/FCM) via a notification system.
  • Presence at scale: clients send periodic heartbeats; the connection server updates a Redis TTL key. Absence of a heartbeat expires the key → offline. Broadcasting every presence change to every contact is expensive, so fan out presence only to users currently viewing that contact, and batch.
  • Group chat: for small groups, fan out the message to each member’s connection server (push). For large groups (Slack channels, broadcast lists), the fan-out is the celebrity problem again — pull on read or fan out via a dedicated group-message pipeline rather than N direct pushes.
  • Ordering: the per-conversation seq guarantees a total order within a conversation; cross- conversation ordering doesn’t matter to users.
  • Connection rebalancing: when a connection server dies, its ~65,000 clients reconnect (a thundering herd). A load balancer with connection draining and client backoff/jitter spreads the reconnect storm. Protect the reconnect endpoint with rate limiting.
  • Stateful connections vs stateless simplicity: WebSockets buy instant push but cost you the routing layer, the session registry, and harder failover.
  • Persist-then-deliver vs deliver-then-persist: durability-first adds a write to the hot path but guarantees no message is lost — the right call for chat.
  • Push vs pull for groups: push is instant but explodes for huge groups; pull bounds the work at the cost of read-time merging.
  • At-least-once + dedup vs exactly-once: true exactly-once is impractical across a network; the client_msg_id idempotency key gives the same user-visible result far more cheaply.
  1. Why are WebSockets the right transport for chat, and what new problem does holding stateful connections create?
  2. What is the routing problem, and how does a session registry solve “which server holds user X’s socket?”
  3. Why persist a message before attempting to deliver it? How does store-and-forward serve offline users?
  4. How does a per-conversation sequence number give consistent ordering without trusting wall clocks?
  5. Why does large-group chat reduce to the celebrity / fan-out problem, and what are the options?