Design a Web Crawler
A web crawler is a deceptively simple loop: fetch a page, extract its links, add them to a to-do list, repeat. Run that loop at the scale of the open web and every word in it becomes a hard problem. The to-do list has billions of entries. “Have I seen this URL?” must be answered billions of times. “Repeat” must not hammer a small blog into the ground. This walkthrough builds a crawler that is polite, distributed, and doesn’t drown in its own duplicates.
1. Requirements
Section titled “1. Requirements”Functional
- Start from seed URLs, fetch pages, extract and follow links (BFS-ish traversal).
- Deduplicate URLs so we don’t crawl the same page endlessly.
- Respect
robots.txtand politeness (don’t overload any single host). - Store fetched content for downstream use (search index, archive, training corpus).
- Support re-crawling — pages change, so revisit by priority/freshness.
Non-functional
- Scale: billions of pages; horizontally scalable workers.
- Politeness is non-negotiable — a rude crawler gets IP-banned and is a bad web citizen.
- Robustness: handle timeouts, redirects, traps (infinite calendars), malformed HTML.
- Extensibility: pluggable parsers and filters.
2. Estimation
Section titled “2. Estimation”Target 1B pages/month → ~400 pages/sec average sustained. At ~100 KB of HTML per page that’s ~100 TB/month of raw content before dedup and compression. The frontier (URLs queued to crawl) can hold tens of billions of URLs — far too big for one machine’s memory, so it lives on disk, sharded.
The URL-seen set is the sneaky cost: tracking “have I crawled this?” for tens of billions of URLs, checked on every extracted link. Storing full URLs is prohibitive — this is where we get clever (see deep dives).
3. API sketch
Section titled “3. API sketch”A crawler is mostly internal, but the control surface looks like:
POST /v1/seeds { "urls": [...], "priority": 5 } // inject starting pointsGET /v1/status → { pages_crawled, frontier_size, fetch_rate }POST /v1/recrawl { "host": "example.com", "max_age": "7d" }
Internal queue contract (frontier): enqueue(url, host, priority, earliest_fetch_time) dequeue() → next URL whose host is due (politeness-aware)4. Data model
Section titled “4. Data model”frontier url, host, priority, depth, earliest_fetch_time, enqueued_aturl_seen url_hash // membership only — Bloom filter + storecontent url, content_hash, fetched_at, status, body_ref (→ blob store)host_state host, last_fetch_at, crawl_delay, robots_rulescontent_seen content_hash // near-dup detection across URLsTwo separate “seen” concepts matter: url_seen (“have I queued/crawled this URL?”) and
content_seen (“have I seen this body before, under a different URL?”). Mirror sites and URL
parameters produce the same content at many addresses; catching that saves enormous storage.
5. High-level design
Section titled “5. High-level design” seeds ──► ┌─────────────────────────────────────────────┐ │ FRONTIER (sharded) │ │ priority lanes × per-HOST politeness queue │ └───────────────┬─────────────────────────────┘ dequeue host-due URLs ┌───────────────┬────┴───────┬───────────────┐ ▼ ▼ ▼ ▼ fetcher worker fetcher fetcher fetcher (stateless pool) │ robots.txt check + rate limit per host ▼ DNS resolve → HTTP GET → response │ ├──► content store (blob) + content_hash dedup │ ▼ parser: extract links, normalize URLs │ ▼ URL filter ──► URL-SEEN check (Bloom filter) ──new?──► enqueue back to frontier │seen └─ dropThe loop is a cycle: frontier → fetch → parse → filter → frontier. The two valves that keep it sane are the per-host politeness queue in the frontier (controls rate) and the URL-seen filter before re-enqueue (controls duplication).
What does this buy us, and what does it cost? Sharding the frontier and running a stateless fetcher pool buys near-linear horizontal scale — add workers, crawl faster. The cost is coordination: the politeness guarantee is global (“never hit example.com more than once a second”) but the workers are distributed, so two workers must not both grab example.com URLs at the same instant. We solve this by partitioning the frontier by host — all of a host’s URLs live in one shard, so one queue enforces that host’s rate. We buy scale by giving up the simplicity of a single global queue.
6. Scaling and deep dives
Section titled “6. Scaling and deep dives”The frontier as a politeness engine. Model it as priority lanes (important pages first)
crossed with per-host queues. A URL is only dequeued when its host’s earliest_fetch_time has
passed (now + crawl_delay). This naturally spreads load: high-priority pages jump the queue, but no
host ever gets crawled faster than its delay allows.
URL-seen dedup with a Bloom filter. Checking “have I seen this URL?” against tens of billions of entries on every link is the hottest path in the system. Storing all URLs in a hash set would cost terabytes of RAM. Instead, a Bloom filter answers membership in a few bits per URL with a tunable false-positive rate.
Bloom filter: "definitely new" or "probably seen" - never a false negative → we never re-crawl something it says is seen... wait: - false POSITIVE means: we occasionally skip a genuinely-new URL (acceptable) - false NEGATIVE never happens → we never crawl a true duplicate twiceA false positive means we miss a new page (tolerable at web scale); we never waste a fetch on a true duplicate. This trade — a little recall for a massive memory saving — is exactly what Probabilistic Data Structures are for.
Politeness in detail. Beyond robots.txt, respect Crawl-delay, cap concurrent connections per
host, and identify yourself with a real User-Agent. Back off on 429/503. A crawler that ignores
these gets blocked and poisons the well for everyone.
Crawler traps & normalization. Infinite calendars, session-id query params, and faceted filters generate unbounded near-identical URLs. Defenses: URL normalization (strip tracking params, sort query keys), depth limits, per-host page caps, and content-hash dedup to catch the same body behind many URLs.
Distributed coordination. Shard frontier and seen-set by host hash so each crawler “owns” a slice of hosts — this makes politeness a local decision (one shard = one host’s rate) instead of a global lock. DNS is cached aggressively; it’s a surprising bottleneck at 400 fetches/sec.
Freshness / re-crawl. Pages change at wildly different rates. Prioritize re-crawls by observed change frequency (news hourly, archives rarely) so crawl budget goes where content actually moves.
7. Key trade-offs
Section titled “7. Key trade-offs”Decision Buys you Costs you─────────────────────────────────────────────────────────────────────Frontier sharded by host local politeness, no uneven load (a giant global lock host loads one shard)Bloom filter for tiny memory at billions rare skipped new URLsURL-seen of URLs (false positives)Per-host rate limit don't get IP-banned; slower max throughput good citizenship per hostContent-hash dedup skip mirror/dup bodies extra hashing + storeStateless fetcher pool horizontal scaling coordination via frontierThe throughline: a web crawler is a politeness-constrained dedup machine. The naive loop is easy; the real system is the frontier that throttles per host and the probabilistic seen-set that lets you remember billions of URLs without billions of dollars of RAM.
Check your understanding
Section titled “Check your understanding”- Why is politeness — not raw speed — the constraint that most shapes the architecture?
- Why is the frontier partitioned by host, and what does that buy over a single global queue?
- A Bloom filter for the URL-seen set has false positives but no false negatives. What concrete behavior does each mean for the crawler, and why is the error in the “safe” direction?
- Distinguish
url_seenfromcontent_seenand give an example where the second saves real storage. - Name two kinds of crawler trap and the defenses that contain them.