Feb 01, 2026
4 min read

Designing Distributed Crawlers with Message Queues

A practical guide to building scalable, fault-tolerant web crawlers using message queues, worker pools, and backpressure control — lessons from building the ScamIntelli data ingestion pipeline.

Building a web crawler that handles a few hundred pages is a weekend project. Building one that reliably processes millions of pages per day, survives worker crashes, respects rate limits, deduplicates URLs across restarts, and gives you visibility into what it’s doing — that’s a distributed systems problem.

This post is about the design decisions I made when building the data ingestion pipeline for ScamIntelli, and what generalizes to crawler design broadly.

Why Message Queues

The naive crawler design is a recursive function: fetch a URL, extract links, enqueue them, repeat. This works until it doesn’t — when your single process crashes and you lose queue state, when you want to scale to multiple workers, when a slow domain blocks your entire pipeline.

Message queues decouple producers from consumers and give you durability for free. The design I use:

  • Frontier queue: URLs waiting to be fetched. Producers (link extractors) push here; fetchers consume.
  • Fetch results queue: Raw responses waiting to be parsed. Fetchers push here; parsers consume.
  • Graph events queue: Structured entities extracted from pages. Parsers push here; the graph builder consumes.

Each queue is a Redis Stream. Workers are Python processes running in Docker containers, orchestrated with docker-compose locally and Kubernetes in production.
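To make the stages concrete, here is a minimal sketch of the message shape each stream might carry, assuming JSON-encoded payloads in the stream entries. The field names are illustrative, not the actual ScamIntelli schema:

```python
import json

# Illustrative payloads for each stage. In the real pipeline each of
# these would be a field of a Redis Stream entry (XADD <stream> * ...).
frontier_msg = {"url": "https://example.com/page", "depth": 2,
                "correlation_id": "abc123"}
fetch_result_msg = {"url": "https://example.com/page", "status": 200,
                    "body": "<html>...</html>", "correlation_id": "abc123"}
graph_event_msg = {"entity": "domain", "value": "example.com",
                   "source_url": "https://example.com/page",
                   "correlation_id": "abc123"}

def encode(msg: dict) -> bytes:
    """Serialize a message before pushing it onto a stream."""
    return json.dumps(msg).encode()

def decode(raw: bytes) -> dict:
    """Decode on the consuming side of the stream."""
    return json.loads(raw)
```

Note the `correlation_id` riding along on every message: it is what lets you trace a single URL through all three stages later.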

Backpressure and Rate Limiting

Without backpressure, a fast producer will overwhelm a slow consumer. Redis Streams handle this gracefully — consumers acknowledge messages explicitly, and you can monitor lag per consumer group. If lag grows beyond a threshold, I pause producers and alert.
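One way to implement that pause is a gate with two watermarks, so a lag value hovering near a single threshold doesn't flap the producers on and off. A sketch, with illustrative watermark values (the lag itself comes from Redis's per-consumer-group accounting):

```python
class BackpressureGate:
    """Pauses producers above a high watermark and resumes them only
    once lag drops below a low watermark, avoiding rapid flapping."""

    def __init__(self, high: int = 10_000, low: int = 2_000):
        self.high = high
        self.low = low
        self.paused = False

    def update(self, lag: int) -> bool:
        """Feed the latest consumer-group lag; returns True while paused."""
        if lag >= self.high:
            self.paused = True
        elif lag <= self.low:
            self.paused = False
        return self.paused
```

A producer loop checks the gate before each batch and sleeps (and alerts) while it reports paused.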

Rate limiting per domain is trickier. I maintain a token bucket in Redis for each domain (a host:bucket key, refilled every N seconds). Before fetching, a worker atomically checks and decrements the bucket; if the bucket is empty, the URL is re-queued with a delay. The check-and-decrement runs as a single Lua script inside Redis, which is what keeps it atomic across workers.
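Here is the bucket logic modeled in plain Python. In production this state lives under a Redis key and the whole `try_acquire` body runs as one Lua script so the check-and-decrement is a single atomic round trip; the capacity and refill period below are illustrative:

```python
import time
from typing import Optional

class DomainTokenBucket:
    """Per-domain token bucket: up to `capacity` fetches per
    `refill_period` seconds. In Redis the same state sits in a
    host:bucket key and this logic runs as a Lua script."""

    def __init__(self, capacity: int = 5, refill_period: float = 10.0):
        self.capacity = capacity
        self.refill_period = refill_period
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def try_acquire(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        # Refill to full capacity once per period, as described above.
        if now - self.last_refill >= self.refill_period:
            self.tokens = self.capacity
            self.last_refill = now
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False  # bucket empty: caller re-queues the URL with a delay
```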

Fault Tolerance

Worker crashes: Redis Streams have a concept of “pending entries” — messages delivered to a consumer but not yet acknowledged. If a worker crashes, those messages stay in the pending list. A separate recovery process polls for old pending entries and re-delivers them to healthy workers.
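The recovery process's core decision is just "which pending entries have sat unacknowledged too long?". A sketch of that check in plain Python; in Redis the pending data comes from XPENDING and re-delivery uses XCLAIM (or XAUTOCLAIM, which automates the whole loop in Redis 6.2+). The idle threshold is illustrative:

```python
from typing import Dict, List, Tuple

def find_stale_entries(pending: Dict[str, Tuple[str, float]],
                       now: float, max_idle: float = 60.0) -> List[str]:
    """pending maps message_id -> (consumer_name, delivery_timestamp).
    Returns the ids unacknowledged for longer than max_idle seconds --
    the entries a recovery process should claim for a healthy worker."""
    return [msg_id
            for msg_id, (_consumer, delivered_at) in pending.items()
            if now - delivered_at > max_idle]
```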

Deduplication: I maintain a Redis Set of seen URLs (actually a Bloom filter, for memory efficiency at scale). Before pushing to the frontier, check the filter. A false positive means a never-seen URL is wrongly treated as already crawled and skipped — a tiny miss rate, and an acceptable price for the memory savings.
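A toy Bloom filter makes the trade-off concrete. This is a sketch with hand-rolled hashing; a real deployment would more likely use something like RedisBloom's BF.ADD/BF.EXISTS so the filter lives server-side:

```python
import hashlib

class BloomFilter:
    """Bit array plus k hash probes. No false negatives: an added URL
    always reports seen. A false positive skips a never-seen URL, which
    a crawler can tolerate."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, url: str):
        # Derive k positions by salting the URL with the probe index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, url: str) -> None:
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def seen(self, url: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))
```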

Poison URLs: Some URLs cause workers to crash reliably (malformed content, infinite redirects). After N failed attempts, move the URL to a dead-letter queue for manual inspection.
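The retry accounting can be as simple as a failure counter keyed by URL. A sketch; in the real pipeline the counter would live in Redis and the dead-letter queue would be another stream, and N=3 here is illustrative:

```python
from collections import defaultdict

class PoisonGuard:
    """Tracks fetch failures per URL; after max_attempts the URL is
    routed to a dead-letter list for manual inspection instead of
    being retried forever."""

    def __init__(self, max_attempts: int = 3):
        self.max_attempts = max_attempts
        self.failures = defaultdict(int)
        self.dead_letter = []

    def record_failure(self, url: str) -> bool:
        """Returns True if the URL should be retried, False once it
        has been moved to the dead-letter queue."""
        self.failures[url] += 1
        if self.failures[url] >= self.max_attempts:
            self.dead_letter.append(url)
            return False
        return True
```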

Observability

A distributed crawler without visibility is a black box. I expose:

  • Per-domain fetch rates and error rates (Prometheus counters)
  • Queue depths for each stage (Prometheus gauges, scraped from Redis)
  • Worker health (heartbeat keys in Redis, TTL-based liveness)
  • A simple Grafana dashboard that shows pipeline throughput end-to-end
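The heartbeat scheme is worth sketching: each worker periodically writes a key with a TTL, and liveness reduces to "does the key still exist". Modeled here without Redis (where the expiry would be handled by the key's TTL rather than an explicit check); the TTL value is illustrative:

```python
from typing import Dict, List

class HeartbeatMonitor:
    """Workers refresh a per-worker heartbeat entry on an interval
    (in Redis: SET worker:<id>:heartbeat ... EX 15). If a worker stops
    beating, its entry ages past the TTL and it is declared dead."""

    def __init__(self, ttl: float = 15.0):
        self.ttl = ttl
        self.last_beat: Dict[str, float] = {}

    def beat(self, worker_id: str, now: float) -> None:
        self.last_beat[worker_id] = now

    def alive_workers(self, now: float) -> List[str]:
        return sorted(worker for worker, t in self.last_beat.items()
                      if now - t < self.ttl)
```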

When something breaks at 2 AM, the dashboard tells you exactly which stage stalled and why.

Lessons

  1. Start with one queue, add stages as needed. It’s easier to split a queue than merge two.
  2. Make workers idempotent. Processing the same URL twice should produce the same result, not a duplicate.
  3. Log correlation IDs. Trace a single URL’s journey through every stage. You will need this.
  4. Test your recovery path. Kill workers deliberately and make sure pending messages get reprocessed. Do this before production, not during an incident.
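Lesson 3 in code: attach the correlation ID once and let every log line carry it automatically. A sketch using the standard library's `LoggerAdapter` (the handler writes to a `StringIO` here just so the output is easy to inspect; a real worker would log to stdout or a JSON formatter):

```python
import io
import logging

def make_url_logger(correlation_id: str) -> logging.LoggerAdapter:
    """Every record emitted through the adapter carries the URL's
    correlation ID, so one grep traces the URL across all stages."""
    logger = logging.getLogger("crawler")
    return logging.LoggerAdapter(logger, {"correlation_id": correlation_id})

# Wire a handler whose format includes the ID.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("[%(correlation_id)s] %(message)s"))
logging.getLogger("crawler").addHandler(handler)
logging.getLogger("crawler").setLevel(logging.INFO)

log = make_url_logger("abc123")
log.info("fetched %s", "https://example.com/page")
```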

The full pipeline design is documented in the ScamIntelli repository. The message queue patterns here generalize to any multi-stage data pipeline — crawlers, ETL jobs, event processing systems.