Web Scraping & Crawling

What is Scrapy?

A Python framework for building large-scale web crawlers, with batteries included for scheduling, retries, deduplication, and data pipelines.

Scrapy is what you reach for when "fetch a few pages with requests" stops scaling. You define a Spider class with start URLs and a parse method, yield more requests or extracted items from each response, and Scrapy handles the queue, concurrency, throttling, retries, redirects, cookies, and output serialization. A single Spider can crawl millions of URLs from a laptop.

The architecture is async-first (built on Twisted historically, with asyncio integration in recent versions) so the framework keeps thousands of requests in flight without blocking. Middlewares plug into the request and response paths for things like proxy rotation, CAPTCHA handling, or custom retry logic; pipelines transform extracted items before they hit the database, JSON Lines file, or whatever sink you configure.

Scrapy is not a fit for everything. If you only need a few pages, a script with httpx and selectolax is simpler. If the target requires JavaScript rendering, you bolt on scrapy-playwright or pre-render upstream. But for breadth-first crawls of static or server-rendered sites, Scrapy is still the most productive option in Python.

In the wild

  • Running a nightly crawl of 50,000 supplier websites to refresh a product catalog
  • Building a vertical search index over a list of news domains
  • Scraping every page of a competitor's docs to feed an LLM's retrieval index

How Brand.dev uses Scrapy

Endpoints in the Brand.dev API where this concept comes up directly.

FAQ

Scrapy vs requests + BeautifulSoup?

Use requests + BeautifulSoup for one-shot scripts. Move to Scrapy when you need concurrency, polite scheduling, retries, and a pipeline architecture: basically anything where you would otherwise reinvent those primitives.

Does Scrapy handle JavaScript?

Not natively. Pair it with scrapy-playwright or scrapy-splash to render pages, or scrape the underlying API the JS calls if one exists.
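With scrapy-playwright, the integration is mostly settings plus a per-request flag. A sketch, with project-specific values assumed:

```python
# settings.py (sketch): route HTTP(S) downloads through Playwright so
# responses come back with JavaScript already executed.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
# scrapy-playwright requires the asyncio-backed Twisted reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```

Then opt in per request with `yield scrapy.Request(url, meta={"playwright": True})`, so static pages on the same crawl can skip the rendering cost.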

Is Scrapy good for production?

Yes, with monitoring. Run it under Scrapyd or a custom orchestrator, instrument the stats endpoint, and persist queue state if you need crash recovery.
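Under Scrapyd that workflow looks roughly like this; the project and spider names are placeholders:

```shell
# Sketch: deploy a Scrapy project to Scrapyd and schedule a run.
pip install scrapyd scrapyd-client

scrapyd                                # start the daemon (default port 6800)
scrapyd-deploy default -p myproject    # package the project and upload it

# Schedule a crawl and poll its job state over Scrapyd's JSON API.
curl http://localhost:6800/schedule.json -d project=myproject -d spider=quotes
curl "http://localhost:6800/listjobs.json?project=myproject"
```

For crash recovery, set `JOBDIR` in the spider's settings so the request queue and dupefilter state persist to disk between runs.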

Related terms

Ship an agent that actually knows things.

Free tier, 10-minute integration, and the same API powering agents at Mintlify, daily.dev, and Propane. No credit card to start.