Web Scraping & Crawling
What is Scrapy?
A Python framework for building large-scale web crawlers, with batteries included for scheduling, retries, deduplication, and data pipelines.
Scrapy is what you reach for when "fetch a few pages with requests" stops scaling. You define a Spider class with start URLs and a parse method, yield more requests or extracted items from each response, and Scrapy handles the queue, concurrency, throttling, retries, redirects, cookies, and output serialization. A single Spider can crawl millions of URLs from a laptop.
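A minimal sketch of that pattern (the domain, spider name, and CSS selectors below are placeholders, not taken from any real site):

```python
import scrapy


class CatalogSpider(scrapy.Spider):
    # Hypothetical spider crawling a product listing.
    name = "catalog"
    start_urls = ["https://example.com/products"]

    def parse(self, response):
        # Yield extracted items as plain dicts (or Item objects).
        for product in response.css("div.product"):
            yield {
                "title": product.css("h2::text").get(),
                "price": product.css("span.price::text").get(),
            }
        # Yield further requests; Scrapy schedules, deduplicates,
        # throttles, and retries them for you.
        for href in response.css("a.next-page::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

On reasonably recent Scrapy versions you can run this with `scrapy crawl catalog -O products.jsonl` and get the extracted items serialized as JSON Lines without writing any output code.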
The architecture is async-first (built on Twisted, with asyncio integration in recent versions), so the framework keeps thousands of requests in flight without blocking. Middlewares plug into the request and response paths for things like proxy rotation, CAPTCHA handling, or custom retry logic; pipelines transform extracted items before they hit the database, JSON Lines file, or whatever sink you configure.
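As a rough illustration of the pipeline hook, here is a sketch of an item pipeline that cleans one field and drops incomplete items (the project path and field names are hypothetical):

```python
from scrapy.exceptions import DropItem


class PriceCleanupPipeline:
    """Normalize the price field before the item reaches the configured sink.

    Enabled via ITEM_PIPELINES in settings.py, e.g.
    ITEM_PIPELINES = {"myproject.pipelines.PriceCleanupPipeline": 300}
    """

    def process_item(self, item, spider):
        raw = item.get("price")
        if not raw:
            # Items raising DropItem never reach the output feed or database.
            raise DropItem("missing price")
        item["price"] = raw.replace("$", "").strip()
        return item
```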
Scrapy is not a fit for everything. If you only need a few pages, a script with httpx and selectolax is simpler. If the target requires JavaScript rendering, you bolt on scrapy-playwright or pre-render upstream. But for breadth-first crawls of static or server-rendered sites, Scrapy is still the most productive option in Python.
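For comparison, the "few pages" case really is just a short script; something like the following, assuming recent httpx and selectolax versions (URL and selector are illustrative):

```python
import httpx
from selectolax.parser import HTMLParser

# One-off fetch of a single page, no framework needed.
resp = httpx.get("https://example.com", follow_redirects=True, timeout=10)
resp.raise_for_status()

tree = HTMLParser(resp.text)
title = tree.css_first("title")
print(title.text() if title else "no <title> found")
```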
In the wild
- Running a nightly crawl of 50,000 supplier websites to refresh a product catalog
- Building a vertical search index over a list of news domains
- Scraping every page of a competitor's docs to feed an LLM's retrieval index
How Brand.dev uses Scrapy
Endpoints in the Brand.dev API where this concept comes up directly.
FAQ
Scrapy vs requests + BeautifulSoup?
Use requests + BeautifulSoup for one-shot scripts. Move to Scrapy when you need concurrency, polite scheduling, retries, and a pipeline architecture: in other words, anything where you would otherwise reinvent those primitives.
Does Scrapy handle JavaScript?
Not natively. Pair it with scrapy-playwright or scrapy-splash to render pages, or scrape the underlying API the JS calls if one exists.
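The scrapy-playwright setup is roughly the following (check the plugin's README for your version; the spider-side line is just an illustration of the opt-in flag):

```python
# settings.py — hand requests to Playwright so pages are rendered before parsing
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# In the spider, opt individual requests into rendering:
#   yield scrapy.Request(url, meta={"playwright": True}, callback=self.parse)
```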
Is Scrapy good for production?
Yes, with monitoring. Run it under Scrapyd or a custom orchestrator, instrument the stats endpoint, and persist queue state if you need crash recovery.
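Crash recovery in particular is built in via Scrapy's job directory, which persists the scheduler queue and dedupe fingerprints between runs; a minimal sketch (the path is illustrative):

```python
# settings.py (excerpt) — persist crawl state so an interrupted run can resume.
# In practice this is usually set per run on the command line, e.g.:
#   scrapy crawl catalog -s JOBDIR=crawls/catalog-run-1
JOBDIR = "crawls/catalog-run-1"  # any writable directory works
```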
Related terms
- Web crawler: A program that systematically follows links between web pages to discover and index content at scale.
- Web scraping: Programmatically extracting structured data from websites that were designed to be read by humans.
- Beautiful Soup: A Python library for parsing HTML and XML and extracting data from it using a friendly, forgiving API.
- XPath: A query language for selecting nodes in an XML or HTML document using path expressions, widely used by scrapers when CSS selectors are not expressive enough.
- Rate limiting: A server-side policy that caps how many requests a client can make in a given window, returning 429 Too Many Requests when the cap is exceeded.
- Proxy server: A server that forwards your network requests, presenting its own IP address to the destination instead of yours.