Web Scraping & Crawling
What is web scraping?
Programmatically extracting structured data from websites that were designed to be read by humans.
Web scraping is the practice of fetching a web page and pulling specific values out of it (prices, product details, contact info, article text) into a structured format your code can use. The page itself is usually HTML written for a browser, so the scraper has to parse the markup, walk the DOM, and isolate the parts that matter.
There are two flavors. Static scraping fetches the raw HTML over HTTP and parses it directly: fast, cheap, and sufficient for most server-rendered sites. Dynamic scraping uses a headless browser to execute JavaScript before reading the DOM: slower and more expensive, but necessary for SPAs, sites with client-side rendering, and pages that load content via XHR after the initial response.
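A static scrape can be sketched with nothing but the standard library: fetch (or, here, hard-code) the HTML, walk the markup, and pull out the values you care about. The sample page and the `price` class name below are made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical server-rendered product page
SAMPLE_HTML = """
<html><body>
  <div class="product"><span class="price">$19.99</span></div>
  <div class="product"><span class="price">$4.50</span></div>
</body></html>
"""

class PriceExtractor(HTMLParser):
    """Collects the text of every <span class="price"> element."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

parser = PriceExtractor()
parser.feed(SAMPLE_HTML)
print(parser.prices)  # ['$19.99', '$4.50']
```

In practice you would fetch the HTML over HTTP and likely reach for a dedicated parser such as BeautifulSoup or lxml; the point is that static scraping is just "get markup, isolate nodes, extract text."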
Modern scraping pipelines also have to defeat anti-bot countermeasures. Rotating residential proxies, fingerprint randomization, captcha solvers, and adaptive backoff are now table stakes for any scraper running at meaningful volume. This is why teams increasingly buy scraping as an API instead of building it themselves.
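Adaptive backoff is the simplest of those countermeasures to implement yourself. A minimal sketch, with the network call injected so the retry logic can be shown against a simulated rate-limited server (the `fake_fetch` stub and its responses are invented for illustration):

```python
import random
import time

def fetch_with_backoff(fetch, url, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Retry `fetch(url)` on HTTP 429, doubling the wait between attempts."""
    for attempt in range(max_retries):
        status, body = fetch(url)
        if status != 429:
            return status, body
        # Exponential backoff: 1s, 2s, 4s, ... plus up to 1s of random jitter
        sleep(base_delay * (2 ** attempt) + random.random())
    raise RuntimeError(f"still rate-limited after {max_retries} retries: {url}")

# Simulated server that rate-limits the first two requests, then succeeds
calls = {"n": 0}
def fake_fetch(url):
    calls["n"] += 1
    return (429, "") if calls["n"] <= 2 else (200, "<html>ok</html>")

status, body = fetch_with_backoff(fake_fetch, "https://example.com", sleep=lambda s: None)
print(status)  # 200
```

The jitter matters: without it, a fleet of scrapers that got throttled together retries together, and gets throttled again.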
In the wild
- Pulling competitor prices into a daily spreadsheet
- Building an LLM training corpus from open-web pages
- Enriching CRM records with company details extracted from each lead's website
How Brand.dev uses web scraping
Endpoints in the Brand.dev API where this concept comes up directly.
FAQ
Is web scraping legal?
Scraping public, non-copyrighted data is broadly defensible after the hiQ v. LinkedIn ruling, but you still have to respect terms of service, copyright, and rate-limit signals. Scraping personal data or content behind auth requires a much higher bar.
What's the difference between scraping HTML and scraping an API?
If a public API exists, use it. Scraping is what you do when the data you need is only available rendered into HTML; you accept the brittleness of CSS selectors in exchange for access.
How do I avoid getting blocked while scraping?
Throttle your request rate, set a real User-Agent, honor robots.txt, rotate IPs through residential proxies for sites that block datacenter ranges, and back off exponentially on 429s.
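Honoring robots.txt is easy to automate with the standard library's `urllib.robotparser`. A sketch against a hypothetical robots.txt (the rules and the `my-scraper/1.0` user-agent string are made up):

```python
import urllib.robotparser

# Hypothetical robots.txt content; normally fetched from /robots.txt
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

allowed = rp.can_fetch("my-scraper/1.0", "https://example.com/products")
blocked = rp.can_fetch("my-scraper/1.0", "https://example.com/private/x")
print(allowed, blocked)  # True False

# Honor the site's requested delay between requests, if it declares one
delay = rp.crawl_delay("my-scraper/1.0")
print(delay)  # 2
```

Combine this with a real `User-Agent` header on every request and the backoff logic above, and you cover the basics before reaching for proxies.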
Related terms
Web crawler: A program that systematically follows links between web pages to discover and index content at scale.
Headless browser: A real browser engine running without a visible UI, controlled programmatically through an automation API.
Proxy: A server that forwards your network requests, presenting its own IP address to the destination instead of yours.
CAPTCHA: A challenge-response test designed to distinguish humans from bots, usually presented as image, audio, or behavioral puzzles.
Rate limiting: A server-side policy that caps how many requests a client can make in a given window, returning 429 Too Many Requests when the cap is exceeded.
HTTP: The application protocol the web is built on, a simple request/response format for asking a server for a resource.