Web Scraping & Crawling

What is web scraping?

Programmatically extracting structured data from websites that were designed to be read by humans.

Web scraping is the practice of fetching a web page and pulling specific values out of it (prices, product details, contact info, article text) into a structured format your code can use. The page itself is usually HTML written for a browser, so the scraper has to parse the markup, walk the DOM, and isolate the parts that matter.

There are two flavors. Static scraping fetches the raw HTML over HTTP and parses it directly: fast, cheap, and sufficient for most server-rendered sites. Dynamic scraping uses a headless browser to execute JavaScript before reading the DOM: slower and more expensive, but necessary for SPAs, sites with client-side rendering, and pages that load content via XHR after the initial response.
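The static flavor can be sketched with nothing but the Python standard library. The HTML snippet and the `PriceParser` class below are hypothetical stand-ins for a real page and a real extraction rule; in practice the markup would come from an HTTP GET rather than a string literal.

```python
from html.parser import HTMLParser

# Hypothetical server-rendered product page. In a real scraper this
# would be fetched over HTTP (e.g. with urllib.request.urlopen).
PAGE = """
<html><body>
  <h1 class="product-title">Acme Widget</h1>
  <span class="price">$19.99</span>
</body></html>
"""

class PriceParser(HTMLParser):
    """Walks the markup and collects text inside class="price" elements."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if ("class", "price") in attrs:
            self._in_price = True

    def handle_endtag(self, tag):
        self._in_price = False

    def handle_data(self, data):
        if self._in_price:
            self.prices.append(data.strip())

parser = PriceParser()
parser.feed(PAGE)
print(parser.prices)  # ['$19.99']
```

The dynamic flavor swaps the string literal for a headless browser (e.g. Playwright or Puppeteer) that executes the page's JavaScript first, then hands the rendered DOM to the same kind of extraction logic.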

Modern scraping pipelines also have to defeat anti-bot countermeasures. Rotating residential proxies, fingerprint randomization, captcha solvers, and adaptive backoff are now table-stakes for any scraper running at meaningful volume. This is why teams increasingly buy scraping as an API instead of building it themselves.

In the wild

  • Pulling competitor prices into a daily spreadsheet
  • Building an LLM training corpus from open-web pages
  • Enriching CRM records with company details extracted from each lead's website

How Brand.dev uses web scraping

Endpoints in the Brand.dev API where this concept comes up directly.

FAQ

Is web scraping legal?

Scraping publicly accessible, non-copyrighted data is broadly defensible in the U.S. after the hiQ v. LinkedIn rulings, which held that scraping public pages likely does not violate the CFAA, but you still have to respect terms of service, copyright, and rate-limit signals. Scraping personal data or content behind authentication requires a much higher bar.

What's the difference between scraping HTML and scraping an API?

If a public API exists, use it. Scraping is what you do when the data you need is only available rendered into HTML; you accept the brittleness of CSS selectors in exchange for access.

How do I avoid getting blocked while scraping?

Throttle your request rate, set a real User-Agent, honor robots.txt, rotate IPs through residential proxies for sites that block datacenter ranges, and back off exponentially on 429s.
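The backoff part of that advice can be sketched as a retry loop. `fetch_with_backoff` and `fake_fetch` below are hypothetical names; the fake fetcher simulates a server that returns 429 twice before succeeding, so the sketch runs without touching the network.

```python
import random
import time

def fetch_with_backoff(fetch, url, max_retries=5, base_delay=1.0):
    """Call fetch(url); on a 429 status, sleep base_delay * 2**attempt
    plus a little jitter and retry, up to max_retries attempts.
    `fetch` is any callable returning a (status, body) pair."""
    for attempt in range(max_retries):
        status, body = fetch(url)
        if status != 429:
            return status, body
        delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
        time.sleep(delay)
    return status, body  # still rate-limited after all retries

# Simulated server that rate-limits the first two requests.
calls = {"n": 0}
def fake_fetch(url):
    calls["n"] += 1
    return (429, "") if calls["n"] <= 2 else (200, "<html>ok</html>")

status, body = fetch_with_backoff(
    fake_fetch, "https://example.com", base_delay=0.01
)
print(status)  # 200, on the third attempt
```

The jitter matters in practice: without it, a fleet of scrapers that got rate-limited together retries together and gets rate-limited again.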

Ship an agent that actually knows things.

Free tier, 10-minute integration, and the same API powering agents at Mintlify, daily.dev, and Propane. No credit card to start.