Web Scraping & Crawling
What is a web crawler?
A program that systematically follows links between web pages to discover and index content at scale.
Also known as: spider, bot, web spider
A web crawler, sometimes called a spider or bot, is software that automatically visits URLs, downloads each page, extracts the links it finds, and queues those links to visit next. The process repeats until the crawler exhausts the link graph or hits a limit on depth, page count, or time.
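A minimal sketch of that loop, assuming the third-party requests and beautifulsoup4 packages; the seed URL and page limit are illustrative placeholders, not anything a real crawler would hardcode.

```python
# Minimal breadth-first crawl loop: fetch a page, extract its links,
# queue the unseen ones, repeat until the frontier empties or a page
# limit is reached.
from collections import deque
from urllib.parse import urljoin, urldefrag

import requests
from bs4 import BeautifulSoup

def crawl(seed_url: str, max_pages: int = 100) -> set[str]:
    seen = {seed_url}              # URLs already queued, to avoid re-visiting
    frontier = deque([seed_url])   # FIFO queue gives breadth-first traversal
    visited = set()

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
        except requests.RequestException:
            continue  # skip pages that fail to download
        visited.add(url)

        # Extract every link on the page and queue the ones not seen yet.
        soup = BeautifulSoup(resp.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            # Resolve relative links and drop #fragments before queueing.
            link, _ = urldefrag(urljoin(url, anchor["href"]))
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited
```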
Search engines run crawlers continuously to keep their indexes fresh: Googlebot, Bingbot, and the AI training crawlers (GPTBot, ClaudeBot, Perplexity-User) all behave the same way at the wire level. Outside search, crawlers power price monitoring, SEO audits, archival, brand monitoring, and the data pipelines that feed retrieval-augmented LLMs.
A production-grade crawler has to handle a lot more than the loop above. It needs polite request scheduling, robots.txt compliance, sitemap parsing, JavaScript rendering for SPAs, retries with backoff, deduplication of canonicalized URLs, and a strategy for when the target site starts returning 429s or blocking the IP. Most teams underestimate how much of the engineering effort goes into that list of hardening concerns rather than the core loop.
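A hedged sketch of two of those concerns: robots.txt compliance via the standard library's robotparser, and retries with exponential backoff on 429 responses. The user agent string and retry counts are illustrative assumptions, and a real crawler would cache one parsed robots.txt per host instead of refetching it.

```python
# robots.txt check plus backoff-on-429 fetch, using only requests and
# the standard library.
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

import requests

USER_AGENT = "examplebot/1.0"  # hypothetical crawler identity

def allowed_by_robots(url: str) -> bool:
    # Fetch and parse /robots.txt for the URL's host, then ask whether
    # this user agent may fetch the path.
    parts = urlparse(url)
    parser = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    parser.read()
    return parser.can_fetch(USER_AGENT, url)

def fetch_with_backoff(url: str, max_retries: int = 4) -> requests.Response | None:
    delay = 1.0
    for _ in range(max_retries):
        resp = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
        if resp.status_code != 429:
            return resp
        # Honor Retry-After if the server sends it (this sketch assumes
        # seconds; the header can also be an HTTP date), otherwise back
        # off exponentially before trying again.
        delay = float(resp.headers.get("Retry-After", delay))
        time.sleep(delay)
        delay *= 2
    return None  # give up after repeated 429s
```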
In the wild
- Googlebot indexing a news site overnight
- A SaaS product running a one-off crawl to build an internal docs search index
- An AI agent crawling a domain to ground its answers in current product information
How Brand.dev uses web crawlers
Endpoints in the Brand.dev API where this concept comes up directly.
FAQ
Is web crawling the same as web scraping?
Crawling is the discovery step: finding URLs by following links. Scraping is the extraction step: pulling specific data out of the pages a crawler fetches. Most real systems do both back to back, as in the sketch below.
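A small sketch of the distinction, again assuming requests and beautifulsoup4. The URL and the `.price` CSS selector are hypothetical placeholders for whatever data your scraper actually targets.

```python
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/products", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# Crawling: discover more URLs to visit by following links.
discovered = [a["href"] for a in soup.find_all("a", href=True)]

# Scraping: extract specific data out of the page itself.
prices = [tag.get_text(strip=True) for tag in soup.select(".price")]
```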
Is web crawling legal?
Crawling publicly accessible content is generally legal in the US, but you must respect robots.txt directives, the site's terms of service, and applicable copyright and CFAA rulings. Crawling content behind authentication without permission is a different question entirely.
How fast should a crawler be?
Aim for the slowest rate that still finishes the job. A polite default is one request per second per host, slower for small sites. Anything faster risks triggering rate limits, getting your IP blocked, or knocking over a small server.
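A minimal sketch of that per-host pacing: track the last request time for each host and sleep off the remainder before fetching again. The 1.0-second default is the guideline from the answer above, not a universal constant.

```python
import time
from urllib.parse import urlparse

_last_request: dict[str, float] = {}

def polite_wait(url: str, min_delay: float = 1.0) -> None:
    # Block until at least min_delay seconds have passed since the last
    # request to this URL's host, then record the new request time.
    host = urlparse(url).netloc
    elapsed = time.monotonic() - _last_request.get(host, 0.0)
    if elapsed < min_delay:
        time.sleep(min_delay - elapsed)
    _last_request[host] = time.monotonic()
```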
Related terms
Web scraping: Programmatically extracting structured data from websites that were designed to be read by humans.
robots.txt: A plain-text file at the root of a domain that tells crawlers which paths they are allowed (or not allowed) to fetch.
Sitemap: An XML file that lists every important URL on a site so search engines and crawlers can discover them efficiently.
Rate limiting: A server-side policy that caps how many requests a client can make in a given window, returning 429 Too Many Requests when the cap is exceeded.
Proxy: A server that forwards your network requests, presenting its own IP address to the destination instead of yours.
Headless browser: A real browser engine running without a visible UI, controlled programmatically through an automation API.