Web Scraping & Crawling

What is a web crawler?

A program that systematically follows links between web pages to discover and index content at scale.

Also known as: spider, bot, web spider

A web crawler, sometimes called a spider or bot, is software that automatically visits URLs, downloads each page, extracts the links it finds, and queues those links to visit next. The process repeats until the crawler exhausts the link graph or hits a limit on depth, page count, or time.
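That fetch-extract-queue loop can be sketched in a few dozen lines. The example below is a minimal breadth-first crawl using only the Python standard library; the function name, page limit, and `http` filter are illustrative, and it deliberately omits politeness, robots.txt handling, and JavaScript rendering:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects href values from <a> tags on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, max_pages=50):
    """Breadth-first crawl: fetch a page, extract links, queue unseen URLs."""
    seen = {seed_url}          # deduplicate so each URL is fetched once
    queue = deque([seed_url])  # frontier of URLs still to visit
    pages = {}                 # url -> raw HTML
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue  # skip unreachable or non-HTTP pages
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and strip #fragments before queueing
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages
```

The `max_pages` cap is one of the stop conditions mentioned above; a depth or time budget would slot into the same loop.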

Search engines run crawlers continuously to keep their indexes fresh: Googlebot, Bingbot, and the AI training crawlers (GPTBot, ClaudeBot, Perplexity-User) all behave the same way at the wire level. Outside search, crawlers power price monitoring, SEO audits, archival, brand monitoring, and the data pipelines that feed retrieval-augmented LLMs.

A production-grade crawler has to handle a lot more than the loop above. It needs polite request scheduling, robots.txt compliance, sitemap parsing, JavaScript rendering for SPAs, retries with backoff, deduplication of canonicalized URLs, and a strategy for when the target site starts returning 429s or blocking the IP. Most teams underestimate how much of the engineering goes into these operational concerns rather than the core fetch-and-follow loop.
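Two of those concerns, robots.txt compliance and backoff on 429s, can be sketched with the standard library's `urllib.robotparser`. The helper and function names here are illustrative, not a prescribed design:

```python
import time
from urllib.error import HTTPError
from urllib.parse import urljoin
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against already-fetched robots.txt rules."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)


def polite_fetch(url, user_agent="example-crawler", max_retries=3):
    """Fetch a URL only if robots.txt allows it; back off on 429 responses."""
    robots_url = urljoin(url, "/robots.txt")
    try:
        robots_txt = urlopen(robots_url, timeout=10).read().decode("utf-8", "replace")
    except OSError:
        robots_txt = ""  # unreachable robots.txt: no rules to apply
    if not is_allowed(robots_txt, user_agent, url):
        return None  # disallowed by robots.txt — skip, don't fetch
    for attempt in range(max_retries):
        try:
            return urlopen(url, timeout=10).read()
        except HTTPError as exc:
            if exc.code == 429:
                time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s...
            else:
                raise
    return None  # gave up after repeated rate limiting
```

A real crawler would cache the parsed robots.txt per host and honor `Crawl-delay` and `Retry-After` where present, rather than re-fetching the rules on every request.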

In the wild

  • Googlebot indexing a news site overnight
  • A SaaS product running a one-off crawl to build an internal docs search index
  • An AI agent crawling a domain to ground its answers in current product information

How Brand.dev uses web crawlers

Endpoints in the Brand.dev API where this concept comes up directly.

FAQ

Is web crawling the same as web scraping?

Crawling is the discovery step: finding URLs by following links. Scraping is the extraction step: pulling specific data out of the pages a crawler fetches. Most real systems do both back-to-back.
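The distinction is visible in code: the crawler hands raw HTML to a scraper that pulls out one specific field. This sketch extracts a page's `<title>` with the standard-library parser; the class name is illustrative:

```python
from html.parser import HTMLParser


class TitleScraper(HTMLParser):
    """Scraping step: extract the <title> text from HTML a crawler fetched."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
```

In a combined pipeline, the crawl loop discovers the URL and downloads the page; `TitleScraper` (or its equivalent for prices, headlines, or product specs) then runs over each downloaded document.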

Is web crawling legal?

Crawling publicly accessible content is generally legal in the US, but you must respect robots.txt directives, the site's terms of service, and applicable copyright law and case law under the Computer Fraud and Abuse Act (CFAA). Crawling content behind authentication without permission is a different question entirely.

How fast should a crawler be?

Aim for the slowest rate that still finishes the job. A polite default is one request per second per host, slower for small sites. Go faster and you will trigger rate limits, get your IP blocked, or knock over a small server.
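The one-request-per-second-per-host rule is simple to enforce with a per-host delay tracker. A minimal sketch (class name and default delay are illustrative):

```python
import time
from urllib.parse import urlsplit


class HostRateLimiter:
    """Enforces a minimum delay between consecutive requests to the same host."""

    def __init__(self, delay_seconds=1.0):
        self.delay = delay_seconds
        self.last_request = {}  # host -> time.monotonic() of last request

    def wait(self, url):
        """Block until it is polite to hit this URL's host, then record the hit."""
        host = urlsplit(url).netloc
        due = self.last_request.get(host, 0.0) + self.delay
        now = time.monotonic()
        if due > now:
            time.sleep(due - now)  # same host too recently: pause
        self.last_request[host] = time.monotonic()
```

Because the delay is tracked per host, a crawler can interleave many sites at full speed while never hammering any single one; calling `limiter.wait(url)` before each fetch is enough.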

Related terms

Ship an agent that actually knows things.

Free tier, 10-minute integration, and the same API powering agents at Mintlify, daily.dev, and Propane. No credit card to start.