Web Scraping & Crawling
What is robots.txt?
A plain-text file at the root of a domain that tells crawlers which paths they are allowed (or not allowed) to fetch.
robots.txt lives at https://example.com/robots.txt and uses the Robots Exclusion Protocol to publish rules per crawler. A typical file lists User-agent: lines naming a bot (or * for all bots), followed by Disallow: and Allow: rules that scope which URL paths that bot may visit.
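As a sketch (hypothetical paths, made-up policy), a small file combining a wildcard group, a bot-specific group, and a sitemap pointer might look like this:

```
# Applies to every crawler
User-agent: *
Disallow: /admin/
Allow: /admin/help/

# Applies only to OpenAI's GPTBot
User-agent: GPTBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
```

A crawler obeys only the most specific group that names it, so GPTBot here follows the blanket Disallow: / and ignores the * rules entirely.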
It is a request, not an enforcement mechanism. Compliant crawlers (Googlebot, Bingbot, GPTBot, Claude-Web, your own well-behaved scraper) read robots.txt and obey it. Adversarial crawlers ignore it entirely. Sites still rely on it because the major search and AI players do follow it, and because it doubles as a clean signal of intent in any legal dispute.
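If you are writing the well-behaved scraper yourself, Python's standard library ships a parser for exactly this check; a minimal sketch with a hypothetical bot name and example URLs:

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "MyBot/1.0"  # hypothetical user-agent string

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the live robots.txt

for url in ("https://example.com/", "https://example.com/admin/"):
    if rp.can_fetch(USER_AGENT, url):
        print("allowed:", url)  # safe for a compliant crawler to request
    else:
        print("blocked:", url)  # a compliant crawler skips this URL
```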
In 2024 the file picked up a new neighbor: llms.txt, which signals which content a site is willing to expose to LLM crawlers and where the canonical Markdown lives. The two files solve different problems, but you will increasingly see them deployed together.
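A sketch of the proposed llms.txt format (hypothetical site and URLs): a Markdown file at the domain root with a title, a one-line summary, and links to the Markdown versions of key pages.

```markdown
# Example Corp

> Developer documentation for the Example Corp API.

## Docs

- [Quickstart](https://example.com/docs/quickstart.md): get an API key and make a first request
- [API reference](https://example.com/docs/api.md): every endpoint, in plain Markdown

## Optional

- [Changelog](https://example.com/changelog.md)
```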
In the wild
- User-agent: GPTBot followed by Disallow: / → block OpenAI's training crawler from the entire site
- Sitemap: https://example.com/sitemap.xml → point any crawler at the sitemap
- User-agent: * followed by Disallow: /admin/ → keep every bot out of the admin path
How Brand.dev uses robots.txt
Endpoints in the Brand.dev API where this concept comes up directly.
FAQ
Does Google obey robots.txt?
Yes. Googlebot fetches robots.txt before crawling and respects Disallow directives. Note that Google can still index a URL it discovers via inbound links even when robots.txt blocks it from fetching the page itself; if you need the page kept out of the index, let Google crawl it and use a noindex meta tag (or an X-Robots-Tag response header) instead.
Is robots.txt a security control?
No. Anyone can read robots.txt, so listing a path there announces it exists. For anything sensitive, use authentication and server-side authorization, not a Disallow line.
What is Crawl-delay?
A non-standard directive that asks a crawler to wait a minimum number of seconds between requests. Google ignores it; Bing and Yandex respect it.
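A sketch of such a group, asking Bing's crawler to pause ten seconds between requests:

```
User-agent: bingbot
Crawl-delay: 10
```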
Related terms
- Web crawler: A program that systematically follows links between web pages to discover and index content at scale.
- Sitemap: An XML file that lists every important URL on a site so search engines and crawlers can discover them efficiently.
- Web scraping: Programmatically extracting structured data from websites that were designed to be read by humans.
- Rate limiting: A server-side policy that caps how many requests a client can make in a given window, returning 429 Too Many Requests when the cap is exceeded.