Web Scraping & Crawling
What is robots.txt?
A plain-text file at the root of a domain that tells crawlers which paths they are allowed (or not allowed) to fetch.
robots.txt lives at https://example.com/robots.txt and uses the Robots Exclusion Protocol to publish rules per crawler. A typical file lists User-agent: lines naming a bot (or * for all bots), followed by Disallow: and Allow: rules that scope which URL paths that bot may visit.
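As a sketch (hypothetical paths, made-up policy), a small file combining a wildcard group, a bot-specific group, and a sitemap pointer might look like this:

```
# Applies to every crawler
User-agent: *
Disallow: /admin/
Allow: /admin/help/

# Applies only to OpenAI's GPTBot
User-agent: GPTBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
```

A crawler obeys only the most specific group that names it, so GPTBot here follows the blanket Disallow: / and ignores the * rules entirely.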
It is a request, not an enforcement mechanism. Compliant crawlers (Googlebot, Bingbot, GPTBot, Claude-Web, your own well-behaved scraper) read robots.txt and obey it. Adversarial crawlers ignore it entirely. Sites still rely on it because the major search and AI players do follow it, and because it doubles as a clean signal of intent in any legal dispute.
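If you are writing the well-behaved scraper yourself, Python's standard library ships a parser for exactly this check; a minimal sketch with a hypothetical bot name and example URLs:

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "MyBot/1.0"  # hypothetical user-agent string

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the live robots.txt

for url in ("https://example.com/", "https://example.com/admin/"):
    if rp.can_fetch(USER_AGENT, url):
        print("allowed:", url)  # safe for a compliant crawler to request
    else:
        print("blocked:", url)  # a compliant crawler skips this URL
```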
In 2024 the file picked up a new neighbor: llms.txt, which signals which content a site is willing to expose to LLM crawlers and where the canonical Markdown lives. The two files solve different problems, but you will increasingly see them deployed together.
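A sketch of the proposed llms.txt format (hypothetical site and URLs): a Markdown file at the domain root with a title, a one-line summary, and links to the Markdown versions of key pages.

```markdown
# Example Corp

> Developer documentation for the Example Corp API.

## Docs

- [Quickstart](https://example.com/docs/quickstart.md): get an API key and make a first request
- [API reference](https://example.com/docs/api.md): every endpoint, in plain Markdown

## Optional

- [Changelog](https://example.com/changelog.md)
```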
In the wild
- User-agent: GPTBot followed by Disallow: / → block OpenAI's training crawler from the entire site
- Sitemap: https://example.com/sitemap.xml → point any crawler at the sitemap
- User-agent: * followed by Disallow: /admin/ → keep every bot out of the admin path
How Brand.dev uses robots.txt
Endpoints in the Brand.dev API where this concept comes up directly.
FAQ
Does Google obey robots.txt?
Yes. Googlebot fetches robots.txt before crawling and respects Disallow directives. Note that Google can still index a URL it discovers via inbound links even when robots.txt blocks it from fetching the page itself; if you need the page kept out of the index, let Google crawl it and use a noindex meta tag (or an X-Robots-Tag response header) instead.
Is robots.txt a security control?
No. Anyone can read robots.txt, so listing a path there announces it exists. For anything sensitive, use authentication and server-side authorization, not a Disallow line.
What is Crawl-delay?
A non-standard directive that asks a crawler to wait a minimum number of seconds between requests. Google ignores it; Bing and Yandex respect it.
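A sketch of such a group, asking Bing's crawler to pause ten seconds between requests:

```
User-agent: bingbot
Crawl-delay: 10
```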
Related terms
- Web crawler: A program that systematically follows links between web pages to discover and index content at scale.
- Sitemap: An XML file that lists every important URL on a site so search engines and crawlers can discover them efficiently.
- Web scraping: Programmatically extracting structured data from websites that were designed to be read by humans.
- Rate limiting: A server-side policy that caps how many requests a client can make in a given window, returning 429 Too Many Requests when the cap is exceeded.