Web Scraping & Crawling
What is a sitemap?
An XML file that lists every important URL on a site so search engines and crawlers can discover them efficiently.
A sitemap is a structured index of a site's public URLs, served from a path like /sitemap.xml. Each `<url>` entry carries a required `<loc>` (the page's address) plus optional `<lastmod>`, `<changefreq>`, and `<priority>` fields. Large sites split their URL inventory into a sitemap index that points to multiple child sitemaps.
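As a concrete illustration, here is a minimal single-entry sitemap and one way to read it with Python's standard library; the URL and dates are placeholders:

```python
import xml.etree.ElementTree as ET

# A minimal single-file sitemap; the URL and dates are placeholders.
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/products/widget</loc>
    <lastmod>2024-01-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

root = ET.fromstring(SITEMAP)
for url in root.findall("sm:url", NS):
    loc = url.findtext("sm:loc", namespaces=NS)          # required
    lastmod = url.findtext("sm:lastmod", namespaces=NS)  # optional
    print(loc, lastmod)
```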
For SEO, sitemaps speed up the discovery of new and updated pages; Google does not require them, but it uses them when they exist. For programmatic use, sitemaps are the cleanest possible enumeration of a domain: they are exactly what a polite crawler should fetch first before falling back to link-following.
Brand.dev's Sitemap Extractor API takes any domain, finds the sitemap (or builds one by parsing the index), and returns the deduplicated URL list, including handling of nested sitemap indexes, gzipped feeds, and robots.txt Sitemap directives.
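Rolling this yourself is mostly a recursive walk. The sketch below is a generic illustration of that traversal, not Brand.dev's implementation, and the entry URL is a placeholder: it follows `<sitemapindex>` entries, detects gzipped feeds by their magic bytes, and deduplicates with a set.

```python
import gzip
import urllib.request
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def fetch(url: str) -> bytes:
    """Fetch a sitemap, transparently decompressing gzipped feeds."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        data = resp.read()
    if data[:2] == b"\x1f\x8b":  # gzip magic bytes
        data = gzip.decompress(data)
    return data

def walk(url: str, pages: set[str], visited: set[str]) -> set[str]:
    """Recursively collect page URLs, following nested sitemap indexes."""
    if url in visited:  # guard against cyclic index references
        return pages
    visited.add(url)
    root = ET.fromstring(fetch(url))
    if root.tag.endswith("sitemapindex"):  # an index: recurse into children
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            walk(loc.text.strip(), pages, visited)
    else:  # a plain <urlset>: collect its page URLs
        for loc in root.findall("sm:url/sm:loc", NS):
            pages.add(loc.text.strip())
    return pages

# Placeholder domain; start from whichever sitemap location you discovered.
urls = walk("https://example.com/sitemap.xml", set(), set())
```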
In the wild
- /sitemap.xml, the canonical location
- /sitemap_index.xml, an index pointing to per-section sitemaps (often used by WordPress)
- Image and video sitemaps for richer indexing of media-heavy sites
How Brand.dev uses sitemaps
Endpoints in the Brand.dev API where this concept comes up directly.
FAQ
Do I need a sitemap?
If your site is small and well-linked, no: Google will find every page by following links anyway. If you have orphaned pages, a large catalog, or content that updates frequently, a sitemap meaningfully accelerates indexing.
How do I find a site's sitemap?
Try /sitemap.xml first, then /sitemap_index.xml, then check robots.txt for a Sitemap: directive. Brand.dev's sitemap API does this lookup for you.
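A minimal version of that lookup order in Python, using only the standard library; the function name and domain are illustrative, and real code should also handle servers that answer 200 for any path:

```python
import urllib.request
import urllib.robotparser

def find_sitemaps(domain: str) -> list[str]:
    """Locate a site's sitemap(s) using the common lookup order."""
    base = f"https://{domain}"

    # 1-2. Try the two conventional locations first.
    for path in ("/sitemap.xml", "/sitemap_index.xml"):
        try:
            req = urllib.request.Request(base + path, method="HEAD")
            with urllib.request.urlopen(req, timeout=10) as resp:
                if resp.status == 200:
                    return [base + path]
        except OSError:
            pass  # 404s and connection errors fall through to the next step

    # 3. Fall back to the Sitemap: directives in robots.txt
    #    (RobotFileParser.site_maps() is available in Python 3.8+).
    rp = urllib.robotparser.RobotFileParser(base + "/robots.txt")
    rp.read()
    return rp.site_maps() or []

print(find_sitemaps("example.com"))
```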
What's the difference between an XML and HTML sitemap?
XML sitemaps are for crawlers; HTML sitemaps are for humans navigating the site. They serve different audiences and rarely contain the same URLs.
Related terms
- Web crawler: A program that systematically follows links between web pages to discover and index content at scale.
- robots.txt: A plain-text file at the root of a domain that tells crawlers which paths they are allowed (or not allowed) to fetch.
- Canonical URL: The "official" URL for a piece of content when multiple URLs could return the same content, declared via `<link rel="canonical" href="…">`.
- Web scraping: Programmatically extracting structured data from websites that were designed to be read by humans.