Web Scraping & Crawling
What is a sitemap?
An XML file that lists every important URL on a site so search engines and crawlers can discover them efficiently.
A sitemap is a structured index of a site's public URLs, served from a path like /sitemap.xml. Each `<url>` entry carries a required `<loc>` (the page's address) plus optional `<lastmod>`, `<changefreq>`, and `<priority>` fields. Large sites split their URL inventory into a sitemap index that points to multiple child sitemaps.
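As a concrete illustration, here is a minimal single-entry sitemap and one way to read it with Python's standard library; the URL and dates are placeholders:

```python
import xml.etree.ElementTree as ET

# A minimal single-file sitemap; the URL and dates are placeholders.
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/products/widget</loc>
    <lastmod>2024-01-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

root = ET.fromstring(SITEMAP)
for url in root.findall("sm:url", NS):
    loc = url.findtext("sm:loc", namespaces=NS)          # required
    lastmod = url.findtext("sm:lastmod", namespaces=NS)  # optional
    print(loc, lastmod)
```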
For SEO, sitemaps speed up the discovery of new and updated pages; Google does not require them, but it uses them when they exist. For programmatic use, sitemaps are the cleanest possible enumeration of a domain: they are exactly what a polite crawler should fetch first before falling back to link-following.
Brand.dev's Sitemap Extractor API takes any domain, finds the sitemap (or builds one by parsing the index), and returns the deduplicated URL list, including handling of nested sitemap indexes, gzipped feeds, and robots.txt Sitemap directives.
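Rolling this yourself is mostly a recursive walk. The sketch below is a generic illustration of that traversal, not Brand.dev's implementation, and the entry URL is a placeholder: it follows `<sitemapindex>` entries, detects gzipped feeds by their magic bytes, and deduplicates with a set.

```python
import gzip
import urllib.request
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def fetch(url: str) -> bytes:
    """Fetch a sitemap, transparently decompressing gzipped feeds."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        data = resp.read()
    if data[:2] == b"\x1f\x8b":  # gzip magic bytes
        data = gzip.decompress(data)
    return data

def walk(url: str, pages: set[str], visited: set[str]) -> set[str]:
    """Recursively collect page URLs, following nested sitemap indexes."""
    if url in visited:  # guard against cyclic index references
        return pages
    visited.add(url)
    root = ET.fromstring(fetch(url))
    if root.tag.endswith("sitemapindex"):  # an index: recurse into children
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            walk(loc.text.strip(), pages, visited)
    else:  # a plain <urlset>: collect its page URLs
        for loc in root.findall("sm:url/sm:loc", NS):
            pages.add(loc.text.strip())
    return pages

# Placeholder domain; start from whichever sitemap location you discovered.
urls = walk("https://example.com/sitemap.xml", set(), set())
```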
In the wild
- /sitemap.xml, the canonical location
- /sitemap_index.xml, an index pointing to per-section sitemaps (often used by WordPress)
- Image and video sitemaps for richer indexing of media-heavy sites
How Brand.dev uses sitemaps
Endpoints in the Brand.dev API where this concept comes up directly.
FAQ
Do I need a sitemap?
If your site is small and well-linked, no: Google will find every page by following links anyway. If you have orphaned pages, a large catalog, or content that updates frequently, a sitemap meaningfully accelerates indexing.
How do I find a site's sitemap?
Try /sitemap.xml first, then /sitemap_index.xml, then check robots.txt for a Sitemap: directive. Brand.dev's sitemap API does this lookup for you.
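A minimal version of that lookup order in Python, using only the standard library; the function name and domain are illustrative, and real code should also handle servers that answer 200 for any path:

```python
import urllib.request
import urllib.robotparser

def find_sitemaps(domain: str) -> list[str]:
    """Locate a site's sitemap(s) using the common lookup order."""
    base = f"https://{domain}"

    # 1-2. Try the two conventional locations first.
    for path in ("/sitemap.xml", "/sitemap_index.xml"):
        try:
            req = urllib.request.Request(base + path, method="HEAD")
            with urllib.request.urlopen(req, timeout=10) as resp:
                if resp.status == 200:
                    return [base + path]
        except OSError:
            pass  # 404s and connection errors fall through to the next step

    # 3. Fall back to the Sitemap: directives in robots.txt
    #    (RobotFileParser.site_maps() is available in Python 3.8+).
    rp = urllib.robotparser.RobotFileParser(base + "/robots.txt")
    rp.read()
    return rp.site_maps() or []

print(find_sitemaps("example.com"))
```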
What's the difference between an XML and HTML sitemap?
XML sitemaps are for crawlers; HTML sitemaps are for humans navigating the site. They serve different audiences and rarely contain the same URLs.
Related terms
- Web crawler: A program that systematically follows links between web pages to discover and index content at scale.
- robots.txt: A plain-text file at the root of a domain that tells crawlers which paths they are allowed (or not allowed) to fetch.
- Canonical URL: The "official" URL for a piece of content when multiple URLs could return the same content, declared via `<link rel="canonical" href="…">`.
- Web scraping: Programmatically extracting structured data from websites that were designed to be read by humans.