Parsing one XML sitemap is easy. You fetch the file, run it through an XML parser, and pull out the <loc> tags. Twenty lines of Python or Node.js, and you're done.
Parsing 10,000 sitemaps across thousands of domains, where some are gzipped, some are nested three levels deep in sitemap index files, some return JavaScript-rendered pages, and some are just broken XML, is a different problem entirely.
This guide covers what it takes to build an XML sitemap parser that works at scale. We'll go through the basics, hit every wall you'll run into between 100 and 100,000 URLs, and show how Context.dev's Sitemap API lets you skip all of it with a single endpoint.
The basics: parsing a single XML sitemap
Before we get into scale, let's start with the baseline. An XML sitemap follows the Sitemaps protocol, and the structure is simple:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/page-1</loc>
    <lastmod>2026-03-15</lastmod>
  </url>
  <url>
    <loc>https://example.com/page-2</loc>
    <lastmod>2026-02-20</lastmod>
  </url>
</urlset>

A minimal parser in Node.js using fast-xml-parser looks like this:
import { XMLParser } from 'fast-xml-parser';
async function parseSitemap(url: string): Promise<string[]> {
  const response = await fetch(url);
  const xml = await response.text();
  const parser = new XMLParser();
  const result = parser.parse(xml);
  const urls = result.urlset?.url;
  if (!urls) return [];
  return Array.isArray(urls) ? urls.map((u: any) => u.loc) : [urls.loc];
}

In Python, the equivalent with lxml:
import requests
from lxml import etree
def parse_sitemap(url: str) -> list[str]:
    response = requests.get(url, timeout=30)
    root = etree.fromstring(response.content)
    namespace = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
    return [loc.text for loc in root.findall('.//ns:loc', namespace)]

This works for a single, well-formed sitemap. But the moment you start hitting real websites at scale, every assumption in this code breaks.
Wall #1: Sitemap discovery
The first problem isn't parsing. It's finding the sitemap in the first place.
The protocol says sitemaps should live at /sitemap.xml. In practice, websites put them everywhere. Here are real patterns you'll encounter across thousands of domains:
- /sitemap.xml (standard)
- /sitemap_index.xml
- /sitemap/sitemap-index.xml
- /sitemaps/main.xml
- /wp-sitemap.xml (WordPress)
- /sitemap.php
- /server-sitemap-index.xml (Next.js)
- /page-sitemap.xml, /post-sitemap.xml (Yoast)
Your parser needs a discovery layer. The most reliable approach checks two sources:
1. robots.txt parsing. The Sitemap: directive in robots.txt is the most authoritative signal. But not every site includes it, and some robots.txt files contain multiple sitemap directives or point to sitemap index files.
async function findSitemapsFromRobotsTxt(domain: string): Promise<string[]> {
  try {
    const response = await fetch(`https://${domain}/robots.txt`);
    const text = await response.text();
    return text
      .split('\n')
      .filter((line) => line.toLowerCase().startsWith('sitemap:'))
      // Take everything after the first colon; note that JavaScript's
      // split(':', 2) truncates the result, which would drop the URL
      .map((line) => line.substring(line.indexOf(':') + 1).trim());
  } catch {
    return [];
  }
}

2. Common path fallback. If robots.txt doesn't help, you probe a list of known sitemap paths with HEAD requests. More HTTP requests, more latency, but necessary if you want decent coverage.
At scale, this discovery step alone can account for 40-60% of your total request volume. Every domain needs multiple probes before you even start parsing.
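A minimal sketch of that common-path probe, assuming a small sample path list (`COMMON_PATHS`, `candidateUrls`, and `probeSitemapPaths` are illustrative names, not from any library):

```typescript
// A small sample of paths to probe; a production list would be longer
const COMMON_PATHS = [
  '/sitemap.xml',
  '/sitemap_index.xml',
  '/wp-sitemap.xml',
  '/sitemap.php',
  '/server-sitemap-index.xml',
];

// Pure helper: build the candidate URLs for a domain
function candidateUrls(domain: string, paths: string[] = COMMON_PATHS): string[] {
  return paths.map((path) => `https://${domain}${path}`);
}

// Probe each candidate with a cheap HEAD request; keep the ones that answer OK
async function probeSitemapPaths(domain: string): Promise<string[]> {
  const found: string[] = [];
  for (const url of candidateUrls(domain)) {
    try {
      const response = await fetch(url, { method: 'HEAD' });
      if (response.ok) found.push(url);
    } catch {
      // A network error just means this path is not a candidate
    }
  }
  return found;
}
```

Probing sequentially rather than in parallel is a deliberate choice here: it keeps the per-domain request rate low, which matters once anti-bot measures enter the picture.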
Wall #2: Sitemap index recursion
A single XML sitemap is limited to 50,000 URLs and 50MB (for a deeper primer on the format, see our guide on what a sitemap is and why it matters). Large sites use sitemap index files, which are sitemaps that point to other sitemaps. Those child sitemaps can themselves be index files, so you end up with a recursive tree.
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-products-1.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-products-2.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-blog.xml</loc>
  </sitemap>
</sitemapindex>

Your parser needs to differentiate between a <urlset> (contains actual URLs) and a <sitemapindex> (contains pointers to other sitemaps), then recursively resolve the full tree.
Here's where it gets tricky at scale:
- Depth limits. You need a recursion depth limit to avoid infinite loops from circular references. Three to four levels is typically safe.
- Fanout. A single sitemap index might point to 500 child sitemaps. If each of those is also an index, you're suddenly making 250,000 HTTP requests for a single domain.
- Concurrency control. You can't fire all these requests at once without hammering the target server. You need per-domain rate limiting and connection pooling.
- Partial failures. What happens when 3 out of 500 child sitemaps return 404? You need to decide whether to fail the entire parse or return partial results.
async function resolveAllUrls(url: string, depth: number = 0): Promise<string[]> {
  if (depth > 3) return [];
  const xml = await fetchAndParse(url);

  // It's a sitemap index — recurse
  if (xml.sitemapindex) {
    const children = Array.isArray(xml.sitemapindex.sitemap)
      ? xml.sitemapindex.sitemap
      : [xml.sitemapindex.sitemap];
    const results = await Promise.allSettled(
      children.map((child: any) => resolveAllUrls(child.loc, depth + 1))
    );
    return results
      .filter((r) => r.status === 'fulfilled')
      .flatMap((r) => (r as PromiseFulfilledResult<string[]>).value);
  }

  // It's a urlset — extract URLs
  if (xml.urlset?.url) {
    const urls = Array.isArray(xml.urlset.url) ? xml.urlset.url : [xml.urlset.url];
    return urls.map((u: any) => u.loc);
  }

  return [];
}

This is the recursive resolution code. It handles the happy path. But production code needs timeout handling, retry logic, deduplication, and memory management on top of this.
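As one sketch of that hardening, here is a fetch wrapper with a timeout and exponential-backoff retries (`fetchWithRetry` and `backoffMs` are illustrative names, and the retry counts and timeouts are assumptions):

```typescript
// Delay schedule: 1s, 2s, 4s, ... for attempts 0, 1, 2, ...
function backoffMs(attempt: number, baseMs: number = 1000): number {
  return 2 ** attempt * baseMs;
}

async function fetchWithRetry(
  url: string,
  retries: number = 2,
  timeoutMs: number = 15000
): Promise<string> {
  for (let attempt = 0; ; attempt++) {
    // Abort the request if it exceeds the timeout
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs);
    try {
      const response = await fetch(url, { signal: controller.signal });
      if (!response.ok) throw new Error(`HTTP ${response.status}`);
      return await response.text();
    } catch (err) {
      if (attempt >= retries) throw err;
      // Wait with exponential backoff before the next attempt
      await new Promise((resolve) => setTimeout(resolve, backoffMs(attempt)));
    } finally {
      clearTimeout(timer);
    }
  }
}
```

Swapping `fetchAndParse`'s raw fetch for something like this is usually the first production hardening step; deduplication and memory limits come next.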
Wall #3: Gzip compression
Large sitemaps are often served compressed. You'll encounter two patterns:
- Gzipped files at .xml.gz URLs. The URL itself tells you it's compressed.
- Gzip content-encoding on .xml URLs. The server compresses the response transparently via HTTP headers.
Your parser needs to handle both. The first case requires explicit decompression. The second is usually handled by your HTTP client's Accept-Encoding header, but not always. Some servers serve gzipped content without the correct Content-Encoding header, and you just get garbled data that fails XML parsing.
import { gunzipSync } from 'zlib';
async function fetchSitemapContent(url: string): Promise<string> {
  const response = await fetch(url);
  const buffer = Buffer.from(await response.arrayBuffer());
  // Check for gzip magic number (1f 8b)
  if (buffer[0] === 0x1f && buffer[1] === 0x8b) {
    return gunzipSync(buffer).toString('utf-8');
  }
  return buffer.toString('utf-8');
}

The gzip magic number check is a defensive pattern. Instead of trusting the URL extension or HTTP headers, you inspect the actual bytes. This catches the edge cases that break naive implementations.
At scale, gzip handling also affects memory management. A 50MB uncompressed sitemap might be only 5MB compressed. If you're processing hundreds of these concurrently, the memory spike during decompression can crash your process.
Wall #4: Malformed XML
The XML spec is strict. Real-world sitemaps are not. Here's what you'll actually see across thousands of domains:
- Unescaped ampersands in URLs: a bare & where &amp; is required
- Missing XML declarations or incorrect encoding headers
- Namespace mismatches where the namespace is omitted or non-standard
- BOM characters at the start of the file
- HTML mixed into XML because the server returns a full HTML page with embedded XML when the sitemap is dynamically generated
- Truncated files where the server cut the response mid-stream
- UTF-8 encoding errors with invalid byte sequences
A strict XML parser rejects all of these. A production sitemap parser has to deal with them:
function sanitizeXml(raw: string): string {
  let xml = raw;
  // Strip BOM
  xml = xml.replace(/^\uFEFF/, '');
  // Strip content before the XML declaration or first tag
  const starts = [
    xml.indexOf('<?xml'),
    xml.indexOf('<urlset'),
    xml.indexOf('<sitemapindex'),
  ].filter((i) => i >= 0);
  if (starts.length > 0) {
    const start = Math.min(...starts);
    if (start > 0) xml = xml.substring(start);
  }
  // Escape ampersands that aren't already part of an entity
  xml = xml.replace(/&(?!amp;|lt;|gt;|quot;|apos;|#)/g, '&amp;');
  return xml;
}

Even with sanitization, some sitemaps are so mangled that no XML parser will accept them. At that point you fall back to regex, pulling URLs directly from the raw text with <loc>(.*?)</loc>. Not pretty, but it works.
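A minimal version of that regex fallback might look like this (the function name and the http-prefix sanity filter are illustrative choices):

```typescript
// Last-resort extraction: pull <loc> values straight from the raw text,
// tolerating surrounding whitespace, and keep only plausible URLs
function extractLocsWithRegex(raw: string): string[] {
  const matches = raw.matchAll(/<loc>\s*(.*?)\s*<\/loc>/g);
  return [...matches].map((m) => m[1]).filter((u) => u.startsWith('http'));
}
```

This trades correctness guarantees for coverage: it will happily read a `<loc>` inside a comment or CDATA section. But when the alternative is zero URLs from a domain, that trade is usually worth making.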
Wall #5: Rate limiting, blocking, and anti-bot measures
When you're hitting sitemaps across hundreds or thousands of domains, you're making a lot of HTTP requests. Web servers notice.
Rate limiting. Many servers return 429 status codes after a burst of requests. Your parser needs exponential backoff with per-domain tracking. A global retry strategy doesn't work because Domain A's rate limit has nothing to do with Domain B's.
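A sketch of what per-domain tracking can look like, assuming a simple in-memory map (`DomainBackoff` is an illustrative name, and the doubling schedule with a 60-second cap is an assumption):

```typescript
class DomainBackoff {
  // Consecutive 429 count per domain
  private failures = new Map<string, number>();

  // Record a 429 and return how long to wait before retrying this domain
  record429(domain: string): number {
    const count = (this.failures.get(domain) ?? 0) + 1;
    this.failures.set(domain, count);
    return Math.min(2 ** count * 1000, 60_000); // 2s, 4s, 8s, ... capped at 60s
  }

  // A successful response resets the domain's backoff
  recordSuccess(domain: string): void {
    this.failures.delete(domain);
  }
}
```

Because the state is keyed by domain, a burst of 429s from one site never slows down requests to the others.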
IP blocking. Some servers block IPs that make too many requests too quickly. Running from a single server, you'll eventually get blocked by sites with aggressive bot detection. That pushes you into IP rotation, proxy pools, or residential proxy networks, which cost money and add complexity.
Cloudflare and bot detection. A growing number of sites sit behind Cloudflare or similar CDNs that serve challenge pages to automated requests. Your sitemap request returns a 200 OK, but the body is a JavaScript challenge, not XML. Without a headless browser to solve the challenge, you get no data.
Dynamic rendering. Some sites generate sitemaps through JavaScript. A standard HTTP GET returns an empty shell or a loading page. You need a headless browser (Puppeteer, Playwright) to render the page and get the actual XML. This comes up a lot with SPAs and headless CMS setups.
Each of these problems needs its own infrastructure: proxy management, headless browser pools, challenge-solving services, per-domain rate limiting databases. This is where the engineering cost of a DIY sitemap parser starts to spiral.
Wall #6: Memory and performance at scale
Parsing a 50MB XML file into a DOM tree can eat 500MB+ of RAM. Process a few large sitemaps concurrently and you'll blow through your memory budget fast. Streaming XML parsers fix this.
SAX-style parsers (like sax-js in Node.js or iterparse in Python's lxml) process XML as a stream of events instead of building a full in-memory tree:
import io

from lxml import etree

def parse_sitemap_streaming(content: bytes) -> list[str]:
    urls = []
    namespace = 'http://www.sitemaps.org/schemas/sitemap/0.9'
    for event, element in etree.iterparse(
        io.BytesIO(content), events=('end',)
    ):
        if element.tag == f'{{{namespace}}}loc':
            urls.append(element.text)
        element.clear()  # Free memory as we go
    return urls

Beyond memory, there are performance bottlenecks at every layer:
- DNS resolution. Thousands of domains means thousands of DNS lookups. Caching and pre-resolution help, but DNS can still be a bottleneck.
- Connection overhead. TLS handshakes are expensive. Connection pooling with keep-alive helps for multiple requests to the same domain, but across thousands of domains, every request is a cold start.
- Deduplication. Sitemaps frequently contain duplicate URLs, especially when a sitemap index has overlapping child sitemaps. At 100,000+ URLs, the deduplication set alone becomes a nontrivial data structure to manage in memory.
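A minimal sketch of that deduplication step, using a Set (`dedupeUrls` is an illustrative helper, and the normalization rules for fragments and trailing slashes are assumptions about what counts as "the same" URL):

```typescript
function dedupeUrls(urls: Iterable<string>): string[] {
  const seen = new Set<string>();
  const out: string[] = [];
  for (const raw of urls) {
    // Normalize trivial variants: drop fragments and trailing slashes
    const key = raw.replace(/#.*$/, '').replace(/\/+$/, '');
    if (!seen.has(key)) {
      seen.add(key);
      out.push(raw);
    }
  }
  return out;
}
```

At hundreds of thousands of URLs, an in-memory Set like this is where the cost shows up; past that point you are looking at Bloom filters or an on-disk store.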
The build vs. buy inflection point
A basic sitemap parser takes a day to build. A production parser that handles all the edge cases above takes weeks. And then you maintain it forever: new anti-bot measures, parsing failures on sites you haven't seen before, proxy infrastructure, uptime monitoring.
The inflection point usually hits around 1,000 domains. Below that, a scrappy script with some error handling works fine. Above that, you're building infrastructure that has nothing to do with your actual product.
The Context.dev Sitemap API
Context.dev has a Sitemap API that handles everything above in a single API call. Give it a domain, get back a parsed, deduplicated list of URLs. Discovery, recursion, gzip, malformed XML, rate limiting, anti-bot bypass: all handled on Context.dev's infrastructure.
How it works
One API call, one endpoint:
curl -X GET "https://api.context.dev/v1/sitemap?url=https://example.com" \
  -H "Authorization: Bearer YOUR_API_KEY"

The response is a JSON array of URLs, deduplicated and normalized.
With the official SDK:
import BrandDev from 'brand-dev';

const client = new BrandDev();
const result = await client.brand.webScrapeSitemap({
  domain: 'shopify.com',
});

console.log(result.urls);
// ['https://shopify.com/', 'https://shopify.com/pricing', ...]
console.log(result.urls.length);
// 487

And the same call with the Python SDK:

from brand_dev import BrandDev

client = BrandDev()
result = client.brand.web_scrape_sitemap(domain="shopify.com")
for url in result.urls:
    print(url)
What it handles under the hood
Every wall described above is a solved problem inside Context.dev's infrastructure. The API checks robots.txt, probes common sitemap paths, and follows redirects for discovery. It walks the full sitemap index tree for recursive resolution. Gzip is decompressed transparently. Broken XML is normalized, with fallback to text extraction. Proxy rotation, challenge solving, and headless rendering handle anti-bot measures. And the returned URL list is already deduplicated.
Real workflow patterns
The Sitemap API works best as a discovery layer feeding into other operations. A few common patterns:
Sitemap to full-site scrape. Get every URL, then feed them into Context.dev's Markdown scraping API for the content of each page as clean Markdown:
const sitemap = await client.brand.webScrapeSitemap({
  domain: 'competitor.com',
});
const pages = await Promise.all(
  sitemap.urls.slice(0, 50).map((url) => client.brand.webScrapeMd({ url }))
);
// Now you have clean Markdown for up to 50 pages
// Ready for RAG indexing, competitive analysis, or AI processing

Sitemap-powered competitive monitoring. Pull sitemaps for a set of competitors on a schedule. Diff the URL lists to detect new pages, removed pages, and structural changes:
const currentUrls = new Set((await client.brand.webScrapeSitemap({ domain })).urls);
const previousUrls = new Set(loadPreviousSnapshot(domain));
const newPages = [...currentUrls].filter((u) => !previousUrls.has(u));
const removedPages = [...previousUrls].filter((u) => !currentUrls.has(u));Sitemap to brand enrichment. Combine the Sitemap API with Context.dev's brand data API to build company profiles with logos, colors, products, and page inventories:
const [sitemap, brand] = await Promise.all([
  client.brand.webScrapeSitemap({ domain: 'target.com' }),
  client.brand.retrieve({ domain: 'target.com' }),
]);
// sitemap.urls → full page inventory
// brand.brand.logos → company logos
// brand.brand.colors → brand colors
// brand.brand.industries → industry classification

Context.dev is the only API that gives you sitemap parsing, web scraping, and brand intelligence in a single platform.
Performance comparison: DIY vs. Context.dev
Here's what a typical DIY implementation looks like versus the Context.dev API for a batch of 1,000 domains:
| Metric | DIY Implementation | Context.dev API |
|---|---|---|
| Setup time | 2-4 weeks | 5 minutes |
| HTTP requests per domain | 5-20 (discovery + parsing) | 1 |
| Infrastructure needed | Proxy pool, headless browsers, queue, database | None |
| Handles gzip | If you build it | Yes |
| Handles malformed XML | If you build it | Yes |
| Anti-bot bypass | Requires proxy infrastructure | Built-in |
| Recursive resolution | If you build it | Automatic |
| Ongoing maintenance | Continuous | Zero |
| Success rate (across diverse domains) | 60-80% | 95%+ |
The success rate gap matters most. A DIY parser will fail silently on a chunk of domains because of anti-bot measures, unexpected formats, or infrastructure issues you haven't hit yet. Context.dev has already seen and fixed those failures across millions of domains.
When DIY still makes sense
There are cases where building your own parser is the right call:
- Single domain, internal use. If you're parsing your own sitemap for internal monitoring, a simple script is fine. You control the format and there are no anti-bot issues.
- Highly custom parsing logic. If you need to extract and process sitemap metadata (like <lastmod>, <changefreq>, or custom XML extensions) in domain-specific ways, a custom parser gives you full control.
- Airgapped environments. If your infrastructure can't make external API calls, you'll need a self-hosted solution.
For everything else (multi-domain extraction, competitive intelligence, AI pipelines, SEO auditing at scale) the API approach wins on engineering time, reliability, and total cost.
Getting started
Here's how to go from zero to extracting URLs from any sitemap in under two minutes:
1. Get an API key. Sign up at context.dev and grab your API key from the dashboard.
2. Install the SDK.
# Node.js / TypeScript
npm install brand-dev
# Python
pip install brand-dev

3. Extract URLs from any domain's sitemap.
import BrandDev from 'brand-dev';

const client = new BrandDev(); // Uses BRAND_DEV_API_KEY env variable
const result = await client.brand.webScrapeSitemap({
  domain: 'stripe.com',
});

console.log(`Found ${result.urls.length} URLs`);
result.urls.forEach((url) => console.log(url));

4. Pipe into your workflow. Feed the URLs into scraping, analysis, indexing, or whatever your application needs.
XML sitemap parsing looks simple until it isn't. The gap between parsing one well-formed sitemap and reliably extracting URLs from thousands of real-world domains is filled with recursive index files, gzip, broken XML, anti-bot measures, and infrastructure overhead that has nothing to do with the problem you're actually trying to solve.
Context.dev's Sitemap API collapses that into a single API call. One endpoint, any domain, parsed and deduplicated URLs back in seconds.
If you're building something that needs sitemap data at scale, get your API key and try it. Two minutes from signup to your first parsed sitemap.