How to Discover Every URL on Any Domain (3 Methods Compared)

If you need to discover all pages on a website, you have three options: write a crawler from scratch, parse sitemaps manually, or use an API that does it for you. The trade-offs between them are bigger than most people expect.

This guide compares all three and shows which one actually works when you move past toy examples and start dealing with real websites at scale.


Why you need a complete URL inventory

Before getting into methods, it helps to understand where this problem actually comes up. It comes up in more places than you'd think.

If you're running an SEO audit, you need to know every page that exists. Miss pages and you miss broken links, orphan content, and indexation gaps. Competitive analysis is similar - a complete URL inventory tells you what a competitor is actually publishing, how deep their product catalog goes, and what regions they're targeting.

AI agents have the same dependency. A research agent, lead enrichment pipeline, or content analyzer all need a full list of URLs as a starting point. You can't analyze what you haven't found. (If you're evaluating tools for this, see our comparison of the top web scraping APIs for AI.)

Data pipelines that scrape product listings, job postings, or documentation all start with URL discovery too. So do site migrations - miss a page and you've created a broken link that loses traffic. Large organizations often don't even know what's on their own websites, since marketing teams, regional offices, and acquired companies all publish independently.

The point: if your URL list is incomplete, everything downstream is incomplete too.


Method 1: Build your own web crawler

Most developers start here. Write a crawler that begins at the homepage, follows every link it finds, and keeps going until there's nothing new. It's the obvious approach, and for small sites, it works fine.

How it works

A basic crawler follows this loop:

  1. Start with a seed URL (usually the homepage)
  2. Fetch the page's HTML
  3. Extract all internal links from anchor tags, navigation elements, and embedded references
  4. Add any new URLs to a queue
  5. Repeat until the queue is empty

A simplified Python version:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse, urldefrag
from collections import deque

def crawl_domain(start_url, max_pages=1000):
    visited = set()
    queue = deque([start_url])
    domain = urlparse(start_url).netloc

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        # Mark as visited before fetching so a failing URL
        # can't be re-queued and retried forever
        visited.add(url)

        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            soup = BeautifulSoup(response.text, 'html.parser')

            for link in soup.find_all('a', href=True):
                # Resolve relative links and strip #fragments so the
                # same page isn't queued once per anchor
                full_url = urldefrag(urljoin(url, link['href'])).url
                parsed = urlparse(full_url)
                if parsed.netloc == domain and full_url not in visited:
                    queue.append(full_url)
        except requests.RequestException:
            continue

    return visited

This works. On simple, well-linked static sites, a basic crawler like this can discover most pages within minutes. The problem is that "simple, well-linked static sites" represent a shrinking fraction of the web.

Where it breaks down

The first problem is JavaScript. Most modern sites use React, Next.js, Vue, or similar frameworks that render client-side. A basic HTTP request gets you a shell HTML document with a JS bundle, not actual content or links. You need a headless browser (Puppeteer, Playwright) to render the DOM first, which adds an order of magnitude more complexity and memory usage.
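Before reaching for a headless browser for every page, it can be worth detecting the shell case up front. A rough sketch using only the standard library - the anchor-count threshold is an assumption you would tune per site, not a hard rule:

```python
from html.parser import HTMLParser

class LinkCounter(HTMLParser):
    """Count <a href> tags in raw (pre-render) HTML."""
    def __init__(self):
        super().__init__()
        self.links = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("href"):
            self.links += 1

def probably_client_rendered(html, min_links=3):
    # Assumption: a server-rendered page exposes at least a few
    # anchors in its raw HTML, while a JS shell exposes almost none
    counter = LinkCounter()
    counter.feed(html)
    return counter.links < min_links
```

If this returns True for a page, route it to the headless browser pool; otherwise the plain HTTP response is enough, which keeps the expensive rendering path to a minimum.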

Then there are crawl traps. Calendar pages that increment dates forever, faceted search URLs with combinatorial parameters, session IDs in URLs, pagination that loops. Without trap detection, your crawler never finishes.
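Trap detection usually comes down to heuristics on the URL itself. A sketch of the idea - the parameter and depth limits here are illustrative assumptions, not universal constants:

```python
import re
from urllib.parse import urlparse, parse_qs

# Illustrative thresholds -- tune per site
MAX_QUERY_PARAMS = 3
MAX_PATH_DEPTH = 8

def looks_like_trap(url):
    parsed = urlparse(url)
    # Faceted search: combinatorial query strings explode the URL space
    if len(parse_qs(parsed.query)) > MAX_QUERY_PARAMS:
        return True
    # Session IDs make every visit look like a brand-new URL
    if re.search(r'(^|&)(session_?id|sid|phpsessid)=', parsed.query, re.I):
        return True
    # Very deep or self-repeating paths suggest a pagination loop
    segments = [s for s in parsed.path.split('/') if s]
    if len(segments) > MAX_PATH_DEPTH or len(segments) != len(set(segments)):
        return True
    return False
```

Run a check like this before appending anything to the queue; it will produce some false positives, but an occasional skipped page beats a crawler that never terminates.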

Websites also fight back. Rate limits, CAPTCHAs, IP bans, Cloudflare, Akamai. Dealing with this means proxy rotation, request throttling, header spoofing, and retry logic. Each defense layer is more code to write and maintain.

There's also the politeness problem (respecting robots.txt, crawl delays, concurrent connection limits) and the legal question of whether you're even allowed to crawl a given site.
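The politeness part, at least, needs little custom code: Python's standard library ships a robots.txt parser. A minimal sketch against a sample rule set (in practice you'd fetch the site's real robots.txt and feed in its lines):

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt rules for illustration
rules = [
    "User-agent: *",
    "Disallow: /admin/",
    "Crawl-delay: 5",
]

parser = RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("my-crawler", "https://example.com/admin/users"))  # False
print(parser.can_fetch("my-crawler", "https://example.com/products"))     # True
print(parser.crawl_delay("my-crawler"))                                   # 5
```

Gate every fetch on can_fetch() and sleep for crawl_delay() between requests to the same host.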

And crawlers have a hard ceiling: they can only find pages that are linked from other pages. Orphan pages - ad landing pages, old campaign URLs, pages that lost their internal links during a redesign - are invisible to any link-following approach.

Finally, there's scale. A 50-page marketing site is easy. A 500,000-page e-commerce site needs job queues, deduplication, headless browser pools, proxy management, and monitoring. At that point you're building infrastructure, not using data.

Verdict

You get full control, but you pay for it with engineering time. Fine for a one-off look at a small site. For production use across multiple domains, you'll spend more time maintaining crawler infrastructure than actually using the data.

Expect to capture 60-80% of discoverable URLs on most sites. JS-rendered content, orphan pages, and crawl traps account for the rest. A basic version takes a few hours to build; a production-grade crawler takes weeks to months.


Method 2: Parse sitemaps yourself

Instead of crawling, go straight to the source: the website's sitemap. A sitemap is an XML file (usually at /sitemap.xml) that lists the URLs a site owner wants search engines to index. When the sitemap exists and is well-maintained, it's faster and more complete than any crawler.

How it works

The basic process is:

  1. Check common sitemap locations (/sitemap.xml, /sitemap_index.xml, etc.)
  2. Check robots.txt for a Sitemap: directive
  3. Download and parse the XML
  4. If it's a sitemap index, recursively fetch each child sitemap
  5. Extract all <loc> tags to get the URL list

A basic implementation:

import requests
import xml.etree.ElementTree as ET

NAMESPACE = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

def parse_sitemap(content, urls):
    root = ET.fromstring(content)

    # Sitemap index: recursively fetch each child sitemap,
    # since children can themselves be indexes
    for sitemap in root.findall('.//ns:sitemap/ns:loc', NAMESPACE):
        child = requests.get(sitemap.text, timeout=15)
        parse_sitemap(child.content, urls)

    # Regular sitemap: collect the page URLs directly
    for loc in root.findall('.//ns:url/ns:loc', NAMESPACE):
        urls.append(loc.text)

def get_sitemap_urls(domain):
    urls = []
    try:
        response = requests.get(f"https://{domain}/sitemap.xml", timeout=15)
        response.raise_for_status()
        parse_sitemap(response.content, urls)
    except (requests.RequestException, ET.ParseError) as e:
        print(f"Error: {e}")
    return urls

Much simpler than crawling. No link extraction, no JS rendering, no crawl queues. A comprehensive sitemap gives you a complete URL list in seconds.

Where it breaks down

Many websites don't have a sitemap at all. Smaller sites, legacy systems, and poorly maintained properties often skip sitemap generation entirely. No sitemap, no results.

Even when a sitemap exists, it's often incomplete. Some CMS platforms only include certain content types. Manually created pages, dynamically generated URLs, and pages in specific subdirectories might be excluded. A sitemap reflects what the site owner configured, not what actually exists on the server.

Sitemaps also aren't always at /sitemap.xml. They could be at /sitemap/sitemap-index.xml, /wp-sitemap.xml, or referenced only in robots.txt. Some use non-standard XML or plain text files with one URL per line. Your parser needs to handle all of this.
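A discovery sketch for those cases: probe a list of common paths (a conventional list, not an exhaustive one) and pull Sitemap: directives out of robots.txt:

```python
# Common sitemap locations to probe -- illustrative, not exhaustive
CANDIDATE_PATHS = [
    "/sitemap.xml",
    "/sitemap_index.xml",
    "/sitemap/sitemap-index.xml",
    "/wp-sitemap.xml",
    "/sitemap.txt",
]

def sitemaps_from_robots(robots_txt):
    """Pull Sitemap: directives out of a robots.txt body."""
    found = []
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found
```

Checking robots.txt first is usually worthwhile: the Sitemap: directive is allowed to point anywhere, including a path no probe list would guess.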

Large sites add more wrinkles: gzipped sitemaps (.xml.gz), deeply nested sitemap index files, malformed XML with unescaped ampersands and broken encoding. A naive parser will choke. And sitemaps go stale - a sitemap generated six months ago won't include anything published since.
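The compression wrinkle, at least, is cheap to handle: gzip streams are identifiable by their first two bytes. A sketch of a parser that accepts either raw XML or a .xml.gz payload:

```python
import gzip
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def parse_sitemap_bytes(data):
    """Parse a sitemap payload, decompressing .xml.gz transparently."""
    # Gzip streams always start with the magic bytes 0x1f 0x8b
    if data[:2] == b"\x1f\x8b":
        data = gzip.decompress(data)
    root = ET.fromstring(data)
    return [loc.text for loc in root.iter(f"{SITEMAP_NS}loc")]
```

Sniffing the magic bytes is more reliable than trusting the file extension or Content-Encoding header, both of which are frequently wrong in the wild.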

If you're fetching sitemaps across thousands of domains, you also hit the same rate limiting and blocking issues as crawling. CDNs and WAFs throttle automated requests to sitemap files, especially from datacenter IPs.

Verdict

Faster and simpler than crawling, and when the sitemap is good, you get better results. But "when the sitemap is good" is doing a lot of work in that sentence. The failure rate across real-world domains - missing sitemaps, incomplete data, broken XML, non-standard paths - means you can't rely on this alone.

Completeness ranges from 0% (no sitemap) to 95%+ (well-maintained sitemap). You won't know which until you try. A basic version takes an hour or two; production-grade handling of all edge cases takes days.


Method 3: Use the Context.dev Sitemap API

The third option: skip building anything and call an API that does it for you.

Context.dev has a Sitemap API that takes a domain and returns a parsed list of every discoverable URL. One endpoint, structured JSON response.

How it works

The API is a single endpoint:

curl -X GET "https://api.context.dev/v1/sitemap?domain=stripe.com" \
  -H "Authorization: Bearer YOUR_API_KEY"

Or using the Node.js SDK:

import ContextDev from 'context-dev';
 
const client = new ContextDev({ apiKey: 'YOUR_API_KEY' });
 
const result = await client.brand.webScrapeSitemap({
	domain: 'stripe.com',
});
 
console.log(result.urls);
// Returns: ['https://stripe.com/', 'https://stripe.com/payments', ...]
console.log(result.meta);
// Returns: { sitemapsDiscovered: 12, sitemapsFetched: 12, ... }

The response includes the full URL list plus metadata about how many sitemaps were discovered and fetched, so you can gauge completeness.

What happens under the hood

It doesn't just check /sitemap.xml. The API checks every known sitemap location (/sitemap.xml, /sitemap_index.xml, /wp-sitemap.xml, CMS-specific paths, robots.txt), follows redirects, and handles HTTPS/HTTP variations.

If it finds a sitemap index, it recursively fetches every child sitemap. Sites like Amazon or Wikipedia with hundreds of nested sitemaps are handled automatically. Compressed .xml.gz files are decompressed transparently.

The parser handles real-world XML, not just spec-compliant XML. Unescaped characters, broken encoding, missing namespaces - all handled without failing the request. Requests are routed through infrastructure that avoids triggering WAFs, CAPTCHAs, and rate limiters, so you don't need your own proxy setup.

The returned URL list is deduplicated and normalized: trailing slashes standardized, parameters handled consistently, fragment identifiers stripped.
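To make the deduplication concrete, here is a rough sketch of the kind of rules involved - the specific rules are illustrative, not Context.dev's actual behavior:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    # Illustrative normalization: lowercase the host, drop the
    # #fragment, strip trailing slashes on non-root paths
    parts = urlsplit(url)
    path = parts.path or "/"
    if path != "/" and path.endswith("/"):
        path = path.rstrip("/")
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, parts.query, ""))

variants = [
    "https://Example.com/pricing/",
    "https://example.com/pricing#features",
    "https://example.com/pricing",
]
print({normalize_url(u) for u in variants})
# All three variants collapse to a single entry
```

Without this step, the same page shows up three times in your URL list and every downstream count is inflated.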

Why this works better in practice

A single API call returns results in seconds. For a domain with 10,000 pages, you get the full URL list in the time a crawler would process its first 50.

Because it combines multiple discovery methods and handles the edge cases described above, the API consistently returns more complete results than either DIY approach. It finds sitemaps a manual check would miss and parses XML that a basic parser can't handle.

You also don't maintain any infrastructure. No browser pools, no proxy lists, no XML parser debugging. The API returns data or an error - no silent failures or partial results that look complete but aren't.

Scale is straightforward: 10,000 domains is the same API call, 10,000 times. Your code doesn't change. And when you factor in engineering time, proxy costs, and ongoing maintenance, the API costs less than building it yourself.

Who uses this

SEO teams pull URL inventories at the start of client engagements to analyze content gaps and indexation issues. AI agent builders use it as the discovery layer - get the full URL map, then selectively scrape the pages the agent needs. Data engineering teams use it as the first step in ETL pipelines for product data, job postings, or ML datasets. Competitive intelligence platforms call it periodically to detect when competitors publish new pages.

Verdict

This is the most practical option for production use. It handles the edge cases that make DIY approaches fragile, and your code stays simple whether you're hitting one domain or ten thousand.

Completeness is 90-99% of publicly accessible URLs, with the most consistent results across diverse domains. Implementation takes minutes.


Head-to-head comparison

Here's how the three methods compare:

Criteria | DIY Crawler | Sitemap Parsing | Context.dev API
Setup time | Days to weeks | Hours to days | Minutes
URL completeness | 60-80% | 0-95% (variable) | 90-99%
JavaScript sites | Requires headless browser | N/A (sitemap-based) | Handled automatically
Handles edge cases | Only what you build for | Only what you code for | Comprehensive
Scale (multi-domain) | Requires distributed infra | Rate limiting issues | Single API call per domain
Maintenance | Ongoing | Moderate | None
Anti-bot handling | Proxy infrastructure needed | Limited issues | Built-in
Orphan page detection | Cannot detect | Depends on sitemap | Depends on sitemap
Cost | High (engineering time + infra) | Medium (engineering time) | Low (API pricing)
Reliability | Fragile at scale | Moderate | High

DIY approaches give you control at the cost of engineering time. The API trades that control for reliability and speed.


Which method should you use?

Build a crawler if you're learning, working on a school project, or need something specific like crawling authenticated pages behind a login. Just know you're signing up for ongoing maintenance.

Parse sitemaps yourself if you're doing a one-off analysis of a single domain that you've confirmed has a good sitemap, and you don't need this to generalize across arbitrary domains.

Use the Context.dev Sitemap API if you need this to work reliably across multiple domains, or if you're building a product or pipeline that depends on complete URL data. For most production use cases, this is the right call.

URL discovery is a means to an end. Nobody's goal is to build a great crawler - it's to use the URL data for SEO, AI workflows, data pipelines, or competitive analysis. The API lets you skip the plumbing.


Getting started with Context.dev

Setup takes a few minutes:

  1. Sign up at context.dev and grab your API key
  2. Install the SDK (if using Node.js): npm install context-dev
  3. Make your first call:
import ContextDev from 'context-dev';
 
const client = new ContextDev({ apiKey: process.env.CONTEXT_DEV_API_KEY });
 
const result = await client.brand.webScrapeSitemap({
	domain: 'example.com',
});
 
console.log(`Found ${result.urls.length} URLs`);
console.log(`Sitemaps discovered: ${result.meta.sitemapsDiscovered}`);

That's it.

Context.dev also provides brand data enrichment, logo retrieval, web scraping, and AI-powered data extraction. The Sitemap API is one piece of a larger platform for working with company web data programmatically.


Conclusion

Discovering every URL on a domain looks simple until you try it. Following links runs into JS rendering, crawl traps, and bot detection. Parsing sitemaps runs into missing files, broken XML, and stale data. Both require more infrastructure than you'd expect once you move past a single small site.

Each method has its place, but if you need this to work reliably across real-world websites, the API approach gives you the most complete results with the least maintenance. Context.dev handles the URL discovery so you can work on what you actually came here to build.

Context at scale

Join 5,000+ businesses using Context.dev to enrich their products with structured web data.