Context.dev

List Crawling: How to Extract Structured Data from Listings at Scale

List crawling is the process of turning listing pages into structured records: products from a category grid, jobs from a careers page, locations from a store finder, people from a directory, properties from a real-estate index, or resources from a documentation site. The technical challenge is not just fetching one page. It is discovering the right pages, following pagination, avoiding duplicate records, extracting fields consistently, and returning data in a shape your application can trust.

The old way to do this was to build a crawler, scrape HTML, write selectors for every field, handle JavaScript rendering, store raw pages, run a parser, and retry every failure path yourself. That still works in controlled environments, and this guide covers those options. But for most product teams, the better default is to send the target URL and the schema you want to Context.dev's structured website extraction API. The API crawls the relevant internal pages, uses your JSON Schema to decide what matters, and returns a typed data object with crawl metadata.

This guide is intentionally practical. You will see three plug-and-play Context.dev examples in Python, Ruby, and TypeScript. The TypeScript version uses Zod to define the output shape, generates JSON Schema from that Zod model, sends it to Context.dev, and then validates the returned data with the same Zod schema. After the managed approach, we will walk through the manual alternatives: sitemap crawling, URL pattern generation, HTML parsing, pagination, infinite scroll, detail-page enrichment, deduping, storage, retries, and scale limits.

What List Crawling Actually Means

People often use web crawling, web scraping, and list crawling interchangeably. They overlap, but they are not the same engineering problem.

| Term | Scope | Typical output |
|---|---|---|
| Web crawling | Traverse pages by following links | URL graph, page corpus, crawl frontier |
| Web scraping | Extract data from pages | JSON, CSV, Markdown, screenshots, raw HTML |
| List crawling | Extract repeated records from listing/index pages | Homogeneous records like products, jobs, offices, events, or articles |

A list crawler usually has two layers:

  1. Index extraction: Pull repeated records from listing pages. Examples: title, price, location, thumbnail, detail URL.
  2. Detail enrichment: Visit each detail page only when needed. Examples: long description, full specs, contact details, policies, photos, source citations.

The index layer is where you get breadth. The detail layer is where you get depth. A good crawler separates those concerns so it can deduplicate records before spending extra requests on detail pages.

When Context.dev Should Be Your First Option

Use Context.dev first when the output matters more than owning crawler infrastructure. That includes AI agents, enrichment pipelines, CRM imports, product monitoring, competitive research, onboarding flows, due-diligence tools, and any workflow where you need structured data from arbitrary websites.

The structured website extraction endpoint accepts:

  • url: the starting website URL.
  • schema: a JSON Schema object describing the data shape you want back.
  • instructions: plain-English extraction guidance.
  • maxPages: how many relevant pages to analyze, from 1 to 50.
  • maxDepth: optional link-depth control.
  • factCheck: when true, unsupported values are returned as null or empty instead of inferred.
  • followSubdomains: whether to crawl subdomains of the starting domain.
  • includeFrames, waitForMs, maxAgeMs, stopAfterMs, and timeoutMS for rendering, caching, and request control.

The response includes:

  • status: usually "ok" for a successful call.
  • url: the starting URL.
  • urls_analyzed: the actual pages used for extraction.
  • data: the object matching your schema.
  • metadata: crawl counts such as attempted, succeeded, failed, skipped, and max depth.
  • key_metadata: credit information when a valid API key is used.

That shape matters. In manual crawlers, engineers often treat extraction as a pile of strings. With Context.dev, you define the contract before the crawl starts. The crawler uses that contract to prioritize relevant pages and the application can validate the result before storing it.

Example Use Case: Crawl a Company Directory

To keep the samples concrete, imagine you want to extract companies from a public company directory. The output should be a list of companies with a name, description, location, batch or cohort, and URL. You also want a top-level name for the directory so the stored records are easy to trace later.

The schema is intentionally modest. You can add founders, industries, funding stage, headcount, social links, or source quotes later. Start with fields you will actually use.

Python: Raw HTTPS Request

This Python example uses only the standard library. It is easy to paste into a job, Lambda function, cron task, or notebook. Set CONTEXT_DEV_API_KEY before running it.

import json
import os
import urllib.request
 
API_URL = "https://api.context.dev/v1/web/extract"
 
schema = {
    "type": "object",
    "properties": {
        "directory_name": {
            "type": "string",
            "description": "The name of the directory or listing page."
        },
        "companies": {
            "type": "array",
            "description": "Companies found on the directory or listing pages.",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "description": {"type": "string"},
                    "location": {"type": "string"},
                    "batch": {"type": "string"},
                    "url": {"type": "string"}
                },
                "required": ["name", "description", "location", "batch", "url"],
                "additionalProperties": False
            }
        }
    },
    "required": ["directory_name", "companies"],
    "additionalProperties": False
}
 
payload = {
    "url": "https://www.ycombinator.com/companies",
    "schema": schema,
    "instructions": (
        "Extract companies from the directory listing. "
        "Keep URLs absolute when possible. Use empty strings for missing text fields."
    ),
    "maxPages": 5,
    "maxDepth": 1,
    "factCheck": True
}
 
api_key = os.environ["CONTEXT_DEV_API_KEY"]
request = urllib.request.Request(
    API_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    },
    method="POST",
)
 
with urllib.request.urlopen(request, timeout=120) as response:
    result = json.loads(response.read().decode("utf-8"))
 
print(json.dumps(result["data"], indent=2))
print("URLs analyzed:", result.get("urls_analyzed", []))
print("Crawl metadata:", result.get("metadata", {}))

For production, wrap the request in a small helper that catches urllib.error.HTTPError, logs error_code, and retries only transient cases such as 408, 429, and 500. Do not retry validation errors without changing the input.
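A minimal sketch of such a helper follows. It assumes the error body is JSON that may carry an error_code field; the name call_with_retries, the send callable, and the doubling delay are illustrative choices, not part of the API.

```python
import json
import time
import urllib.error

# Status codes this guide treats as transient; adjust to your own policy.
RETRYABLE_STATUSES = {408, 429, 500}


def call_with_retries(send, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call send() and retry only transient HTTP failures.

    send is any zero-argument callable that performs the request and returns
    the parsed result, for example a function wrapping urllib.request.urlopen.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except urllib.error.HTTPError as error:
            raw = error.read().decode("utf-8", errors="replace")
            try:
                # Assumption: the error body is JSON and may carry error_code.
                error_code = json.loads(raw).get("error_code")
            except (ValueError, AttributeError):
                error_code = None
            print(f"attempt {attempt} failed: HTTP {error.code} error_code={error_code}")

            # Validation and auth errors will not succeed on retry.
            if error.code not in RETRYABLE_STATUSES or attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

To use it with the script above, move the urlopen and json.loads lines into a small function and pass that function as send. The sleep parameter exists so tests can replace the real delay.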

Ruby: Raw HTTPS Request

Ruby can call the same endpoint with net/http and json. This keeps the sample dependency-free and easy to run from Rails jobs, Rake tasks, or one-off scripts.

require "json"
require "net/http"
require "uri"
 
api_key = ENV.fetch("CONTEXT_DEV_API_KEY")
uri = URI("https://api.context.dev/v1/web/extract")
 
schema = {
  type: "object",
  properties: {
    directory_name: {
      type: "string",
      description: "The name of the directory or listing page."
    },
    companies: {
      type: "array",
      description: "Companies found on directory or listing pages.",
      items: {
        type: "object",
        properties: {
          name: { type: "string" },
          description: { type: "string" },
          location: { type: "string" },
          batch: { type: "string" },
          url: { type: "string" }
        },
        required: ["name", "description", "location", "batch", "url"],
        additionalProperties: false
      }
    }
  },
  required: ["directory_name", "companies"],
  additionalProperties: false
}
 
payload = {
  url: "https://www.ycombinator.com/companies",
  schema: schema,
  instructions: "Extract companies from the directory listing. Keep URLs absolute when possible.",
  maxPages: 5,
  maxDepth: 1,
  factCheck: true
}
 
request = Net::HTTP::Post.new(uri)
request["Authorization"] = "Bearer #{api_key}"
request["Content-Type"] = "application/json"
request.body = JSON.generate(payload)
 
response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true, read_timeout: 120) do |http|
  http.request(request)
end
 
unless response.is_a?(Net::HTTPSuccess)
  warn "Context.dev request failed: #{response.code} #{response.body}"
  exit 1
end
 
result = JSON.parse(response.body)
puts JSON.pretty_generate(result.fetch("data"))
puts "URLs analyzed: #{result.fetch("urls_analyzed", []).join(", ")}"
puts "Crawl metadata: #{JSON.generate(result.fetch("metadata", {}))}"

If you use Rails, persist result["data"]["companies"] only after checking that the required keys exist. JSON Schema controls the API response shape, but your application should still treat external responses as untrusted at the boundary.

TypeScript: Zod Schema, JSON Schema, and Runtime Validation

In TypeScript, the cleanest pattern is to define your output once with Zod, convert it to JSON Schema for Context.dev, and parse the returned data with the same Zod model.

Install the dependencies:

npm install zod@^4

Then run this script with Node 22+ or with tsx:

import { z } from "zod";
 
const Company = z.object({
  name: z.string(),
  description: z.string(),
  location: z.string(),
  batch: z.string(),
  url: z.string(),
});
 
const CompanyDirectory = z.object({
  directory_name: z.string(),
  companies: z.array(Company),
});
 
const payload = {
  url: "https://www.ycombinator.com/companies",
  schema: z.toJSONSchema(CompanyDirectory),
  instructions:
    "Extract companies from the directory listing. Keep URLs absolute when possible. Use empty strings for missing text fields.",
  maxPages: 5,
  maxDepth: 1,
  factCheck: true,
};
 
async function main() {
  const apiKey = process.env.CONTEXT_DEV_API_KEY;
  if (!apiKey) {
    throw new Error("Set CONTEXT_DEV_API_KEY before running this script.");
  }
 
  const response = await fetch("https://api.context.dev/v1/web/extract", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(payload),
  });
 
  if (!response.ok) {
    throw new Error(`Context.dev request failed: ${response.status} ${await response.text()}`);
  }
 
  const result = await response.json();
  const data = CompanyDirectory.parse(result.data);
 
  console.log(data.companies);
  console.log("URLs analyzed:", result.urls_analyzed);
  console.log("Crawl metadata:", result.metadata);
}
 
main().catch((error) => {
  console.error(error);
  process.exit(1);
});

This gives you two layers of protection. Context.dev receives a precise JSON Schema and returns data in that shape. Your TypeScript process then validates the returned value again before using it. If the website changes or you tighten the schema later, failures happen at the ingestion boundary instead of leaking bad records into your database.

If your application is still on Zod 3, use zod-to-json-schema for the conversion step:

import { zodToJsonSchema } from "zod-to-json-schema";
 
const payload = {
  url: "https://www.ycombinator.com/companies",
  schema: zodToJsonSchema(CompanyDirectory),
};

For new projects, use Zod 4 and z.toJSONSchema.

How to Design a Good Extraction Schema

A Context.dev extraction is only as good as the schema you give it. The schema is not just validation. It is also the API's guide to what pages and fields matter.

Good list-crawling schemas are:

  • Specific: employment_type is better than misc.
  • Typed: use booleans and numbers when the downstream system needs booleans and numbers.
  • Nested only when useful: arrays of objects are great for records; deeply nested objects can make later storage harder.
  • Described: each important field should include a short description.
  • Strict: additionalProperties: false keeps output predictable.

For example, this product-listing schema is more useful than a generic scrape:

{
  "type": "object",
  "properties": {
    "products": {
      "type": "array",
      "description": "Products listed on category, collection, or product listing pages.",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "price": { "type": "string" },
          "currency": { "type": "string" },
          "available": { "type": "boolean" },
          "product_url": { "type": "string" },
          "image_url": { "type": "string" }
        },
        "required": ["name", "price", "currency", "available", "product_url", "image_url"],
        "additionalProperties": false
      }
    }
  },
  "required": ["products"],
  "additionalProperties": false
}

Resist the temptation to ask for every field on day one. Start with the core record that makes your product work. Add fields once you know how they will be used, stored, displayed, and refreshed.

Tuning Context.dev for List Crawling

The most important parameters for list crawling are maxPages, maxDepth, instructions, and factCheck.

maxPages

Use maxPages to control breadth. The default is 5, and the hard cap is 50. For a small careers page, 3 to 5 pages may be enough. For a documentation portal, ecommerce category, or large directory, use more pages.

Good defaults:

| Scenario | Suggested maxPages |
|---|---|
| One listing page with no detail pages | 1 |
| Careers page with role detail pages | 5-10 |
| Small directory or marketplace category | 10-25 |
| Larger listing site where you need broad coverage | 25-50 |

maxDepth

Use maxDepth when the starting URL is broad. A depth of 0 means only the starting page. A depth of 1 means links from the starting page. For most list crawling, maxDepth: 1 or maxDepth: 2 is enough. Unlimited depth can be useful for discovery, but it can also spend pages on legal pages, blog posts, navigation links, and irrelevant content.

instructions

Instructions should resolve ambiguity. Tell the API which pages to prefer and how to treat missing values.

Good instruction:

Focus on product listing and product detail pages. Return only products currently shown as available. Keep product URLs absolute. Use an empty string when a visible value is missing.

Weak instruction:

Get products.

Instructions do not need to repeat the whole schema. Use them to describe prioritization, interpretation, filtering, and source constraints.

factCheck

Set factCheck: true when the output must be grounded in explicit page content. This is the right choice for prices, policies, job titles, office addresses, compliance clauses, source quotes, and anything stored as a fact in your product.

Leave factCheck false when you intentionally want reasonable inferences, such as ideal customer profile, positioning, competitor category, or recommended tags. For list crawling, most production ingestion should use factCheck: true.
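Because fact-checked extractions may return null or empty values for unsupported fields, ingestion code needs a rule for incomplete records. The sketch below shows one option: route them to a review list instead of storing them. The function name split_complete and the sample records are illustrative.

```python
def split_complete(records, required_fields):
    """Separate records with all required values from those missing some.

    A value counts as missing when it is None or a blank string, which is how
    this sketch interprets "null or empty" from a fact-checked extraction.
    """
    complete, incomplete = [], []
    for record in records:
        missing = [
            field
            for field in required_fields
            if record.get(field) is None or str(record.get(field)).strip() == ""
        ]
        (incomplete if missing else complete).append(record)
    return complete, incomplete


companies = [
    {"name": "Acme", "url": "https://acme.example", "location": "Berlin"},
    {"name": "Globex", "url": "https://globex.example", "location": None},
    {"name": "", "url": "https://unknown.example", "location": "Austin"},
]
complete, incomplete = split_complete(companies, ["name", "url", "location"])
print(len(complete), len(incomplete))
```

Keeping the incomplete records, rather than dropping them, makes it easier to see which sources regularly lack a field.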

Multiple List-Crawling Options

Context.dev is the recommended default because it collapses crawling, rendering, page selection, and extraction into one typed API call. Still, there are cases where a manual crawler is appropriate. Here are the main options.

Option 1: Managed Structured Extraction

Use Context.dev when:

  • The target site structure varies across customers.
  • You do not want to maintain selectors.
  • You need JavaScript rendering, PDF parsing, or internal link prioritization.
  • You want output directly shaped by JSON Schema.
  • You need to ship quickly and keep maintenance low.

Tradeoff: you pay per successful extraction call and you should tune maxPages so the API analyzes the pages that matter.

Option 2: Sitemap Discovery Plus Structured Extraction

Use a sitemap when:

  • The site publishes a complete sitemap.xml.
  • You can identify listing URLs by pattern.
  • You want to batch many known URLs through extraction.

The workflow is: fetch sitemap, filter URLs, call Context.dev for the most important listing or detail URLs, and store the typed results.

Option 3: Programmatic URL Generation

Use generated URLs when the site exposes predictable pagination:

  • /jobs?page=1
  • /products/shoes?page=2
  • /directory/california/san-francisco/page/3

This can be fast, but it is brittle. The moment the site changes pagination shape, your crawler breaks.
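For query-parameter pagination, a small helper can generate the URLs while preserving any filters already in the query string. This is a sketch under the assumption that the site uses a single page parameter; the names with_page and page_urls are illustrative, and path-style pagination such as /page/3 would need a different template.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_page(url, page, param="page"):
    """Return url with the page query parameter set, keeping other parameters."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    query.append((param, str(page)))
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(query), parts.fragment)
    )


def page_urls(url, first=1, last=3, param="page"):
    """Generate listing URLs for an inclusive page range."""
    return [with_page(url, page, param) for page in range(first, last + 1)]


print(page_urls("https://example.com/jobs?team=eng", last=2))
# ['https://example.com/jobs?team=eng&page=1', 'https://example.com/jobs?team=eng&page=2']
```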

Option 4: HTML Selector Scraper

Use selectors when:

  • You control the target HTML.
  • The target has stable markup.
  • You need maximum speed and minimal cost for one known site.

This is the classic BeautifulSoup, Cheerio, Nokogiri, or Playwright approach. It is efficient, but every field becomes a maintenance responsibility.

Option 5: Browser Automation

Use Playwright or Puppeteer when:

  • Data appears only after JavaScript execution.
  • Infinite scroll has no accessible JSON endpoint.
  • You need to click filters or tabs before data appears.

This is the most expensive manual approach. It is slower, more failure-prone, and harder to scale, but sometimes necessary.

Manual Crawling: Anatomy of a Listing Page

If you do build manually, inspect the target before writing code. Every listing page has the same conceptual shape:

page
└── item container x N
    ├── title or name
    ├── summary fields
    ├── image or logo
    ├── tags or metadata
    └── detail URL

Identify:

  • The repeating container selector, such as article.product-card.
  • The fields inside the container.
  • The detail URL.
  • The pagination mechanism.
  • Whether content exists in server-rendered HTML, JSON embedded in the page, a backend API, or client-rendered DOM.

Do not start by coding the happy path. First, inspect malformed examples: sponsored cards, empty states, unavailable products, hidden promoted listings, missing images, and cards with different layouts. Those are what break crawlers in production.

Manual Option A: Sitemap Discovery in Python

Sitemaps are a good way to discover candidate listing pages. This script handles sitemap indexes and regular URL sets.

import urllib.request
from xml.etree import ElementTree
 
SITEMAP_URL = "https://example.com/sitemap.xml"
 
def fetch_xml(url: str) -> bytes:
    request = urllib.request.Request(url, headers={"User-Agent": "ListCrawler/1.0"})
    with urllib.request.urlopen(request, timeout=20) as response:
        return response.read()
 
def sitemap_urls(url: str) -> list[str]:
    root = ElementTree.fromstring(fetch_xml(url))
    namespace = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    urls: list[str] = []
 
    for loc in root.findall(".//sm:loc", namespace):
        value = (loc.text or "").strip()
        if value.endswith(".xml"):
            urls.extend(sitemap_urls(value))
        else:
            urls.append(value)
 
    return urls
 
all_urls = sitemap_urls(SITEMAP_URL)
listing_urls = [url for url in all_urls if "/products/" in url or "/jobs/" in url]
print(f"Discovered {len(listing_urls)} candidate listing URLs")

Once you have the URL list, you can either scrape each URL manually or feed the best starting URL into Context.dev with a schema.

Manual Option B: Parse Repeating Containers

For a stable site, a selector scraper is straightforward. This example uses BeautifulSoup and keeps extraction defensive.

from dataclasses import dataclass
from urllib.parse import urljoin
import urllib.request
from bs4 import BeautifulSoup
 
BASE_URL = "https://example.com"
 
@dataclass
class Listing:
    title: str
    detail_url: str
    price: str
    image_url: str
 
def text_or_empty(node) -> str:
    return node.get_text(strip=True) if node else ""
 
def attr_or_empty(node, name: str) -> str:
    return node.get(name, "") if node else ""
 
def scrape_listing_page(url: str) -> list[Listing]:
    request = urllib.request.Request(url, headers={"User-Agent": "ListCrawler/1.0"})
    with urllib.request.urlopen(request, timeout=20) as response:
        html = response.read()
 
    soup = BeautifulSoup(html, "html.parser")
    records: list[Listing] = []
 
    for card in soup.select("article.product-card"):
        title = card.select_one(".product-title")
        link = card.select_one("a[href]")
        price = card.select_one(".price")
        image = card.select_one("img")
 
        href = attr_or_empty(link, "href")
        if not title or not href:
            continue

        # Join only when a src exists; urljoin(BASE_URL, "") would return BASE_URL.
        src = attr_or_empty(image, "src")

        records.append(
            Listing(
                title=text_or_empty(title),
                detail_url=urljoin(BASE_URL, href),
                price=text_or_empty(price),
                image_url=urljoin(BASE_URL, src) if src else "",
            )
        )
 
    return records

This code is intentionally plain. The important part is the defensive pattern: every optional node is checked, URLs are normalized with urljoin, and malformed cards are skipped instead of crashing the whole crawl.

Pagination Patterns

Pagination is where list crawlers become crawlers instead of one-page scrapers.

Numbered Pages

Numbered pages are the easiest case. Increment until no records are returned.

import time
 
def crawl_numbered_pages(base_url: str, max_pages: int = 100) -> list[Listing]:
    all_records: list[Listing] = []
 
    for page in range(1, max_pages + 1):
        url = f"{base_url}?page={page}"
        records = scrape_listing_page(url)
        if not records:
            break
 
        all_records.extend(records)
        time.sleep(1.0)
 
    return all_records

Next Links

Some sites expose a next link in the body or head. Follow it until it disappears.

from urllib.parse import urljoin

from bs4 import BeautifulSoup

def find_next_url(soup: BeautifulSoup, current_url: str) -> str | None:
    # rel="next" can appear on <a> in the body or on <link> in the head.
    next_link = soup.find(["a", "link"], rel="next")
    if not next_link:
        # Fall back to an anchor whose visible text contains "next".
        next_link = soup.find("a", string=lambda value: value and "next" in value.lower())

    href = next_link.get("href") if next_link else None
    return urljoin(current_url, href) if href else None

Load More and Infinite Scroll

For infinite scroll, open DevTools, click or scroll, and inspect the Network tab. Many sites load JSON from an internal endpoint. Calling that endpoint directly is usually cleaner than controlling a browser.

Use browser automation only after confirming there is no stable JSON endpoint. Browser automation should have explicit waits, screenshots during debugging, and strict max-page limits so it does not scroll forever.
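When an internal JSON endpoint does exist, the crawl loop is usually a bounded cursor walk. The sketch below assumes a response with "items" and "next_cursor" keys; both names are hypothetical and should be replaced with whatever the endpoint you find in DevTools actually returns. fetch_page is injected so the loop can be tested without network access.

```python
def crawl_json_endpoint(fetch_page, max_pages=50):
    """Collect items from a cursor-paginated JSON endpoint.

    fetch_page(cursor) must return a dict. The "items" and "next_cursor" keys
    are hypothetical; rename them to match the endpoint you find in DevTools.
    """
    items = []
    cursor = None
    seen_cursors = set()

    for _ in range(max_pages):
        page = fetch_page(cursor)
        items.extend(page.get("items", []))

        cursor = page.get("next_cursor")
        # Stop when the endpoint reports no further page or repeats a cursor.
        if not cursor or cursor in seen_cursors:
            break
        seen_cursors.add(cursor)

    return items


# Stand-in for a real HTTP call, keyed by cursor.
fake_pages = {
    None: {"items": ["a", "b"], "next_cursor": "p2"},
    "p2": {"items": ["c"], "next_cursor": None},
}
print(crawl_json_endpoint(lambda cursor: fake_pages[cursor]))
# ['a', 'b', 'c']
```

The max_pages bound and the repeated-cursor check serve the same purpose as the strict limits recommended for browser automation: the loop cannot run forever if the endpoint misbehaves.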

Detail-Page Enrichment

Do not fetch detail pages until after deduplication. Listing pages often repeat records across category pages, search pages, featured sections, and tag pages. Detail requests are more expensive and slower, so spend them only on unique records.

The usual pipeline is:

  1. Extract all index records.
  2. Normalize the detail URL.
  3. Deduplicate.
  4. Fetch details for unique records only.
  5. Merge details back onto the index record.
  6. Store the final record.

For Context.dev, you can often avoid this manual step by setting maxPages high enough and writing instructions that tell the API to use detail pages when needed. For manual crawlers, detail enrichment is a separate queue.
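For the manual case, the pipeline above can be sketched in a few lines. This is an illustration, not a full queue implementation: fetch_detail is a hypothetical callable standing in for a worker or HTTP request, and the records are assumed to be dicts with a detail_url key.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Lowercase scheme and host, drop query, fragment, and trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


def enrich_unique(index_records, fetch_detail):
    """Deduplicate index records by detail URL, then fetch details once each.

    fetch_detail(url) is a hypothetical callable returning a dict of extra
    fields; in a real crawler it would be a queue worker or HTTP call.
    """
    unique = {}
    for record in index_records:
        key = normalize(record["detail_url"])
        # Keep the first occurrence; later duplicates cost no detail request.
        unique.setdefault(key, {**record, "detail_url": key})

    # Merge detail fields onto each unique index record.
    return [{**record, **fetch_detail(key)} for key, record in unique.items()]
```

With three index records where two point at the same detail page, fetch_detail runs twice rather than three times, which is the saving the ordering is designed to produce.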

Deduplication Strategy

Pick a stable key before you store data.

| Record type | Good dedup key |
|---|---|
| Product | Canonical product URL or SKU |
| Job | Job URL or applicant tracking system ID |
| Location | Address normalized with city and postal code |
| Company | Domain |
| Article | Canonical URL |
| Event | Event URL plus start date |

For URLs, normalize before comparing:

from urllib.parse import urlsplit, urlunsplit
 
def canonical_url(url: str) -> str:
    parsed = urlsplit(url)
    path = parsed.path.rstrip("/") or "/"
    return urlunsplit((parsed.scheme.lower(), parsed.netloc.lower(), path, "", ""))

Do not use titles alone. Titles change, collide, and often omit enough context to be dangerous.
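Some sites identify a record through a query parameter, in which case stripping the whole query string would merge distinct records. A possible variant keeps an allow-list of identifying parameters and drops everything else. The function name canonical_url_keeping and the sku parameter are illustrative assumptions.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonical_url_keeping(url, keep_params=()):
    """Canonicalize a URL but keep query parameters that identify a record."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    # Sort so that parameter order does not produce different keys.
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query)
        if key in keep_params
    )
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(kept), "")
    )


print(canonical_url_keeping(
    "https://Shop.example/item/?utm_source=x&sku=42", keep_params=("sku",)
))
# https://shop.example/item?sku=42
```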

Storage Choices

Choose storage based on how the data will be consumed.

| Storage | Best for | Notes |
|---|---|---|
| CSV | Quick review, spreadsheet import | Flat fields only, weak for nested arrays |
| JSONL | Append-only pipelines, batch jobs | One JSON object per line, easy to stream |
| SQLite | Local analysis, small internal tools | Great for prototypes and audits |
| PostgreSQL | Production app data | Add unique constraints and ingestion metadata |
| DuckDB | Analytics over large snapshots | Excellent for local columnar querying |
| Object storage | Raw page/archive retention | Store raw inputs separately from parsed records |

Always store ingestion metadata: source URL, crawl timestamp, extraction version, schema version, and the URLs analyzed. That metadata is what lets you debug a bad record three months later.
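As a rough illustration, each record can be wrapped with that metadata before it is appended to a JSONL file. The "_ingestion" key, its field names, and the version strings below are arbitrary choices for this sketch.

```python
import json
from datetime import datetime, timezone


def to_jsonl_line(record, source_url, urls_analyzed, schema_version, extraction_version):
    """Serialize one record with ingestion metadata as a single JSONL line.

    The "_ingestion" key and its field names are illustrative choices.
    """
    envelope = {
        **record,
        "_ingestion": {
            "source_url": source_url,
            "crawled_at": datetime.now(timezone.utc).isoformat(),
            "schema_version": schema_version,
            "extraction_version": extraction_version,
            "urls_analyzed": list(urls_analyzed),
        },
    }
    # ensure_ascii=False keeps non-ASCII names readable in the file.
    return json.dumps(envelope, ensure_ascii=False, sort_keys=True)


line = to_jsonl_line(
    {"name": "Acme", "url": "https://acme.example"},
    source_url="https://example.com/directory",
    urls_analyzed=["https://example.com/directory"],
    schema_version="company-directory/1",
    extraction_version="v1",
)
print(line)
```

Nesting the metadata under one key keeps it out of the way of the extracted fields and makes it simple to strip when exporting records.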

Error Handling and Retries

For Context.dev, handle status codes deliberately:

  • 400: input validation or website access issue; inspect the error and change the request.
  • 401: missing or invalid API key.
  • 403: forbidden.
  • 408: request timeout; reduce scope or retry with a longer timeout.
  • 429: rate limited; back off.
  • 500: transient server issue; retry with exponential backoff.

For manual crawlers, also handle DNS errors, TLS errors, connection resets, malformed HTML, encoding problems, duplicate redirects, empty pages, and robots restrictions.

Use exponential backoff with jitter:

1s, 2s, 4s, 8s, 16s, capped at 60s, plus random jitter

Retries should be bounded. Infinite retry loops create duplicate work and can turn a temporary failure into an incident.
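The schedule above can be expressed as a bounded generator. This is a sketch: the parameter names and the choice of additive jitter (up to a fraction of each delay) are assumptions, and other jitter strategies are equally valid.

```python
import random


def backoff_delays(max_retries=6, base=1.0, cap=60.0, jitter=0.5, rng=random.random):
    """Yield one delay per retry: base * 2**n capped at cap, plus random jitter.

    jitter is the maximum extra fraction added to each delay, so with
    jitter=0.5 a 4s step becomes a value between 4s and 6s.
    """
    for attempt in range(max_retries):
        delay = min(cap, base * 2 ** attempt)
        yield delay + delay * jitter * rng()


# With jitter disabled the schedule matches the sequence above.
print(list(backoff_delays(max_retries=8, jitter=0.0)))
# [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

Because the generator is finite, a caller that sleeps once per yielded value and then gives up cannot retry forever.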

Politeness, Compliance, and Operational Safety

List crawling touches other people's infrastructure. Be deliberate.

  • Respect robots.txt and site terms where applicable.
  • Use a clear User-Agent for manual crawlers.
  • Keep concurrency low unless you have permission or an API contract.
  • Cache during development.
  • Do not collect personal data you do not need.
  • Prefer official APIs when the site provides them.
  • Log source URLs so records can be audited or deleted later.

Managed APIs reduce the crawling burden, but they do not remove your responsibility to use the data appropriately.

Production Architecture

A robust list-crawling pipeline usually has these components:

  1. Input queue: domains or starting URLs to process.
  2. Schema registry: versioned schemas for each use case.
  3. Extractor: Context.dev call or manual crawler.
  4. Validator: Zod, Pydantic, JSON Schema, or app-level validation.
  5. Normalizer: URL cleanup, currency parsing, date parsing, enum mapping.
  6. Deduper: stable key generation and uniqueness checks.
  7. Storage: database plus raw response archive when needed.
  8. Monitor: success rate, latency, failure codes, records per source.
  9. Review workflow: sample records for humans to inspect when schemas change.

The schema registry is worth calling out. If you change companies[].location from a free-text string to an object with city, region, and country, that is a versioned contract change. Treat it like an API migration.

Quality Checks

Before trusting a crawler, test it against real edge cases:

  • A page with zero records.
  • A page with one record.
  • A page with promoted or sponsored cards.
  • A page with missing optional fields.
  • A page with relative URLs.
  • A page with duplicated records.
  • A page requiring JavaScript rendering.
  • A page with pagination ending earlier than expected.
  • A page that redirects.

For Context.dev, add checks around metadata.numSucceeded, metadata.numFailed, and urls_analyzed. If you expected detail-page coverage and urls_analyzed only contains the starting page, adjust maxPages, maxDepth, or instructions.
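One way to automate those checks is a small function that turns a response into a list of warnings. The field names metadata.numSucceeded, metadata.numFailed, and urls_analyzed are taken from this guide; the function name and thresholds are arbitrary starting points.

```python
def coverage_warnings(result, min_pages=2, max_failure_ratio=0.5):
    """Return human-readable warnings about a structured extraction response.

    Reads metadata.numSucceeded, metadata.numFailed, and urls_analyzed as named
    in this guide; the thresholds are arbitrary starting points.
    """
    warnings = []
    metadata = result.get("metadata") or {}
    urls = result.get("urls_analyzed") or []
    succeeded = metadata.get("numSucceeded", 0)
    failed = metadata.get("numFailed", 0)

    if len(urls) < min_pages:
        warnings.append(
            f"only {len(urls)} page(s) analyzed; consider raising maxPages or maxDepth"
        )
    total = succeeded + failed
    if total and failed / total > max_failure_ratio:
        warnings.append(f"{failed} of {total} page fetches failed")
    return warnings


sample = {
    "urls_analyzed": ["https://example.com/jobs"],
    "metadata": {"numSucceeded": 1, "numFailed": 3},
}
for warning in coverage_warnings(sample):
    print(warning)
```

Logging these warnings next to the stored records gives the monitoring stage something concrete to alert on.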

Schema Variants for Common Listing Types

The company directory example above is only one shape. In practice, list crawling works best when every use case gets its own schema. Reusing one giant generic schema across products, jobs, locations, events, and directories sounds flexible, but it usually creates vague output. Build smaller schemas that reflect the record type you actually need.

For ecommerce category crawling, use a product-focused schema:

{
  "type": "object",
  "properties": {
    "collection_name": { "type": "string" },
    "products": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "brand": { "type": "string" },
          "price_text": { "type": "string" },
          "in_stock": { "type": "boolean" },
          "product_url": { "type": "string" }
        },
        "required": ["name", "brand", "price_text", "in_stock", "product_url"],
        "additionalProperties": false
      }
    }
  },
  "required": ["collection_name", "products"],
  "additionalProperties": false
}

For location crawlers, avoid treating addresses as one blob if you need search or routing later:

{
  "type": "object",
  "properties": {
    "locations": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "street": { "type": "string" },
          "city": { "type": "string" },
          "region": { "type": "string" },
          "postal_code": { "type": "string" },
          "phone": { "type": "string" },
          "url": { "type": "string" }
        },
        "required": ["name", "street", "city", "region", "postal_code", "phone", "url"],
        "additionalProperties": false
      }
    }
  },
  "required": ["locations"],
  "additionalProperties": false
}

For event calendars, make dates explicit strings and normalize them after extraction. Websites use too many date formats to assume every source will produce a clean timestamp on the first pass:

{
  "type": "object",
  "properties": {
    "events": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "title": { "type": "string" },
          "start_date_text": { "type": "string" },
          "venue": { "type": "string" },
          "city": { "type": "string" },
          "event_url": { "type": "string" }
        },
        "required": ["title", "start_date_text", "venue", "city", "event_url"],
        "additionalProperties": false
      }
    }
  },
  "required": ["events"],
  "additionalProperties": false
}

Each schema should mirror the first durable object in your application. If your application stores Product, extract products. If it stores OfficeLocation, extract office locations. If it stores HiringSignal, extract hiring signals instead of every detail from every job page. That keeps the crawl focused and makes the output easier to review.

Common Mistakes

The most common list-crawling mistakes are avoidable:

  • Asking for too many fields in the first schema.
  • Forgetting to include descriptions in JSON Schema properties.
  • Treating every page on a domain as equally relevant.
  • Fetching detail pages before deduplication.
  • Using title as the primary key.
  • Ignoring pagination termination conditions.
  • Running browser automation when a JSON endpoint exists.
  • Storing parsed records without source URLs.
  • Retrying 400-level validation errors.
  • Assuming the website will keep the same layout forever.

The fix is to design the ingestion boundary like any other production interface. Define the contract, validate inputs and outputs, log enough to debug, and keep each stage small.

Context.dev vs. Manual Crawling

Here is the pragmatic decision table:

| Need | Recommended approach |
|---|---|
| Extract typed data from arbitrary websites | Context.dev structured extraction |
| Build an AI agent that reads web pages | Context.dev |
| Crawl a site where you do not control markup | Context.dev |
| One stable internal site | Manual selector scraper |
| Massive known URL list with simple HTML | Manual crawler plus queue |
| JavaScript-heavy site with no API | Context.dev first, Playwright fallback |
| Strict typed TypeScript ingestion | Context.dev plus Zod validation |
| Compliance-sensitive facts | Context.dev with factCheck: true |

The bias here is deliberate: most teams underestimate crawler maintenance. Selectors drift, pagination changes, sites add bot protection, JavaScript frameworks change hydration behavior, and edge cases multiply. Context.dev lets you spend time on the data contract and product workflow instead of crawler plumbing.

One more operational difference is reviewability. A manual crawler often hides its decisions inside selectors and parsing branches. A Context.dev extraction keeps the important decisions in the request: the starting URL, schema, instructions, crawl limits, and fact-checking mode. Those are much easier to review in code review, much easier to version, and much easier for non-crawler engineers to understand. When an extraction changes, you can diff the schema and instructions instead of reverse-engineering why a CSS selector stopped matching. That makes structured extraction a better fit for teams where web data is a product feature rather than a dedicated scraping department.

FAQ

What is list crawling?

List crawling is extracting repeated structured records from index or listing pages. Examples include product grids, job boards, location finders, directories, article indexes, marketplace results, and event calendars.

Is list crawling the same as scraping?

List crawling is a specific scraping pattern. General scraping can target any page. List crawling focuses on repeated item containers across one or more listing pages, often followed by optional detail-page enrichment.

Can Context.dev crawl multiple pages for one extraction?

Yes. The structured extraction API starts from a URL, follows relevant internal links, and returns urls_analyzed plus crawl metadata. Use maxPages and maxDepth to control scope.

Why use JSON Schema?

JSON Schema makes the desired output explicit. It tells the extractor what fields matter and gives your application a stable response contract. In TypeScript, define the schema with Zod, convert it to JSON Schema, and validate the returned data with the same Zod model.

Should I use factCheck: true?

Use factCheck: true for factual extraction: prices, job titles, locations, product specs, legal clauses, contact information, and source-grounded records. Leave it false only when your use case explicitly allows inference.

How many pages should I crawl?

Start with the smallest value that works. Use 1 for a single listing page, 5 to 10 for a careers page with detail pages, and 25 to 50 for broader directory-style extraction. Watch urls_analyzed and metadata to see whether the API is spending pages usefully.

What if the site uses infinite scroll?

With Context.dev, start with the rendered page and a clear schema. For manual crawlers, inspect the Network tab for an underlying JSON API. If none exists, use browser automation with strict limits.

How do I avoid duplicates?

Use canonical detail URLs, SKUs, job IDs, or domain names as stable keys. Normalize URLs by removing query strings and trailing slashes unless those parts identify distinct records.

Conclusion

List crawling sounds like a small scraping task until you need it to be reliable. Then you need page discovery, pagination, rendering, extraction, validation, deduplication, retries, storage, and monitoring.

For new builds, start with Context.dev's structured website extraction API. Define the data you want as JSON Schema, tune maxPages and instructions, validate the returned data, and store it with source metadata. Use manual crawling when the target is stable, narrow, and worth owning. For everything else, let Context.dev handle the crawl and keep your engineering effort focused on the product that consumes the data.
