Web Scraping & Crawling

What is data extraction?

The process of pulling structured data out of unstructured or semi-structured sources like web pages, PDFs, or emails.

Also known as: information extraction, web data extraction

Data extraction sits between fetching a source and using its contents. The fetch produces raw HTML, a PDF, an email body; the extraction step isolates the values that matter (price, address, table rows) and emits them in a shape downstream code can consume. Historically this meant CSS selectors and regular expressions; increasingly it means an LLM call with a JSON schema.

Three flavors are common. Rule-based extraction uses hand-written selectors against a known template: fast and cheap, but brittle. ML-based extraction trains on labeled examples: robust to layout drift, but expensive to build. LLM-based extraction prompts a model with the page and a schema: expensive per call, but flexible enough to handle unknown layouts. Real systems mix all three.
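The rule-based flavor can be sketched in a few lines. This is a minimal example against a made-up product template, assuming the price always lives in an element with class "price"; real sites would need more defensive matching.

```python
from html.parser import HTMLParser

class PriceExtractor(HTMLParser):
    """Rule-based extraction: grab the text inside the element
    whose class attribute is exactly 'price' (a hypothetical,
    known template)."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.price = None

    def handle_starttag(self, tag, attrs):
        if ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price and self.price is None:
            self.price = data.strip()

    def handle_endtag(self, tag):
        self.in_price = False

p = PriceExtractor()
p.feed('<div class="product"><span class="price">$19.99</span></div>')
print(p.price)  # $19.99
```

The brittleness is visible in the code itself: rename the class or wrap the price in another span and the extractor silently returns None, which is exactly why monitoring matters.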

For brand intelligence, extraction is everywhere: pulling a logo URL out of <head> markup, recovering color tokens from an inline <style> block, finding a company's address on a contact page, parsing pricing tiers off a marketing page. The harder cases are the ones where the same fact is presented differently across sites, and the schema has to absorb that variance.
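The logo case from the paragraph above can be sketched with the standard library. This assumes the common conventions only: an og:image meta tag and a link rel containing "icon"; the HTML below is a made-up example, not a real page.

```python
from html.parser import HTMLParser

class LogoFinder(HTMLParser):
    """Pulls a candidate logo URL out of <head> markup:
    prefers og:image, falls back to any rel="...icon..." link."""
    def __init__(self):
        super().__init__()
        self.og_image = None
        self.icon = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("property") == "og:image":
            self.og_image = a.get("content")
        if tag == "link" and "icon" in (a.get("rel") or ""):
            self.icon = a.get("href")

    @property
    def logo(self):
        return self.og_image or self.icon

f = LogoFinder()
f.feed('<head><link rel="icon" href="/favicon.ico">'
       '<meta property="og:image" content="https://example.com/logo.png">'
       '</head>')
print(f.logo)  # https://example.com/logo.png
```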

In the wild

  • Extracting { price, currency, availability } from 100,000 product pages
  • Pulling pricing tiers from a marketing page using an LLM with a Zod schema
  • Recovering company address and phone number from contact pages with mixed layouts
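The second bullet, LLM extraction behind a schema, looks roughly like this. A minimal sketch: `call_llm` is a placeholder stub standing in for a real model call (so the example runs), and the key names and types in `SCHEMA` are illustrative assumptions, not a fixed format.

```python
import json

# Hypothetical schema for one pricing tier: field name -> expected type.
SCHEMA = {"tier": str, "monthly_price": float, "features": list}

def call_llm(prompt: str) -> str:
    # Placeholder for a real inference call; returns a canned
    # answer so the sketch is self-contained and runnable.
    return '{"tier": "Pro", "monthly_price": 29.0, "features": ["SSO", "API access"]}'

def extract_tier(page_text: str) -> dict:
    prompt = (
        "Extract the pricing tier from this page as JSON with keys "
        "tier (string), monthly_price (number), features (array):\n"
        + page_text
    )
    raw = json.loads(call_llm(prompt))
    # Validate against the schema instead of trusting the model.
    for key, typ in SCHEMA.items():
        if not isinstance(raw.get(key), typ):
            raise ValueError(f"field {key!r} failed schema check")
    return raw

print(extract_tier("Pro plan: $29/mo, includes SSO and API access")["tier"])  # Pro
```

In a TypeScript stack the validation step is what a Zod schema's parse call gives you for free; the point either way is that the model's output is checked, not assumed.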

How Brand.dev uses data extraction

Endpoints in the Brand.dev API where this concept comes up directly.

FAQ

Is data extraction the same as web scraping?

Scraping is the broader term and includes the network fetch. Extraction is specifically the parse-and-isolate step, which also applies to PDFs, emails, and screenshots.

When should I use an LLM for extraction?

When the input layout is variable enough that maintaining selectors costs more than the per-call inference. For one stable template at high volume, selectors are cheaper.
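That tradeoff reduces to simple arithmetic. Every number below is a made-up assumption for illustration; plug in your own maintenance cost and per-call price.

```python
# Break-even sketch with assumed numbers: selectors cost engineering
# time to maintain; LLM extraction costs per page processed.
maintenance_cost_per_month = 2000.0  # assumed: hours * rate to keep selectors working
llm_cost_per_page = 0.002            # assumed per-call inference cost in dollars

break_even_pages = maintenance_cost_per_month / llm_cost_per_page
print(int(break_even_pages))  # 1000000: below this monthly volume, the LLM wins
```

Under these assumptions, the LLM is cheaper until you cross a million pages a month on that source; past that, the selector's maintenance cost amortizes away.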

How do I keep extractors from breaking?

Anchor on stable attributes (microdata, JSON-LD, semantic tags), monitor extraction success rate per source, and fail loud on schema mismatch rather than silently emitting nulls.
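Anchoring on JSON-LD plus failing loud can be sketched together. The required-field set and the page snippet are hypothetical; the pattern is the point: raise on a missing block or missing keys rather than returning partial records.

```python
import json
import re

REQUIRED = {"name", "address", "telephone"}  # assumed required fields

def extract_org(html: str) -> dict:
    """Pull the first JSON-LD block and fail loud on schema
    mismatch instead of silently emitting nulls."""
    m = re.search(
        r'<script type="application/ld\+json">(.*?)</script>',
        html, re.DOTALL)
    if not m:
        raise ValueError("no JSON-LD block found")
    data = json.loads(m.group(1))
    missing = REQUIRED - data.keys()
    if missing:
        raise ValueError(f"schema mismatch, missing: {sorted(missing)}")
    return {k: data[k] for k in REQUIRED}

page = '''<script type="application/ld+json">
{"@type": "Organization", "name": "Acme",
 "address": "1 Main St", "telephone": "+1-555-0100"}
</script>'''
print(extract_org(page)["name"])  # Acme
```

A raised exception shows up in extraction-success-rate monitoring immediately; a silent null shows up weeks later as a data-quality ticket.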
