Web Scraping & Crawling
What is data extraction?
The process of pulling structured data out of unstructured or semi-structured sources like web pages, PDFs, or emails.
Also known as: information extraction, web data extraction
Data extraction sits between fetching a source and using its contents. The fetch produces raw HTML, a PDF, an email body; the extraction step isolates the values that matter (price, address, table rows) and emits them in a shape downstream code can consume. Historically this meant CSS selectors and regular expressions; increasingly it means an LLM call with a JSON schema.
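The selector-and-regex flavor can be sketched in a few lines. This is illustrative only: the markup shape, field names, and currency mapping are assumptions, and production code would use a real HTML parser rather than a regex.

```typescript
// Minimal rule-based extractor: isolate a price from known markup.
// Regex-based for brevity; a real pipeline would parse the DOM.
interface PriceRecord {
  price: number;
  currency: string;
}

function extractPrice(html: string): PriceRecord | null {
  // Matches markup like <span class="price">$19.99</span>
  const m = html.match(/class="price"[^>]*>\s*([$€£])\s*([\d.,]+)/);
  if (m === null) return null;
  const symbols: Record<string, string> = { $: "USD", "€": "EUR", "£": "GBP" };
  return {
    price: parseFloat(m[2].replace(",", "")),
    currency: symbols[m[1]] ?? "USD",
  };
}
```

The point is the shape of the step: raw fetched HTML goes in, a typed record (or `null` on a miss) comes out, and nothing downstream ever touches markup.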
Three flavors are common. Rule-based extraction uses hand-written selectors against a known template: fast and cheap, but brittle. ML-based extraction trains on labeled examples: robust to layout drift, but expensive to build. LLM-based extraction prompts a model with the page and a schema: expensive per call, but flexible enough to handle unknown layouts. Real systems mix all three.
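One common way to mix flavors is a cascade: try the cheap rules first and fall back to a model only when they miss. A hedged sketch, where `llmExtract` is a stub standing in for a real per-call-priced model (not an actual API):

```typescript
// Cascade: selector-based rules first, LLM fallback on a miss.
type Extractor = (html: string) => string | null;

const ruleBased: Extractor = (html) => {
  // Anchor on a stable microdata attribute rather than layout classes.
  const m = html.match(/itemprop="name"[^>]*>([^<]+)</);
  return m ? m[1].trim() : null;
};

// Placeholder for a real model call; assumed, not a real API.
const llmExtract: Extractor = (_html) => "Fallback Name";

function extractName(html: string): { value: string | null; via: string } {
  const ruled = ruleBased(html);
  if (ruled !== null) return { value: ruled, via: "rules" };
  return { value: llmExtract(html), via: "llm" };
}
```

Tracking which path produced each record (the `via` field here) also tells you when a template has drifted and the expensive path is quietly taking over.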
For brand intelligence, extraction is everywhere: pulling a logo URL out of `<head>` markup, recovering color tokens from an inline `<style>` block, finding a company's address on a contact page, parsing pricing tiers off a marketing page. The harder cases are the ones where the same fact is presented differently across sites, and the schema has to absorb that variance.
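The logo case can be handled with a small priority list over common `<head>` locations. A sketch under stated assumptions: it checks only two patterns, assumes attribute order within each tag, and uses regex where production code would parse the DOM.

```typescript
// Recover a logo/icon URL from <head> markup by checking common
// locations in priority order: og:image first, then icon links.
function extractLogoUrl(headHtml: string): string | null {
  const patterns = [
    /<meta[^>]+property="og:image"[^>]+content="([^"]+)"/,
    /<link[^>]+rel="(?:icon|apple-touch-icon)"[^>]+href="([^"]+)"/,
  ];
  for (const p of patterns) {
    const m = headHtml.match(p);
    if (m !== null) return m[1];
  }
  return null;
}
```

The priority ordering is the schema absorbing variance: different sites expose the same fact (a logo) through different markup, and the extractor ranks the candidates.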
In the wild
- Extracting `{ price, currency, availability }` from 100,000 product pages
- Pulling pricing tiers from a marketing page using an LLM with a Zod schema
- Recovering company address and phone number from contact pages with mixed layouts
How Brand.dev uses data extraction
Endpoints in the Brand.dev API where this concept comes up directly.
FAQ
Is data extraction the same as web scraping?
Scraping is the broader term and includes the network fetch. Extraction is specifically the parse-and-isolate step, which also applies to PDFs, emails, and screenshots.
When should I use an LLM for extraction?
When the input layout is variable enough that maintaining selectors costs more than the per-call inference. For one stable template at high volume, selectors are cheaper.
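That trade-off is just arithmetic. A back-of-envelope crossover model, where every number below is a hypothetical placeholder rather than a real price:

```typescript
// Compare monthly cost of LLM extraction vs selector maintenance.
// All inputs are hypothetical; plug in your own numbers.
function monthlyCost(opts: {
  pagesPerMonth: number;
  llmCostPerCall: number; // e.g. $0.002 per page
  selectorMaintHoursPerMonth: number;
  hourlyRate: number;
}): { llm: number; rules: number } {
  return {
    llm: opts.pagesPerMonth * opts.llmCostPerCall,
    rules: opts.selectorMaintHoursPerMonth * opts.hourlyRate,
  };
}

// One stable template at high volume: rules win comfortably.
const highVolume = monthlyCost({
  pagesPerMonth: 1_000_000,
  llmCostPerCall: 0.002,
  selectorMaintHoursPerMonth: 2,
  hourlyRate: 150,
});
```

With these placeholder numbers, a million LLM calls cost roughly $2,000 against $300 of selector upkeep; invert the volume or multiply the templates and the comparison flips.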
How do I keep extractors from breaking?
Anchor on stable attributes (microdata, JSON-LD, semantic tags), monitor extraction success rate per source, and fail loud on schema mismatch rather than silently emitting nulls.
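Failing loud and tracking success rate can both live in one thin validation layer. A minimal sketch, with assumed field names and an in-memory counter where real systems would use a metrics backend:

```typescript
// Fail-loud validation: throw on schema mismatch instead of emitting
// nulls, and count per-source success/failure so drift is visible.
interface Contact {
  address: string;
  phone: string;
}

class SchemaMismatchError extends Error {}

const stats = new Map<string, { ok: number; fail: number }>();

function recordResult(source: string, ok: boolean): void {
  const s = stats.get(source) ?? { ok: 0, fail: 0 };
  if (ok) s.ok++;
  else s.fail++;
  stats.set(source, s);
}

function validateContact(source: string, raw: Partial<Contact>): Contact {
  if (typeof raw.address !== "string" || typeof raw.phone !== "string") {
    recordResult(source, false);
    throw new SchemaMismatchError(`schema mismatch for ${source}`);
  }
  recordResult(source, true);
  return raw as Contact;
}
```

The counter is the monitoring hook: a per-source failure rate that climbs from 0% to 40% overnight is a template change, and the thrown error means no downstream table quietly fills with nulls in the meantime.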
Related terms
- Web scraping — Programmatically extracting structured data from websites that were designed to be read by humans.
- Web crawler — A program that systematically follows links between web pages to discover and index content at scale.
- CSS selector — A pattern that identifies elements in an HTML document by tag, class, id, attribute, or position, used by stylesheets and (heavily) by web scrapers.
- XPath — A query language for selecting nodes in an XML or HTML document using path expressions, widely used by scrapers when CSS selectors are not expressive enough.
- JSON — JavaScript Object Notation, a lightweight text format for representing structured data, supported natively by every modern language.