RAG Pipeline Web Scraping

Build your RAG pipeline with live web data. Scrape any URL to HTML, Markdown, or structured data with a single API call.

Trusted by Passionfroot, Daydream, Kovai, Orange, SendX, Klarna, and Super.com

How it works

From raw web data to grounded AI in three steps

Request

Scrape web context

Pass any URL or domain to retrieve HTML, Markdown, images, or a full sitemap in one API call.

const data = await client.web.scrape({
  url: "example.com"
})
Context

Inject into your LLM prompt

Include the scraped content in your system prompt or context window alongside your user query.

messages: [{
role: "system",
content: data.markdown
}]
Output

Get grounded responses

Your model answers with real web context — no hallucinations about outdated or missing information.

// Grounded in real data
"Revenue grew 23% YoY
to $4.2B, driven by
cloud adoption..."
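The three steps above can be sketched end to end. The `client.web.scrape` call and `data.markdown` field follow the snippets shown; the `ChatMessage` shape and `buildGroundedPrompt` helper are illustrative assumptions modeled on common chat-completion APIs, not part of the documented SDK.

```typescript
// Hypothetical message shape, mirroring common chat-completion APIs.
type ChatMessage = { role: "system" | "user"; content: string };

// Build a grounded prompt: scraped Markdown becomes the system context,
// the user's question becomes the user turn.
function buildGroundedPrompt(scrapedMarkdown: string, question: string): ChatMessage[] {
  return [
    {
      role: "system",
      content: `Answer using only this web context:\n\n${scrapedMarkdown}`,
    },
    { role: "user", content: question },
  ];
}

// Wiring it to step 1 (client is assumed to exist):
// const data = await client.web.scrape({ url: "example.com" });
// const messages = buildGroundedPrompt(data.markdown, "What does this company do?");
```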

Endpoints

Four ways to extract web content

GET /v1/web/scrape/html

Raw HTML

Full HTML source of any URL. Feed it into a custom parser or extract specific DOM elements before handing off to your model.

DOM parsing · Structured extraction · Custom pipelines
API docs
GET /v1/web/scrape/markdown

Markdown

Clean GitHub Flavored Markdown. Strips noise, preserves semantic structure. Cuts token usage dramatically versus raw HTML.

LLM context · RAG pipelines · Knowledge bases
API docs
GET /v1/web/scrape/images

Images

Every image from a page — img tags, inline SVGs, base64 URIs, picture elements, video posters. Returns src, format, and alt text.

Multimodal AI · Visual search · Image indexing
API docs
GET /v1/web/scrape/sitemap

Sitemap

Discover up to 500 URLs by crawling sitemaps recursively. Build a URL list before batch-scraping a site into your vector store.

Batch indexing · Content discovery · RAG ingestion
API docs
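A common pattern is to combine the Sitemap and Markdown endpoints: discover URLs, then scrape them in bounded batches. The concurrency runner below is a generic sketch; the fetcher is injected, so it works with any of the endpoints above (or a test stub), and nothing here is part of the documented SDK.

```typescript
// Scrape a list of URLs (e.g. from the Sitemap endpoint) while keeping
// at most `concurrency` requests in flight. Results keep input order.
async function batchScrape<T>(
  urls: string[],
  fetcher: (url: string) => Promise<T>,
  concurrency = 5,
): Promise<T[]> {
  const results: T[] = new Array(urls.length);
  let next = 0;
  // Each worker pulls the next unclaimed index until the list is drained.
  async function worker(): Promise<void> {
    while (next < urls.length) {
      const i = next++;
      results[i] = await fetcher(urls[i]);
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(concurrency, urls.length) }, worker),
  );
  return results;
}
```

With a real client, the fetcher would be something like `(url) => client.web.scrape({ url })`, and each result would be embedded into your vector store.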

Built for

Teams that need live web data in their AI stack

RAG Pipeline Engineers

Challenge

Keeping LLM knowledge current requires constant re-indexing. Models hallucinate when their training data is outdated relative to live web content.

Solution

Use the Markdown endpoint to fetch and embed fresh web pages on demand. Combine with the Sitemap endpoint to bulk-ingest entire domains into your vector store.

Impact

Always-current RAG context without manual scraping infrastructure.

AI Agent Developers

Challenge

Agents need to read web pages, but running a headless browser per agent is expensive, slow, and hard to scale.

Solution

Replace headless browser calls with single API requests. Get HTML or Markdown with automatic proxy escalation built in.

Impact

Agents that browse the web reliably at scale with zero browser infra.

AI Search Products

Challenge

Answering questions about current events or specific URLs requires live web access. Training data cutoffs leave models unable to answer about recent content.

Solution

Scrape the relevant URL to Markdown and inject it as context before generating an answer.

Impact

Search results grounded in real, current content users can trust.

Competitive Intelligence

Challenge

Tracking competitor websites for pricing, product, and messaging changes requires scrapers that constantly break.

Solution

Use the HTML or Markdown endpoint to snapshot competitor pages on a schedule. Diff the output and feed changes to an LLM for analysis.

Impact

Know when a competitor changes pricing or messaging within hours.
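The snapshot-and-diff step can be sketched with a minimal line-level comparison. This is an illustration only: a production pipeline would use a proper diff library, and the function below is not part of the API.

```typescript
// Compare two page snapshots (e.g. Markdown scraped on consecutive days)
// and report which lines appeared or disappeared. Feed the result to an
// LLM for change analysis.
function diffSnapshots(
  before: string,
  after: string,
): { added: string[]; removed: string[] } {
  const beforeLines = new Set(before.split("\n"));
  const afterLines = new Set(after.split("\n"));
  return {
    added: [...afterLines].filter((l) => !beforeLines.has(l)),
    removed: [...beforeLines].filter((l) => !afterLines.has(l)),
  };
}
```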

Knowledge Base Tools

Challenge

Users want to import web pages into their knowledge bases, but HTML is noisy and hard to chunk correctly for embedding.

Solution

Use the Markdown endpoint for clean, well-structured content. Use Sitemap to discover all pages in a documentation site for bulk import.

Impact

Faster, cleaner knowledge base ingestion with better retrieval.
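Because the Markdown endpoint preserves heading structure, a simple pre-embedding step is to chunk at heading boundaries so each section stays with its title. The chunking strategy below is an assumption about a typical ingestion pipeline, not part of the API.

```typescript
// Split clean Markdown into chunks at heading boundaries, keeping each
// heading with the body text that follows it.
function chunkByHeadings(markdown: string): string[] {
  const chunks: string[] = [];
  let current: string[] = [];
  for (const line of markdown.split("\n")) {
    // A new heading (#, ##, ... ######) starts a new chunk.
    if (/^#{1,6} /.test(line) && current.length > 0) {
      chunks.push(current.join("\n").trim());
      current = [];
    }
    current.push(line);
  }
  if (current.length > 0) chunks.push(current.join("\n").trim());
  return chunks.filter((c) => c.length > 0);
}
```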

LLM Evaluation Teams

Challenge

Evaluating how well a model handles real-world web content requires a reliable way to fetch diverse, current web pages at scale.

Solution

Programmatically scrape a wide range of pages to build evaluation datasets grounded in real, current web content.

Impact

Richer, more realistic model evaluations against the actual web.

Content Automation

Challenge

Content teams want AI that can research topics from the live web before writing, but connecting models to live data requires custom infrastructure.

Solution

Fetch relevant pages as Markdown and inject into the generation prompt. Your content AI researches and cites live sources automatically.

Impact

AI-written content grounded in current, verifiable sources.

Multimodal AI Apps

Challenge

Extracting images from web pages to feed into vision models requires parsing complex HTML, handling SVGs, base64 URIs, and multiple image types.

Solution

Use the Images endpoint to extract every image from any URL in a structured array, classified by type and element source.

Impact

Reliable image extraction without brittle custom scrapers.
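Given the fields the Images endpoint is described as returning (src, format, alt text), a downstream step might filter the array before sending images to a vision model. The field names and filtering rules below are assumptions for illustration, not the documented response schema.

```typescript
// Shape modeled on the Images endpoint description: src, format, alt text.
type ScrapedImage = { src: string; format: string; alt: string };

// Keep raster images a vision model can consume; drop SVGs and
// inline base64 data URIs (often icons or tracking pixels).
function selectForVisionModel(images: ScrapedImage[]): ScrapedImage[] {
  const rasterFormats = new Set(["jpeg", "jpg", "png", "webp", "gif"]);
  return images.filter(
    (img) =>
      rasterFormats.has(img.format.toLowerCase()) &&
      !img.src.startsWith("data:"),
  );
}
```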

Context at scale

Join 5,000+ businesses using Context.dev to enrich their products with structured web data.