Every AI agent that reads the web does the same thing: fetch a page, clean it up so an LLM can actually use it, and pass it into a prompt. Competitor analysis bots, support agents that read docs, automated research tools: they all need this loop.
The loop is harder than it looks. Raw HTML is full of nav bars, cookie banners, tracking scripts, and layout markup that wastes tokens and confuses models. Headless browsers are slow and brittle. Anti-bot protections mean your scraper that worked yesterday will break tomorrow.
This tutorial builds an AI web research agent using Context.dev's Markdown API and OpenAI's GPT-4o. The core agent is about a dozen lines of code.
Why Markdown, not HTML or plain text
Before writing any code, it helps to understand why Markdown works better than the alternatives.
HTML is too noisy. A typical webpage is 80-90% structural markup, CSS classes, JavaScript, and metadata. Feeding that into an LLM wastes context window tokens on `<div class="container-fluid px-4 mt-3">` instead of actual content. Models also struggle to tell navigation apart from body text, or ads from articles.
Plain text loses too much. Strip all HTML and you lose headings, lists, tables, links, and code blocks: the structural cues that help LLMs understand how information is organized. A product comparison table flattened into a paragraph is much harder to reason over.
Markdown hits the middle ground. GitHub Flavored Markdown (GFM) keeps headings, lists, links, tables, and code blocks while stripping everything else. It's also token-efficient, typically 5-10x fewer tokens than the equivalent HTML.
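To make the size difference concrete, here's a toy illustration: the same two-row pricing table as typical rendered HTML and as GFM Markdown. The snippets are invented for illustration, but the ratio is representative of what content-heavy pages look like.

```typescript
// The same table as raw HTML and as GFM Markdown. Both carry identical
// information; the HTML spends most of its characters on markup.
const html = `<table class="table table-striped px-4">
  <thead><tr><th scope="col">Plan</th><th scope="col">Price</th></tr></thead>
  <tbody>
    <tr><td class="align-middle">Free</td><td class="align-middle">$0/mo</td></tr>
    <tr><td class="align-middle">Pro</td><td class="align-middle">$20/mo</td></tr>
  </tbody>
</table>`;

const markdown = `| Plan | Price |
| ---- | ----- |
| Free | $0/mo |
| Pro  | $20/mo |`;

// Character count is a rough proxy for token count: the Markdown version
// is a small fraction of the size of the HTML.
const ratio = html.length / markdown.length;
```

On real pages, with their full `<head>`, scripts, and nav chrome, the gap is far larger than in this stripped-down example.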
This is why most RAG pipelines and AI agents use Markdown as their web ingestion format. (For a broader look at the tools available, see our guide to the top 10 web scraping APIs for AI.)
The architecture
Every web research agent does three things:
- Scrape a webpage and convert it to clean Markdown
- Inject that Markdown into an LLM prompt as context
- Generate a response grounded in the page content
Step 1 is where things get painful. You need a headless browser to render JavaScript, proxy rotation to avoid blocks, content extraction logic to strip boilerplate, and HTML-to-Markdown conversion that preserves structure. That can easily be hundreds of lines of infrastructure code before you've touched the AI part.
Context.dev's Markdown API wraps all of that in a single API call. It handles JavaScript rendering, anti-bot bypass (96%+ success rate on protected sites), content extraction, and Markdown conversion. You send a URL, you get clean Markdown back.
Building the agent
Here's the complete agent in TypeScript:

```typescript
import ContextDev from 'context.dev';
import OpenAI from 'openai';

const ctx = new ContextDev({ apiKey: process.env.CONTEXT_DEV_API_KEY });
const openai = new OpenAI();

async function research(url: string, question: string) {
  const { markdown } = await ctx.web.webScrapeMd({ url });
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      { role: 'system', content: `Answer the question using this web page:\n\n${markdown}` },
      { role: 'user', content: question },
    ],
  });
  return response.choices[0].message.content;
}
```

That's it. Call `research('https://docs.stripe.com/api', 'How do I create a payment intent?')` and you get an answer based on the live page, not whatever the model memorized during training.
Here's what each step does:
- `ctx.web.webScrapeMd({ url })` sends the URL to Context.dev, which renders the page in a headless browser, strips navigation and boilerplate, and returns clean GFM Markdown.
- The Markdown is injected into the system prompt as context for the LLM.
- The LLM answers the user's question using the provided web content.
No Puppeteer. No proxy configuration. No HTML parsing. No content extraction heuristics.
Python version
Same thing in Python:
```python
from context.dev import ContextDev
from openai import OpenAI

ctx = ContextDev(api_key="YOUR_CONTEXT_DEV_API_KEY")
openai_client = OpenAI()

def research(url: str, question: str) -> str:
    page = ctx.web.web_scrape_md(url=url)
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer the question using this web page:\n\n{page.markdown}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```

Multi-page research
A single-page agent gets you surprisingly far, but real research usually spans multiple pages. Here's how to scrape an entire site and answer questions across all of it.
Context.dev's Sitemap API finds every page on a domain by crawling its sitemaps. Combine it with the Markdown API and you can ingest a whole site in a few lines:
```typescript
import ContextDev from 'context.dev';
import OpenAI from 'openai';

const ctx = new ContextDev({ apiKey: process.env.CONTEXT_DEV_API_KEY });
const openai = new OpenAI();

async function researchSite(domain: string, question: string) {
  // 1. Discover all pages on the site
  const { urls } = await ctx.web.webScrapeSitemap({ domain });

  // 2. Scrape the first 20 pages as Markdown
  const pages = await Promise.all(
    urls.slice(0, 20).map((url) => ctx.web.webScrapeMd({ url }))
  );

  // 3. Combine all content and ask the LLM
  const context = pages.map((p) => `## ${p.url}\n\n${p.markdown}`).join('\n\n---\n\n');
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      { role: 'system', content: `Answer using these web pages:\n\n${context}` },
      { role: 'user', content: question },
    ],
  });
  return response.choices[0].message.content;
}
```

Now you can call `researchSite('stripe.com', 'What payment methods does Stripe support in Europe?')` and get an answer synthesized from multiple pages across Stripe's documentation.
Tuning the output
The Markdown API has a few parameters worth knowing about:
includeLinks (default: true) controls whether hyperlinks come through as Markdown links ([text](url)) or get stripped to plain text. Keep them on when your agent needs to follow references or cite sources. Turn them off for RAG ingestion where links just add noise.
includeImages (default: false) controls whether image references are included. Useful for multimodal pipelines where you want to pass image URLs to a vision model alongside the text.
shortenBase64Images (default: true) truncates inline base64 image strings, which can be tens of thousands of characters long. Leave it on unless you actually need the raw image data.
```typescript
// For a RAG pipeline: strip links and images, minimize tokens
const { markdown } = await ctx.web.webScrapeMd({
  url: 'https://example.com/docs',
  includeLinks: false,
  includeImages: false,
});
```

```typescript
// For an agent that follows links and processes images
const { markdown } = await ctx.web.webScrapeMd({
  url: 'https://example.com/product',
  includeLinks: true,
  includeImages: true,
  shortenBase64Images: true,
});
```

Examples
Competitive intelligence
Monitor a competitor's pricing page and pull out what changed:
```typescript
async function analyzeCompetitor(domain: string) {
  const { urls } = await ctx.web.webScrapeSitemap({ domain });
  const pricingPage = urls.find((u) => u.includes('pricing'));
  if (pricingPage) {
    // research() scrapes the page and answers the question in one step
    return research(pricingPage, 'Summarize the pricing tiers, and list what each tier includes.');
  }
}
```

Documentation QA
Point a support bot at live docs instead of stale embeddings:
```typescript
async function answerFromDocs(docsUrl: string, userQuestion: string) {
  const { markdown } = await ctx.web.webScrapeMd({ url: docsUrl, includeLinks: true });
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'system',
        content: `You are a helpful support agent. Answer the user's question using the documentation below. If you reference a section, include the link.\n\n${markdown}`,
      },
      { role: 'user', content: userQuestion },
    ],
  });
  return response.choices[0].message.content;
}
```

Multi-source research
Pull content from several URLs and synthesize it:
```typescript
async function researchTopic(urls: string[], topic: string) {
  const pages = await Promise.all(urls.map((url) => ctx.web.webScrapeMd({ url })));
  const sources = pages.map((p) => `Source: ${p.url}\n${p.markdown}`).join('\n\n---\n\n');
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'system',
        content: `You are a research analyst. Synthesize the following sources into a comprehensive briefing on: ${topic}\n\n${sources}`,
      },
      { role: 'user', content: `Write a briefing with key findings, trends, and actionable takeaways.` },
    ],
  });
  return response.choices[0].message.content;
}
```

Why not just use Puppeteer or Playwright?
If you've built scrapers before, you know this is a fair question. For a prototype, Puppeteer works. At scale, the problems compound:
Running headless Chrome in production means managing memory leaks, zombie processes, and horizontal scaling. That's real DevOps work with no product payoff.
Modern sites use Cloudflare, DataDome, PerimeterX, and similar services to block bots. Getting past them requires rotating residential proxies, fingerprint spoofing, and constant maintenance as detection evolves. Context.dev handles all of this, with a 96%+ first-attempt success rate on protected sites.
Going from rendered HTML to clean Markdown is also harder than it looks. You have to identify the main content area, strip navigation and ads, preserve semantic structure, handle iframes and shadow DOM, and convert to proper Markdown. That's a lot of code for something that isn't your product.
And self-hosted scrapers break. A site redesign, a new bot detection vendor, a changed DOM structure: any of these will kill your scraper at 2 AM. With an API, that's someone else's 2 AM.
Scaling to production
A few things to think about as you move past the prototype stage:
Cache the Markdown output. If your agent hits the same URLs repeatedly, a 15-minute or 1-hour cache will cut your API calls significantly. Web content doesn't change that fast.
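A minimal in-memory sketch of such a cache, assuming a generic `fetcher` that stands in for whatever function wraps the Markdown API call:

```typescript
// Wrap a scraping function with a simple TTL cache keyed by URL.
// `fetcher` is any async function that returns Markdown for a URL.
type Fetcher = (url: string) => Promise<string>;

function cachedScraper(fetcher: Fetcher, ttlMs = 15 * 60 * 1000) {
  const cache = new Map<string, { markdown: string; expires: number }>();
  return async (url: string): Promise<string> => {
    const hit = cache.get(url);
    if (hit && hit.expires > Date.now()) return hit.markdown; // fresh hit: skip the API call
    const markdown = await fetcher(url);
    cache.set(url, { markdown, expires: Date.now() + ttlMs });
    return markdown;
  };
}
```

In production you'd likely swap the `Map` for Redis or similar so the cache is shared across workers, but the shape is the same.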
Chunk long pages. Context windows are large but not infinite. For very long pages, split the Markdown by heading and pass only the sections that matter. This helps both accuracy and cost.
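One way to sketch that split, taking advantage of the heading structure the Markdown API preserves (this helper is illustrative, not part of any SDK):

```typescript
// Split a Markdown document into sections at each level-2 heading.
// The resulting sections can be filtered (e.g. by keyword match against
// the question) before being passed to the LLM.
function splitByHeading(markdown: string): string[] {
  const sections: string[] = [];
  let current: string[] = [];
  for (const line of markdown.split('\n')) {
    if (line.startsWith('## ') && current.length > 0) {
      sections.push(current.join('\n').trim());
      current = [];
    }
    current.push(line);
  }
  if (current.length > 0) sections.push(current.join('\n').trim());
  return sections.filter((s) => s.length > 0);
}
```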
Scrape in parallel. Use Promise.all (or asyncio.gather in Python) to fetch multiple pages concurrently. Context.dev handles thousands of concurrent requests, so the bottleneck is usually the LLM, not the scraping.
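For large URL lists, an unbounded `Promise.all` fires every request at once; a small worker-pool helper (a generic sketch, not tied to any SDK) caps how many are in flight:

```typescript
// Map over items with at most `limit` tasks in flight at a time.
async function mapWithLimit<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker() {
    // Each worker pulls the next unclaimed index until the list is drained.
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}
```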
Expect some pages to fail. Authentication, paywalls, downtime: there are plenty of reasons a scrape won't work. Your agent should report which sources it couldn't access rather than crashing.
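`Promise.allSettled` makes this pattern easy to express; here's a sketch where `fetcher` stands in for the actual scrape call:

```typescript
// Scrape many URLs, collecting successes and failures separately so one
// blocked or down page doesn't sink the whole research run.
async function scrapeAll(
  urls: string[],
  fetcher: (url: string) => Promise<string>,
) {
  const settled = await Promise.allSettled(urls.map((url) => fetcher(url)));
  const pages: { url: string; markdown: string }[] = [];
  const failed: string[] = [];
  settled.forEach((result, i) => {
    if (result.status === 'fulfilled') {
      pages.push({ url: urls[i], markdown: result.value });
    } else {
      failed.push(urls[i]); // surface these to the user instead of crashing
    }
  });
  return { pages, failed };
}
```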
Getting started
- Sign up for a free Context.dev account at context.dev/signup
- Install the SDK: `npm install context.dev` or `pip install context-dev`
- Grab your API key from the dashboard
- Copy the dozen-line agent above and start experimenting
The Markdown API docs cover parameters, response formats, and rate limits. The Sitemap API docs cover the multi-page endpoint.
The whole scraping-to-LLM pipeline doesn't need to be a multi-week project. Get the data layer working in an afternoon and spend the rest of your time on the parts that matter.