How to Extract Raw HTML from Any URL with a Single API Call

Sometimes you just need the HTML.

Not parsed Markdown or a structured JSON extraction or an AI summary. Just the raw, fully-rendered HTML of a webpage, the same DOM a real browser sees after JavaScript execution, redirects, and client-side rendering.

You might be feeding it into a custom parser, diffing two versions of a page, or building a monitoring tool that watches for DOM changes. Whatever the reason, "give me the HTML at this URL" sounds simple, but it's way more painful to do reliably than it should be.

This guide covers the practical ways to extract raw HTML from a URL, the problems you'll hit doing it yourself, and how a single API call can replace a lot of infrastructure code.

Why "Just Fetch the HTML" Is Harder Than It Sounds

If you've ever written fetch(url) or requests.get(url) and called it a day, you've probably already been burned. The HTML you get back from a simple HTTP request is often not what you see in a browser. Here's why.

JavaScript-Rendered Pages

The modern web runs on client-side JavaScript. React, Next.js, Vue, Angular — these frameworks render content after the initial page load. A basic HTTP request returns the bare shell:

<!DOCTYPE html>
<html>
	<body>
		<div id="root"></div>
		<script src="/bundle.js"></script>
	</body>
</html>

That's not the HTML you wanted. The actual content — product listings, article text, pricing tables — is injected by JavaScript after the page loads. To get the real HTML, you need a headless browser that executes JavaScript and waits for the DOM to stabilize.
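You can spot this failure mode programmatically. Here's a rough heuristic (a sketch, not a robust detector — the 200-character threshold is an arbitrary assumption): if a fetched page has almost no visible text but loads an external script bundle, its content is probably rendered client-side.

```python
import re

def visible_text(html: str) -> str:
    # Crude tag stripping -- fine for a heuristic, not for real parsing
    html = re.sub(r"<(script|style)[^>]*>.*?</\1>", "", html, flags=re.S | re.I)
    return re.sub(r"<[^>]+>", " ", html)

def looks_client_rendered(html: str) -> bool:
    # Almost no text, but at least one external script bundle: likely a shell
    has_bundle = re.search(r"<script[^>]+src=", html, flags=re.I) is not None
    return len(visible_text(html).strip()) < 200 and has_bundle

shell = '<div id="root"></div><script src="/bundle.js"></script>'
print(looks_client_rendered(shell))  # True: this is the bare shell above
```

If this returns True for a page you care about, a plain HTTP fetch won't be enough — you need a renderer.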

Anti-Bot Protections

Cloudflare, Akamai, PerimeterX, DataDome — a growing number of sites use bot detection that blocks automated requests outright. Your curl command returns a CAPTCHA page or a 403. Your Python script gets rate-limited after three requests.

Getting past these protections requires rotating proxies, browser fingerprinting, TLS fingerprint matching, and sometimes residential IP addresses. This isn't a weekend project. It's ongoing infrastructure.

Redirects, Cookies, and Authentication Walls

Many URLs don't serve content directly. They redirect through chains of shortened URLs, tracking pixels, and consent gates. Some pages require cookies from a previous visit. Others sit behind login walls or geo-restricted CDNs.

A simple GET request doesn't handle any of this. A headless browser with proper session management does.
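What a browser does silently here is non-trivial. A sketch of just the redirect-following part, with the HTTP call injected as a callable so the logic is visible (and testable) without a network — in real use, `fetch` would wrap an HTTP client that also carries cookies between hops:

```python
from urllib.parse import urljoin

def resolve_redirects(url, fetch, max_hops=10):
    """Follow 3xx responses to a final URL, recording every hop.

    `fetch` is any callable returning (status_code, location_header).
    """
    chain = [url]
    for _ in range(max_hops):
        status, location = fetch(url)
        if status not in (301, 302, 303, 307, 308) or not location:
            return url, chain
        url = urljoin(url, location)  # Location headers may be relative
        chain.append(url)
    raise RuntimeError(f"Redirect chain exceeded {max_hops} hops")

# Simulated chain: short link -> tracking hop -> final article
hops = {
    "https://sho.rt/abc": (302, "https://t.example.com/click?x=1"),
    "https://t.example.com/click?x=1": (301, "/article"),
    "https://t.example.com/article": (200, None),
}
final, chain = resolve_redirects("https://sho.rt/abc", lambda u: hops[u])
print(final)  # https://t.example.com/article
```

And that's before you account for cookie persistence, consent gates, and geo-restrictions, which need real browser state.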

Dynamic Content and Lazy Loading

Even with JavaScript execution, some content only appears after scrolling, clicking, or waiting for async API calls to resolve. Images lazy-load. Infinite scroll pages only render the first batch. Tabs hide content until clicked.

Getting the "full" HTML means knowing when to stop waiting — and that's harder than it sounds.
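One common heuristic is a "quiet period": keep re-snapshotting the page (its HTML, or a hash of it) until several consecutive snapshots come back identical. A sketch with the snapshot function injected; the interval and threshold values are arbitrary assumptions:

```python
import time

def wait_until_stable(snapshot, interval=0.5, quiet_checks=3, timeout=15.0):
    """Poll `snapshot()` until it returns the same value `quiet_checks`
    times in a row, or give up when `timeout` expires."""
    deadline = time.monotonic() + timeout
    last, streak = None, 0
    while time.monotonic() < deadline:
        current = snapshot()
        streak = streak + 1 if current == last else 1
        last = current
        if streak >= quiet_checks:
            return current
        time.sleep(interval)
    return last  # best effort: return whatever we saw last

# Simulate a DOM that grows for a few ticks, then settles
states = iter(["<div>", "<div><ul>", "<div><ul><li>",
               "<div><ul><li>", "<div><ul><li>"])
html = wait_until_stable(lambda: next(states), interval=0)
print(html)  # <div><ul><li>
```

Even this misses content gated behind clicks or scroll events, which is why "full HTML" is always a judgment call.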

The DIY Approach: Headless Browsers

The standard solution for developers who need rendered HTML is to run a headless browser — typically Puppeteer (Chrome) or Playwright (Chrome, Firefox, or WebKit).

Here's a minimal Puppeteer example:

import puppeteer from 'puppeteer';
 
const extractHtml = async (url: string): Promise<string> => {
	const browser = await puppeteer.launch({ headless: true });
	const page = await browser.newPage();
 
	await page.goto(url, { waitUntil: 'networkidle2' });
	const html = await page.content();
 
	await browser.close();
	return html;
};
 
const html = await extractHtml('https://example.com');
console.log(html);

This works for a demo. It falls apart in production:

Resource overhead. Each browser instance consumes 100-300MB of RAM. If you're extracting HTML from 50 URLs concurrently, you need serious compute. Chrome processes leak memory over time, so you need restart logic.

Startup latency. Launching a browser takes 1-3 seconds. Navigating to a page, waiting for JavaScript to execute, and waiting for network idle adds another 3-10 seconds. For a "just get the HTML" use case, this is painfully slow.

Anti-bot detection. Headless Chrome is detectable. Sites check navigator.webdriver, the Chrome DevTools protocol fingerprint, and dozens of other signals. You need puppeteer-extra-plugin-stealth and constant updates to stay ahead.

Infrastructure management. You need to keep Chrome/Chromium updated, manage browser pools, handle crashes and timeouts, and set up retry logic. On serverless platforms, you need special Chrome builds (chrome-aws-lambda) because the standard binary is too large.

Proxy management. If you're scraping at any scale, you need rotating proxies to avoid IP bans. That means proxy providers, rotation logic, error handling for dead proxies, and billing management.

All of this to answer the question: "What's the HTML at this URL?"

The API Approach: One Request, Done

The alternative is to let someone else manage the headless browsers, proxy infrastructure, and anti-bot bypasses, and just call an API. You send a URL, it renders the page in a real browser, deals with anti-bot protections, and hands back the complete HTML. One request, one response.

Here's what that looks like with Context.dev's HTML scraping API:

cURL

curl -X GET "https://api.context.dev/v1/web/scrape/html?url=https://example.com" \
  -H "Authorization: Bearer YOUR_API_KEY"

That's it. One request. The response:

{
	"success": true,
	"url": "https://example.com",
	"html": "<!DOCTYPE html><html>...</html>"
}

The url field returns the resolved URL after any redirects, so you always know where the content actually came from.

Python

Using the Context.dev Python SDK:

from context.dev import ContextDev
 
client = ContextDev(api_key="YOUR_API_KEY")
 
response = client.web.web_scrape_html(url="https://example.com")
print(response.html)

A few lines of code. No browser management, no proxy rotation, no stealth plugins.

JavaScript

const { ContextDev } = require('context.dev');
 
const client = new ContextDev({ apiKey: 'YOUR_API_KEY' });
 
const response = await client.web.webScrapeHTML({
	url: 'https://example.com',
});
console.log(response.html);

TypeScript

import { ContextDev } from 'context.dev';
 
const client = new ContextDev({ apiKey: 'YOUR_API_KEY' });
 
const scrapeHtml = async (url: string) => {
	const response = await client.web.webScrapeHTML({ url });
	return response.html;
};
 
const html = await scrapeHtml('https://example.com');

Ruby

require "context_dev"
 
client = ContextDev::Client.new(api_key: "YOUR_API_KEY")
 
response = client.web.web_scrape_html(url: "https://example.com")
puts response.html

Using Axios (No SDK)

If you prefer not to use an SDK, the REST API works with any HTTP client:

import axios from 'axios';
 
const scrapeHtml = async (url: string) => {
	const response = await axios.get('https://api.context.dev/v1/web/scrape/html', {
		params: { url },
		headers: { Authorization: 'Bearer YOUR_API_KEY' },
	});
	return response.data.html;
};

What Happens Behind the Scenes

When you call the HTML endpoint, Context.dev handles the full rendering pipeline.

The URL gets loaded in a headless browser that executes all JavaScript and waits for the DOM to stabilize, so you get the rendered HTML, not the pre-render source. If a request is blocked, the system escalates through proxy tiers (datacenter, residential, mobile) until it gets through. All redirects are followed, and the final resolved URL comes back alongside the HTML so you know what page was actually rendered. Browser fingerprinting, TLS fingerprint matching, and session management are all automatic — no stealth plugins to configure on your end.

The result is the same HTML you'd see if you opened Chrome DevTools and copied the <html> element, but delivered via a GET request.

Common Use Cases for Raw HTML Extraction

Custom Parsing Pipelines

If you're building a scraper for a specific data format — say, extracting structured product data from e-commerce sites or pulling financial data from SEC filings — you need the raw HTML to run your own parsing logic. CSS selectors, XPath queries, and regex patterns all operate on HTML. An API that returns clean HTML lets you skip the browser infrastructure and focus on the parsing.
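With the rendered HTML in hand, the parsing step is yours. A minimal sketch using only the standard library (real pipelines typically reach for BeautifulSoup or lxml instead); the markup and `class="price"` selector are invented for illustration:

```python
from html.parser import HTMLParser

class PriceExtractor(HTMLParser):
    """Collect the text of every element with class="price"."""
    def __init__(self):
        super().__init__()
        self._capturing = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if ("class", "price") in attrs:
            self._capturing = True

    def handle_endtag(self, tag):
        self._capturing = False

    def handle_data(self, data):
        if self._capturing:
            self.prices.append(data.strip())

html = '<ul><li class="item">Pro <span class="price">$49/mo</span></li>' \
       '<li class="item">Team <span class="price">$99/mo</span></li></ul>'
parser = PriceExtractor()
parser.feed(html)
print(parser.prices)  # ['$49/mo', '$99/mo']
```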

DOM Diffing and Change Detection

Monitoring websites for changes — price drops, content updates, new job postings — requires comparing HTML snapshots over time. You need consistent, fully-rendered HTML for diffs to be meaningful. If your scraper sometimes returns JavaScript-rendered content and sometimes returns the raw shell, your diffs are useless.
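A sketch of the comparison step, assuming consistent rendered HTML as input. Normalizing whitespace first keeps formatting-only changes from registering as diffs:

```python
import difflib
import hashlib
import re

def normalize(html: str) -> str:
    # Collapse whitespace so re-indented markup doesn't look like a change
    return re.sub(r"\s+", " ", html).strip()

def fingerprint(html: str) -> str:
    """Cheap change detection: store this hash, compare on the next crawl."""
    return hashlib.sha256(normalize(html).encode()).hexdigest()

def html_diff(old: str, new: str) -> list:
    """Readable diff for when the fingerprints disagree."""
    split = lambda h: re.sub(r">\s*<", ">\n<", normalize(h)).splitlines()
    return list(difflib.unified_diff(split(old), split(new), lineterm=""))

old = "<div><p>Price: $49</p></div>"
new = "<div><p>Price: $59</p></div>"
print(fingerprint(old) == fingerprint(new))  # False
print("-<p>Price: $49</p>" in html_diff(old, new))  # True
```

Real pages usually also need nonces, timestamps, and session-specific markup stripped before hashing, or every crawl looks like a change.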

SEO Auditing and Analysis

SEO tools need to see what search engines see. That means fully rendered HTML with all meta tags, Open Graph data, structured data (JSON-LD), and canonical URLs intact. Extracting raw HTML is the first step in any programmatic SEO audit.
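For example, pulling the JSON-LD structured data out of a rendered page takes only a few lines. This is a regex-based sketch; a production audit tool would use a real HTML parser:

```python
import json
import re

def extract_json_ld(html: str) -> list:
    """Return every parseable JSON-LD block found in the page."""
    pattern = r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>'
    blocks = []
    for raw in re.findall(pattern, html, flags=re.S | re.I):
        try:
            blocks.append(json.loads(raw))
        except json.JSONDecodeError:
            pass  # malformed JSON-LD is common in the wild; skip it
    return blocks

page = '''<head>
<script type="application/ld+json">
{"@type": "Article", "headline": "Extracting HTML"}
</script>
</head>'''
print(extract_json_ld(page))  # [{'@type': 'Article', 'headline': 'Extracting HTML'}]
```

Note that on JavaScript-heavy sites, JSON-LD is often injected client-side, so this only works on the rendered HTML, not the raw source.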

Archiving and Compliance

Some industries require archiving web content for compliance. Financial services, legal, and healthcare organizations often need to store snapshots of web pages as they appeared at a specific point in time. Raw HTML is the most complete representation.

Feeding Data into AI Pipelines

While Markdown conversion is often better for LLM consumption, there are cases where you want the raw HTML — for example, when training models on web layout structure, building accessibility tools, or extracting data from HTML tables that lose structure in Markdown conversion.
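A quick illustration of why raw HTML wins for tables: cell structure survives intact and can be walked directly. This sketch uses stdlib XML parsing, which only works for well-formed fragments — real pages need an HTML parser, or a library like pandas' `read_html`:

```python
import xml.etree.ElementTree as ET

def table_to_rows(table_html: str) -> list:
    """Convert a well-formed <table> fragment into a list of row lists."""
    table = ET.fromstring(table_html)
    return [
        ["".join(cell.itertext()).strip() for cell in row]  # th/td children
        for row in table.iter("tr")
    ]

sample = """<table>
  <tr><th>Plan</th><th>Price</th></tr>
  <tr><td>Pro</td><td>$49/mo</td></tr>
</table>"""
print(table_to_rows(sample))  # [['Plan', 'Price'], ['Pro', '$49/mo']]
```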

Going Beyond HTML: Related Endpoints

Once you're extracting HTML from URLs, you'll probably need related capabilities too. Context.dev has a few other scraping endpoints worth knowing about:

URL to Markdown

If your goal is to feed web content into an LLM or RAG pipeline, raw HTML is usually more than you need. The Markdown endpoint returns clean GitHub Flavored Markdown with navigation, ads, and cookie banners stripped out:

response = client.web.web_scrape_md(
    url="https://example.com/blog/post",
    include_links=True,
    include_images=False
)
print(response.markdown)

You can control whether links and images are preserved — useful for RAG pipelines where you want text-only content, versus agent workflows where links matter.

Image Extraction

The Images endpoint extracts all images from a page — <img> tags, inline SVGs, base64-encoded images, <picture> sources, and video poster frames — with alt text and element metadata:

response = client.web.web_scrape_images(url="https://example.com")
for image in response.images:
    print(image.src, image.alt, image.type)

Sitemap Discovery

Before scraping individual pages, you often need to know what pages exist. The Sitemap endpoint discovers and parses a domain's sitemaps, returning up to 500 deduplicated URLs:

response = client.web.web_scrape_sitemap(domain="example.com")
for url in response.urls:
    html = client.web.web_scrape_html(url=url)
    # Process each page

AI-Powered Data Extraction

If you're extracting HTML just to parse out specific data points, the AI Query endpoint can skip the middleman entirely. Describe what you want in plain English, and get structured JSON back:

response = client.brand.ai_query(
    domain="example.com",
    data_to_extract=[
        {
            "datapoint_name": "pricing",
            "datapoint_description": "Monthly pricing for the pro plan",
            "datapoint_example": "$49/mo",
            "datapoint_type": "text"
        }
    ]
)

No HTML parsing, no CSS selectors, no regex. You just describe what you want and get it back as JSON.

DIY vs. API: When to Use Which

Not every use case needs an API. Here's how I'd think about it:

Use a simple HTTP request (requests.get / fetch) when:

  • The target page is server-rendered (no JavaScript frameworks)
  • There's no bot protection
  • You're scraping a small number of pages infrequently
  • You control the target site

Use a headless browser (Puppeteer/Playwright) when:

  • You need to interact with the page (click, scroll, fill forms)
  • You're building a browser automation tool, not a data pipeline
  • You're scraping your own application for testing

Use an HTML extraction API when:

  • You need rendered HTML from JavaScript-heavy sites
  • Target sites have anti-bot protections
  • You're scraping at scale (dozens to thousands of URLs)
  • You don't want to manage browser infrastructure
  • Reliability matters — you need consistent results, not "works most of the time"

For most production use cases, the API wins on both developer time and operational cost. Running headless Chrome at scale gets expensive fast, and not just in compute — it's the engineering hours spent keeping it working that really add up.

Getting Started

Context.dev offers SDKs for TypeScript, Python, and Ruby, plus a REST API that works with any HTTP client.

To start extracting HTML:

  1. Sign up for a free account
  2. Grab your API key from the dashboard
  3. Make your first request:
curl -X GET "https://api.context.dev/v1/web/scrape/html?url=https://example.com" \
  -H "Authorization: Bearer YOUR_API_KEY"

The response includes the fully rendered HTML, the resolved URL, and a success flag. No browsers to manage, no proxies to configure, no stealth plugins to maintain. Just the HTML.

Frequently Asked Questions

Does the API execute JavaScript?

Yes. The HTML endpoint renders pages in a full headless browser, so you get the complete DOM after all JavaScript has executed — not the pre-render HTML source.

How does it handle pages that block scrapers?

Context.dev uses automatic proxy escalation. If a request is blocked, it retries through progressively more sophisticated proxy tiers (datacenter, residential, mobile) with proper browser fingerprinting. This achieves a 96%+ success rate on protected pages.

What's the rate limit?

Rate limits depend on your plan. The API returns standard HTTP 429 responses when limits are hit, with retry-after headers. Check the pricing page for details.
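Handling that politely in a client is straightforward. A sketch with the HTTP call injected as a callable, so no real API key or network is needed; a production client would also cap the total delay:

```python
import time

def request_with_backoff(send, max_retries=5):
    """Call `send()` -> (status, headers, body); on 429, wait for the
    Retry-After header if present, exponential backoff otherwise."""
    for attempt in range(max_retries):
        status, headers, body = send()
        if status != 429:
            return status, body
        delay = float(headers.get("Retry-After", 2 ** attempt))
        time.sleep(delay)
    raise RuntimeError("Still rate-limited after retries")

# Simulated server: rate-limited once, then succeeds
responses = iter([
    (429, {"Retry-After": "0"}, ""),
    (200, {}, "<html>...</html>"),
])
status, body = request_with_backoff(lambda: next(responses))
print(status)  # 200
```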

Can I scrape pages behind a login?

The HTML endpoint works with publicly accessible URLs. For pages behind authentication, you'd need to pass session cookies or use a headless browser with login automation.

What's the difference between the HTML and Markdown endpoints?

The HTML endpoint returns the raw rendered DOM — every tag, attribute, and script. The Markdown endpoint processes the HTML to extract just the readable content in clean Markdown format, stripping navigation, ads, and boilerplate. Use HTML when you need the full DOM structure. Use Markdown when you need the content for LLMs or human reading.

Is there a maximum page size?

Standard web pages work fine. Extremely large pages (multi-megabyte DOMs) may time out, but you can set a custom timeout up to 300 seconds (5 minutes) with the timeoutMS parameter.
