Web Scraping with Node.js: A Comprehensive Guide for 2026

If you are building an AI product, research workflow, enrichment pipeline, sales tool, or internal agent, Context.dev is the managed shortcut: one API can turn a URL into clean, token-efficient Markdown, rendered HTML, screenshots, image manifests, crawl results, search results, structured JSON, and brand context. The free tier gives you enough room to test real workflows, and the Web APIs handle the infrastructure that usually makes scraping expensive: browser rendering, proxy escalation, caching, Markdown conversion, and output shaping for LLMs.

This guide still teaches the hands-on path, because good scraping judgment comes from understanding what is happening below the API. Node.js is a strong scraping runtime in 2026 because the platform now has stable built-in fetch, first-class Web APIs, mature browser automation through Playwright, and excellent parsing libraries. A good Node scraper can fetch static HTML, parse it with Cheerio, crawl paginated pages with bounded concurrency, render JavaScript-heavy pages in Chromium, validate extracted records, and write data in a format your application can actually use.

The hard part is not making one request. The hard part is building a scraper that keeps working when a target site changes its HTML, moves data into a client-side API call, slows you down, serves a bot page, or returns a soft error inside a 200 response. This guide focuses on the full workflow, not only the first demo.

By the end, you will have a practical mental model for:

  • Choosing between built-in fetch, Cheerio, Playwright, Crawlee, and a managed API
  • Fetching pages with timeouts, headers, retries, sessions, and backoff
  • Parsing HTML with selectors that are readable and resilient
  • Handling pagination without duplicate fetches
  • Running concurrent jobs without overwhelming your process or the target site
  • Rendering JavaScript-heavy pages with Playwright only when it is actually needed
  • Validating extracted records with Zod
  • Storing results as CSV, JSON, JSONL, or database-ready rows
  • Debugging common failures like 403, 429, empty HTML, selector drift, and bot pages
  • Deciding when Context.dev is a better use of engineering time than owning scraper infrastructure

Start with the legal and ethical baseline

Web scraping can be legitimate, but "visible in a browser" does not mean "free to collect however you want." Before you build anything production-facing, check the target site's Terms of Service, its robots.txt, the sensitivity of the data, and the load your scraper will create. If you are collecting personal data, regulated data, authenticated data, or data behind a commercial license, involve legal review instead of treating scraping as a purely technical problem.

A practical baseline:

  • Prefer official APIs when they exist. They are usually more stable, documented, and contractually clear.
  • Read robots.txt. It is not the whole legal picture, but it is a clear operational signal from the site owner.
  • Identify your client honestly when appropriate. Internal crawlers often use a descriptive user agent and contact URL.
  • Throttle requests. A slower scraper that runs every day is better than an aggressive scraper that gets blocked once.
  • Avoid authentication walls and paywalls unless you have permission. Do not treat CAPTCHA pages, login requirements, or explicit blocks as puzzles to defeat.
  • Minimize data collection. Collect the fields you need, not every page and attribute just because your crawler can reach them.

The safest technical choice is often the simplest one: scrape less, request slowly, cache aggressively, and use an official API when one is available.
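
The "request slowly" advice can be made concrete with a small per-host throttle. The sketch below is one possible shape, not a library API: HostThrottle, reserve, and the one-second interval are hypothetical names and values chosen for illustration. It takes the current time as an argument so the logic can be checked without real timers.

```javascript
// Hypothetical helper: spaces out requests to the same host.
class HostThrottle {
  constructor(minIntervalMs = 1000) {
    this.minIntervalMs = minIntervalMs;
    this.nextSlot = new Map(); // host -> earliest time the next request may start
  }

  // Returns how many milliseconds the caller should wait before requesting url.
  reserve(url, now = Date.now()) {
    const host = new URL(url).host;
    const slot = Math.max(now, this.nextSlot.get(host) ?? 0);
    this.nextSlot.set(host, slot + this.minIntervalMs);
    return slot - now;
  }
}

const throttle = new HostThrottle(1000);
const waits = [
  throttle.reserve("https://books.toscrape.com/catalogue/page-1.html", 0),
  throttle.reserve("https://books.toscrape.com/catalogue/page-2.html", 0),
  throttle.reserve("https://quotes.toscrape.com/", 0),
];
console.log(waits); // [ 0, 1000, 0 ]
```

In a real crawler you would sleep for the returned number of milliseconds before each fetch. Requests to different hosts do not delay each other.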

Choose the simplest tool that works

Most Node scraping projects get overbuilt too early. Start with the cheapest tool that returns the correct data.

| Target page | Recommended tool | Why |
| --- | --- | --- |
| Static HTML page | Built-in fetch plus Cheerio | Fast, simple, easy to test |
| Static pages at moderate volume | Built-in fetch with bounded concurrency | No browser cost, good throughput |
| JavaScript-rendered page | Playwright | Runs a real browser and returns the rendered DOM |
| Large crawl with queues and retries | Crawlee or a job queue | Scheduling, deduplication, retries, and persistence |
| Heavily protected or frequently changing sites | Managed scraping API | Offloads browsers, proxies, rendering, retries, and extraction infrastructure |

Use browser automation only when the page genuinely needs it. Playwright is powerful, but a browser is slower, heavier, and more expensive than a normal HTTP request. Before reaching for Playwright, open DevTools, check the Network tab, and see whether the page loads the data from a JSON endpoint. Calling that endpoint directly is usually faster and easier to maintain than scraping the rendered DOM.
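
A rough way to decide whether a page needs a browser is to measure how much visible text the raw HTML contains. The sketch below is a heuristic under stated assumptions: visibleTextLength, probablyNeedsBrowser, and the 200-character threshold are hypothetical, and regex-based tag stripping is only good enough for a quick signal, not for parsing.

```javascript
// Hypothetical heuristic: estimate how much visible text a raw HTML response contains.
function visibleTextLength(html) {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, " ") // drop inline scripts and JSON blobs
    .replace(/<style[\s\S]*?<\/style>/gi, " ")   // drop inline CSS
    .replace(/<[^>]+>/g, " ")                    // drop remaining tags
    .replace(/\s+/g, " ")
    .trim().length;
}

// A nearly empty body suggests the content is rendered client-side.
function probablyNeedsBrowser(html, minTextLength = 200) {
  return visibleTextLength(html) < minTextLength;
}

const shell =
  '<html><body><div id="root"></div><script>window.__DATA__ = {"items": []};</script></body></html>';
const article =
  `<html><body><h1>Catalogue</h1><p>${"Book description. ".repeat(20)}</p></body></html>`;

console.log(probablyNeedsBrowser(shell));   // true
console.log(probablyNeedsBrowser(article)); // false
```

A true result is a prompt to open the Network tab, not proof that Playwright is required. The data may still be available from a JSON endpoint.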

Set up a clean Node project

Use a modern Node runtime. For new scraping projects in 2026, Node 20 or newer gives you built-in fetch and AbortSignal.timeout; Node 24 is a comfortable baseline if your deployment environment supports it.

Create a fresh project:

mkdir node-scraper
cd node-scraper
npm init -y
npm pkg set type=module

Install the core stack:

npm install cheerio playwright zod
npx playwright install chromium

The examples below use ECMAScript modules. That means either set "type": "module" in package.json, as shown above, or save scripts with the .mjs extension.

A small project structure is enough for most scrapers:

scraper/
  fetch.js
  parse.js
  models.js
  storage.js
  run.js
data/
  raw/
  processed/
test/
  parser.test.js
package.json

That separation matters as the scraper grows. Fetching, parsing, validation, and storage fail in different ways, so keep them in different modules. It also makes testing easier: save one HTML fixture, test the parser against it, and avoid hitting the live site on every test run.

Fetch HTML with built-in fetch

Modern Node has a browser-compatible fetch implementation built in. That should be your default for simple HTTP scraping.

// Named TARGET_URL rather than URL so it does not shadow the global URL
// constructor that later examples rely on (new URL(href, pageUrl)).
const TARGET_URL = "https://books.toscrape.com/catalogue/category/books/mystery_3/index.html";

const HEADERS = {
  "user-agent": [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "AppleWebKit/537.36 (KHTML, like Gecko)",
    "Chrome/124.0.0.0 Safari/537.36",
  ].join(" "),
  accept: "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
  "accept-language": "en-US,en;q=0.9",
};

const response = await fetch(TARGET_URL, {
  headers: HEADERS,
  signal: AbortSignal.timeout(15_000),
});
 
if (!response.ok) {
  throw new Error(`Request failed: ${response.status} ${response.statusText}`);
}
 
const html = await response.text();
console.log(response.status);
console.log(html.length);

Three habits make simple scrapers much less fragile:

  • Always use a timeout. A hanging socket should not freeze your job forever.
  • Always check response.ok. Silent 404, 429, and 500 responses create confusing parser errors later.
  • Send a realistic User-Agent. The default client identity can be treated as low-quality automated traffic.

Query parameters should go through URLSearchParams, not string concatenation:

const params = new URLSearchParams({ q: "laptop stand", page: "2" });
const response = await fetch(`https://example.com/search?${params}`, {
  headers: HEADERS,
  signal: AbortSignal.timeout(15_000),
});
 
console.log(response.url);

Let the platform encode the URL. It prevents subtle bugs with spaces, symbols, and repeated parameters.
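
Repeated parameters are the case where string concatenation most often goes wrong. The sketch below shows one way to handle them with the URL and URLSearchParams globals. buildUrl is a hypothetical helper name, and the example.com query is made up for illustration.

```javascript
// Hypothetical helper: builds a URL from a base and a params object.
// Array values become repeated parameters (tag=office&tag=desk).
function buildUrl(base, params) {
  const url = new URL(base);
  for (const [key, value] of Object.entries(params)) {
    const values = Array.isArray(value) ? value : [value];
    for (const item of values) {
      url.searchParams.append(key, String(item));
    }
  }
  return url.href;
}

const searchUrl = buildUrl("https://example.com/search", {
  q: "laptop stand & riser",
  tag: ["office", "desk"],
  page: 2,
});

console.log(searchUrl);
// https://example.com/search?q=laptop+stand+%26+riser&tag=office&tag=desk&page=2
```

The ampersand inside the search term is percent-encoded, so it cannot be mistaken for a parameter separator.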

Add retries carefully

Retries are useful for temporary failures, but they can also make a block worse. Retrying a temporary 503 a few times is reasonable. Retrying a 403 fifty times is not. When a site tells you to slow down with 429 Too Many Requests or a Retry-After header, slow down.

import { setTimeout as sleep } from "node:timers/promises";
 
function retryAfterMs(value) {
  if (!value) return null;
 
  const seconds = Number(value);
  if (Number.isFinite(seconds)) {
    return seconds * 1000;
  }
 
  const retryAt = Date.parse(value);
  if (Number.isNaN(retryAt)) {
    return null;
  }
 
  return Math.max(0, retryAt - Date.now());
}
 
function backoffMs(attempt, base = 750, cap = 30_000) {
  const delay = Math.min(cap, base * 2 ** attempt);
  const jitter = Math.random() * delay * 0.25;
  return delay + jitter;
}
 
async function fetchWithRetry(url, { retries = 3, headers = HEADERS } = {}) {
  for (let attempt = 0; attempt <= retries; attempt += 1) {
    let response;
 
    try {
      response = await fetch(url, {
        headers,
        signal: AbortSignal.timeout(20_000),
      });
    } catch (error) {
      // Timeouts, DNS failures, and other network errors reject the promise
      // instead of returning a response, so retry them here with backoff.
      if (attempt === retries) {
        throw error;
      }
      await sleep(backoffMs(attempt));
      continue;
    }
 
    if (response.ok) {
      return response;
    }
 
    const retryable = [408, 429, 500, 502, 503, 504].includes(response.status);
    if (!retryable || attempt === retries) {
      throw new Error(`Request failed: ${response.status} ${response.statusText}`);
    }
 
    // Cancel the unused body so the connection can be released back to the pool
    // before the next attempt.
    await response.body?.cancel();
 
    const retryAfter = retryAfterMs(response.headers.get("retry-after"));
    await sleep(retryAfter ?? backoffMs(attempt));
  }
 
  throw new Error("Unreachable retry state");
}
 
const response = await fetchWithRetry("https://books.toscrape.com/");
console.log(await response.text());

Backoff is not a way to force your way through a policy block. It is a way to be less noisy when a server is busy or rate limiting you.
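
To see what the defaults in backoffMs actually produce, it helps to print the schedule. The sketch below uses the same formula but returns the lower and upper bound for each attempt instead of a random value. backoffBounds is a hypothetical name introduced only for this illustration.

```javascript
// Same formula as backoffMs, but returning the bounds instead of a random value.
// min is the capped exponential delay; max adds the largest possible jitter (25%).
function backoffBounds(attempt, base = 750, cap = 30_000) {
  const delay = Math.min(cap, base * 2 ** attempt);
  return { min: delay, max: delay * 1.25 };
}

const schedule = Array.from({ length: 7 }, (_, attempt) => backoffBounds(attempt));

for (const [attempt, { min, max }] of schedule.entries()) {
  console.log(`attempt ${attempt}: ${min} ms up to ${max} ms`);
}
```

With the defaults, the delay starts at 750 ms, doubles on each attempt, and stops growing at the 30-second cap from attempt 6 onward. Because jitter is added after the cap is applied, a single wait can reach 37.5 seconds.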

Check robots.txt before crawling

For one-off manual fetches, reading the target site's policy may be enough. For a crawler, automate the basic check. A full robots implementation has edge cases, but even a small allow/disallow check is better than ignoring the file entirely.

function longestMatch(rules, targetPath) {
  let longest = -1;
  for (const rule of rules) {
    if (rule !== "" && targetPath.startsWith(rule) && rule.length > longest) {
      longest = rule.length;
    }
  }
  return longest;
}
 
function parseRobots(text) {
  const groups = [];
  let current = null;
  let collectingAgents = false;
 
  for (const rawLine of text.split("\n")) {
    const line = rawLine.split("#")[0].trim();
    if (!line) continue;
 
    const [rawKey, ...rawValue] = line.split(":");
    const key = rawKey.trim().toLowerCase();
    const value = rawValue.join(":").trim();
 
    if (key === "user-agent") {
      // Consecutive User-agent lines belong to the same rule group, so only
      // start a new group when the previous line was a rule, not another agent.
      if (!collectingAgents) {
        current = { agents: [], disallow: [], allow: [] };
        groups.push(current);
        collectingAgents = true;
      }
      current.agents.push(value.toLowerCase());
    } else if (current && key === "disallow") {
      collectingAgents = false;
      current.disallow.push(value);
    } else if (current && key === "allow") {
      collectingAgents = false;
      current.allow.push(value);
    }
  }
 
  return groups;
}
 
async function canFetch(url, userAgent = "*") {
  const target = new URL(url);
  const robotsUrl = `${target.origin}/robots.txt`;
  const response = await fetch(robotsUrl, { signal: AbortSignal.timeout(10_000) });
 
  if (response.status === 404) return true;
  if (!response.ok) return false;
 
  const groups = parseRobots(await response.text());
  const agent = userAgent.toLowerCase();
  const group = groups.find((g) => g.agents.includes(agent)) ?? groups.find((g) => g.agents.includes("*"));
 
  if (!group) return true;
 
  // The most specific (longest) matching rule wins, and Allow wins ties. This
  // prevents a broad Allow prefix from overriding a more specific Disallow.
  const allowMatch = longestMatch(group.allow, target.pathname);
  const disallowMatch = longestMatch(group.disallow, target.pathname);
 
  if (disallowMatch === -1) return true;
  return allowMatch >= disallowMatch;
}
 
const target = "https://books.toscrape.com/catalogue/page-1.html";
if (!(await canFetch(target))) {
  throw new Error(`robots.txt disallows scraping ${target}`);
}

Treat this as one input, not a complete decision engine. Terms of Service, privacy rules, account agreements, rate limits, and common sense still matter.
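
One of the edge cases the small checker skips is wildcard matching: the Robots Exclusion Protocol (RFC 9309) lets rules use "*" for any run of characters and a trailing "$" to anchor the end of the path, while canFetch only compares plain prefixes. The sketch below shows one way to translate such a pattern into a regular expression. robotsPatternToRegExp is a hypothetical helper, not part of any library.

```javascript
// Hypothetical helper: converts a robots.txt path pattern to a RegExp.
// "*" matches any run of characters; a trailing "$" anchors the end of the path.
function robotsPatternToRegExp(pattern) {
  const anchored = pattern.endsWith("$");
  const body = anchored ? pattern.slice(0, -1) : pattern;
  const source = body
    .split("*")
    // Escape regex metacharacters so "." or "?" in a path are matched literally.
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  return new RegExp(`^${source}${anchored ? "$" : ""}`);
}

const pdfRule = robotsPatternToRegExp("/private/*.pdf$");
console.log(pdfRule.test("/private/reports/2026.pdf"));     // true
console.log(pdfRule.test("/private/reports/2026.pdf?x=1")); // false
console.log(robotsPatternToRegExp("/search").test("/search/results")); // true
```

A pattern without wildcards still behaves as a prefix match, so this could replace the startsWith comparison if you also keep the "longest rule wins" logic.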

Parse HTML with Cheerio

Cheerio gives Node a fast, server-side jQuery-style API for HTML parsing. It does not run JavaScript. It parses the HTML you already have.

import * as cheerio from "cheerio";
 
const $ = cheerio.load(html);
 
const title = $("h1").first().text().trim();
console.log(title || "No title");
 
const links = $("a[href]")
  .slice(0, 5)
  .map((_, element) => {
    const link = $(element);
    return {
      href: link.attr("href"),
      text: link.text().trim(),
    };
  })
  .get();
 
console.log(links);
 
const price = $("p.price_color").first().text().trim();
console.log(price || "No price");

For production scrapers, prefer small helper functions over inline parsing everywhere. They make missing elements explicit and keep parser code readable.

function textOrNull($, selector, root = null) {
  const node = root ? root.find(selector).first() : $(selector).first();
  const text = node.text().replace(/\s+/g, " ").trim();
  return text || null;
}
 
function attrOrNull($, selector, attr, root = null) {
  const node = root ? root.find(selector).first() : $(selector).first();
  const value = node.attr(attr);
  return value || null;
}

Missing fields are now a normal case that comes back as null, not a surprise "Cannot read properties of undefined" error.

Build a realistic product scraper

The sandbox site books.toscrape.com is useful because it behaves like a small ecommerce listing without creating load on a real retailer. Here is a parser that extracts records from one listing page.

import * as cheerio from "cheerio";
 
const RATING_MAP = new Map([
  ["One", 1],
  ["Two", 2],
  ["Three", 3],
  ["Four", 4],
  ["Five", 5],
]);
 
function parsePriceCents(raw) {
  const match = raw.replaceAll(",", "").match(/\d+(?:\.\d+)?/);
  if (!match) {
    throw new Error(`Could not parse price from ${JSON.stringify(raw)}`);
  }
 
  return Math.round(Number(match[0]) * 100);
}
 
function parseBooks(html, pageUrl) {
  const $ = cheerio.load(html);
  const books = [];
 
  $("article.product_pod").each((_, element) => {
    const card = $(element);
    const titleNode = card.find("h3 a").first();
    const title = titleNode.attr("title")?.trim();
    const href = titleNode.attr("href");
    const priceText = card.find("p.price_color").first().text().trim();
    const ratingClass = card
      .find("p.star-rating")
      .attr("class")
      ?.split(/\s+/)
      .find((className) => RATING_MAP.has(className));
 
    if (!title || !href || !priceText || !ratingClass) {
      return;
    }
 
    books.push({
      title,
      priceCents: parsePriceCents(priceText),
      rating: RATING_MAP.get(ratingClass),
      url: new URL(href, pageUrl).href,
    });
  });
 
  return books;
}

There are a few intentional choices here:

  • Prices are stored as cents instead of floating-point dollars.
  • new URL(href, pageUrl) handles relative links correctly.
  • Missing nodes cause the item to be skipped instead of crashing the whole job.
  • The parser accepts html and pageUrl, which makes it easy to test with saved fixtures.

For stricter data quality, collect skipped records and log why they were skipped. In production, a sudden jump in skipped items is often the first sign that the target site's HTML changed.
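
One way to record why items were skipped is to name the missing fields instead of returning silently. The sketch below works on plain objects so it runs without Cheerio. missingFields, summarizeSkips, and the sample cards are hypothetical and only illustrate the idea.

```javascript
// Hypothetical helper: names the required fields that are missing from one card.
function missingFields(fields) {
  return Object.entries(fields)
    .filter(([, value]) => value === undefined || value === null || value === "")
    .map(([name]) => name);
}

// Counts how often each field was missing across all skipped cards.
function summarizeSkips(skipped) {
  const counts = {};
  for (const { missing } of skipped) {
    for (const name of missing) {
      counts[name] = (counts[name] ?? 0) + 1;
    }
  }
  return counts;
}

// Sample data standing in for the values extracted from each product card.
const cards = [
  { title: "A Light in the Attic", href: "a.html", priceText: "£51.77", ratingClass: "Three" },
  { title: "Broken book", href: "b.html", priceText: "", ratingClass: undefined },
  { title: undefined, href: "c.html", priceText: "£10.00", ratingClass: "One" },
];

const skipped = [];
for (const card of cards) {
  const missing = missingFields(card);
  if (missing.length > 0) skipped.push({ href: card.href, missing });
}

console.log(summarizeSkips(skipped)); // { priceText: 1, ratingClass: 1, title: 1 }
```

If one field name suddenly dominates the summary, the selector for that field is the first place to look.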

Handle pagination without duplicate fetches

Many tutorials accidentally fetch the same page twice: once to parse items and once to find the next link. Do both from the same HTML response.

import * as cheerio from "cheerio";
import { setTimeout as sleep } from "node:timers/promises";
 
async function fetchHtml(url) {
  const response = await fetch(url, {
    headers: HEADERS,
    signal: AbortSignal.timeout(15_000),
  });
 
  if (!response.ok) {
    throw new Error(`Request failed: ${response.status} ${response.statusText}`);
  }
 
  return response.text();
}
 
function findNextPage(html, pageUrl) {
  const $ = cheerio.load(html);
  const href = $("li.next a[href]").first().attr("href");
  return href ? new URL(href, pageUrl).href : null;
}
 
async function crawlBooks(startUrl, { delayMs = 250, maxPages = 3 } = {}) {
  const books = [];
  let pageUrl = startUrl;
  let pagesSeen = 0;
 
  while (pageUrl && pagesSeen < maxPages) {
    const html = await fetchHtml(pageUrl);
    books.push(...parseBooks(html, pageUrl));
    pageUrl = findNextPage(html, pageUrl);
    pagesSeen += 1;
    await sleep(delayMs);
  }
 
  return books;
}
 
const books = await crawlBooks("https://books.toscrape.com/catalogue/page-1.html");
console.log(`Scraped ${books.length} books`);

For numbered pagination, a range loop is fine when you know the bounds:

for (let pageNumber = 1; pageNumber <= 3; pageNumber += 1) {
  const url = `https://books.toscrape.com/catalogue/page-${pageNumber}.html`;
  const html = await fetchHtml(url);
  const items = parseBooks(html, url);
  console.log(pageNumber, items.length);
}

When you do not know the last page, follow the next link until it disappears, or parse the last page number from the pagination controls on the first page.
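
Parsing the last page number can be as small as one regular expression. The sketch below assumes the pagination control's text looks like "Page 1 of 50"; that format, the "{page}" URL template, and the helper names parseLastPage and pageUrls are assumptions for illustration, so check the target site's actual markup first.

```javascript
// Hypothetical helper: extracts the last page number from text such as "Page 1 of 50".
function parseLastPage(paginationText) {
  const match = paginationText.match(/page\s+\d+\s+of\s+(\d+)/i);
  return match ? Number(match[1]) : null;
}

// Builds the list of page URLs, capped so a parsing mistake cannot trigger a huge crawl.
function pageUrls(template, lastPage, maxPages = 100) {
  const count = Math.min(lastPage, maxPages);
  return Array.from({ length: count }, (_, index) =>
    template.replace("{page}", String(index + 1)),
  );
}

const lastPage = parseLastPage("\n   Page 1 of 50\n  ");
const listingUrls = pageUrls(
  "https://books.toscrape.com/catalogue/page-{page}.html",
  lastPage ?? 1, // Fall back to a single page when the text cannot be parsed.
  3,
);

console.log(lastPage);    // 50
console.log(listingUrls); // the first three page URLs
```

The maxPages cap mirrors the maxPages option in crawlBooks: whichever pagination style you use, keep an upper bound.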

Use bounded concurrency

Node makes it easy to start many requests at once. That does not mean you should. Promise.all(urls.map(fetch)) over 10,000 URLs can overwhelm your process and the target site. Use a worker queue or semaphore to keep concurrency under control.

async function mapWithConcurrency(items, concurrency, worker) {
  const results = new Array(items.length);
  let nextIndex = 0;
 
  async function runWorker() {
    while (nextIndex < items.length) {
      const currentIndex = nextIndex;
      nextIndex += 1;
      results[currentIndex] = await worker(items[currentIndex], currentIndex);
    }
  }
 
  const workers = Array.from(
    { length: Math.min(concurrency, items.length) },
    () => runWorker(),
  );
 
  await Promise.all(workers);
  return results;
}
 
const urls = Array.from(
  { length: 3 },
  (_, index) => `https://books.toscrape.com/catalogue/page-${index + 1}.html`,
);
 
const htmlPages = await mapWithConcurrency(urls, 2, async (url) => fetchHtml(url));
console.log(htmlPages.map((page) => page.length));

Start with low concurrency, measure success rate, then increase slowly. If total run time drops but error rates climb, you are not making the scraper better. You are just making it louder.

For serious pipelines, return structured results instead of raw strings:

async function fetchResult(url) {
  try {
    const response = await fetch(url, {
      headers: HEADERS,
      signal: AbortSignal.timeout(20_000),
    });
 
    const body = await response.text();
    return {
      url,
      status: response.status,
      ok: response.ok,
      html: response.ok ? body : null,
      error: response.ok ? null : body.slice(0, 500),
    };
  } catch (error) {
    return {
      url,
      status: null,
      ok: false,
      html: null,
      error: error instanceof Error ? error.message : String(error),
    };
  }
}
 
const results = await mapWithConcurrency(urls, 2, fetchResult);
console.log(results.map((result) => ({ url: result.url, ok: result.ok })));

That shape lets the rest of the pipeline continue even when a few URLs fail.
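
Structured results also make the success rate easy to measure. The sketch below summarizes an array shaped like the fetchResult output. summarizeResults and the "network_error" label are hypothetical names, and the sample results are made up.

```javascript
// Hypothetical helper: summarizes an array of { url, status, ok } results.
function summarizeResults(results) {
  const failuresByStatus = {};
  let okCount = 0;

  for (const result of results) {
    if (result.ok) {
      okCount += 1;
    } else {
      // A null status means the request never produced a response.
      const key = result.status ?? "network_error";
      failuresByStatus[key] = (failuresByStatus[key] ?? 0) + 1;
    }
  }

  return {
    total: results.length,
    successRate: results.length === 0 ? 0 : okCount / results.length,
    failuresByStatus,
  };
}

const summary = summarizeResults([
  { url: "https://example.com/a", status: 200, ok: true },
  { url: "https://example.com/b", status: 429, ok: false },
  { url: "https://example.com/c", status: null, ok: false },
  { url: "https://example.com/d", status: 200, ok: true },
]);

console.log(summary);
// { total: 4, successRate: 0.5, failuresByStatus: { '429': 1, network_error: 1 } }
```

Comparing this summary before and after a concurrency change shows whether the extra speed is costing you failed requests.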

Use Playwright for JavaScript-rendered pages

If await fetch(url).then((r) => r.text()) returns a mostly empty shell but your browser shows real content, the page may be rendering data with JavaScript. Playwright runs Chromium, Firefox, or WebKit programmatically, so you can wait for the rendered DOM and then parse it.

import { chromium } from "playwright";
 
async function renderHtml(url, selector) {
  const browser = await chromium.launch({ headless: true });
 
  try {
    const context = await browser.newContext({
      userAgent: HEADERS["user-agent"],
      viewport: { width: 1280, height: 800 },
    });
 
    const page = await context.newPage();
    await page.goto(url, { waitUntil: "domcontentloaded", timeout: 30_000 });
    await page.waitForSelector(selector, { timeout: 15_000 });
    return await page.content();
  } finally {
    await browser.close();
  }
}
 
const renderedHtml = await renderHtml("https://quotes.toscrape.com/js/", ".quote");
console.log("Rendered characters:", renderedHtml.length);

Prefer waiting for a specific selector over waiting for vague page states. The selector says, "the data I need is present." A generic network idle wait can be flaky on pages with analytics, ads, live chat, or long-polling requests.

Playwright is also useful when you need interaction:

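async function clickLoadMoreUntilDone(page, maxClicks = 50) {
  // Cap the loop so a button that never disappears cannot spin forever.
  for (let clicks = 0; clicks < maxClicks; clicks += 1) {
    const button = page.getByRole("button", { name: "Load more" });

    if ((await button.count()) === 0) {
      return;
    }

    await button.click();
    await page.waitForTimeout(750);
  }
}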

Use browser automation sparingly. Running hundreds of browser contexts costs memory and CPU. If you only need one JSON response that the page fetches after load, call that endpoint directly instead.

Inspect the Network tab before parsing the DOM

Many "JavaScript scraping" tasks are really API discovery tasks. Open DevTools, refresh the page, and filter Network requests by Fetch/XHR. Look for JSON responses that contain the data you need. If you find one, you can often replace a fragile browser scraper with a simple HTTP request.

const apiUrl = new URL("https://dummyjson.com/products");
apiUrl.search = new URLSearchParams({ limit: "3", skip: "0" });
 
const response = await fetch(apiUrl, {
  headers: HEADERS,
  signal: AbortSignal.timeout(15_000),
});
 
if (!response.ok) {
  throw new Error(`API request failed: ${response.status}`);
}
 
const data = await response.json();
 
for (const item of data.products) {
  console.log(item.title, item.price);
}

Be careful here. An endpoint being visible in DevTools does not mean it is open for unrestricted use. Check whether the request depends on authentication, CSRF tokens, signed URLs, or Terms of Service restrictions. The right conclusion might be "ask for API access", not "copy every private request header."

Validate extracted data

Scrapers fail quietly when every field is just "some string." A price becomes "Sold out", a date becomes "Coming soon", or a selector starts returning a promotional badge instead of a product title. Add validation close to the parser.

Zod is a good fit for Node because it can validate runtime data and document the shape at the same time:

import { z } from "zod";
 
const BookRecord = z.object({
  title: z.string().trim().min(1),
  priceCents: z.number().int().nonnegative(),
  rating: z.number().int().min(1).max(5),
  url: z.string().url(),
});
 
function validateBooks(rawBooks) {
  const valid = [];
  const invalid = [];
 
  for (const rawBook of rawBooks) {
    const result = BookRecord.safeParse(rawBook);
    if (result.success) {
      valid.push(result.data);
    } else {
      invalid.push({ rawBook, error: result.error.flatten() });
    }
  }
 
  return { valid, invalid };
}
 
const validation = validateBooks(books);
console.log(validation.valid.length, validation.invalid.length);

Validation should not make your scraper brittle. It should make failures visible. Store invalid records separately with the source URL, raw value, and error message so you can fix the parser without guessing.

Store results in the right format

CSV is still the easiest format for spreadsheets and quick inspection:

import { mkdir, writeFile } from "node:fs/promises";
import { dirname } from "node:path";
 
function csvValue(value) {
  const text = String(value ?? "");
  if (/[",\n\r]/.test(text)) {
    return `"${text.replaceAll('"', '""')}"`;
  }
  return text;
}
 
async function saveBooksCsv(books, path) {
  const header = ["title", "priceCents", "rating", "url"];
  const rows = books.map((book) =>
    header.map((field) => csvValue(book[field])).join(","),
  );
 
  await mkdir(dirname(path), { recursive: true });
  await writeFile(path, [header.join(","), ...rows].join("\n"), "utf8");
}

JSON is better for nested records and API handoff:

async function saveBooksJson(books, path) {
  await mkdir(dirname(path), { recursive: true });
  await writeFile(path, JSON.stringify(books, null, 2), "utf8");
}

JSONL is a good default for large jobs because it lets you append one record per line and stream the file later:

import { createWriteStream } from "node:fs";
 
async function saveBooksJsonl(books, path) {
  await mkdir(dirname(path), { recursive: true });
 
  const stream = createWriteStream(path, { encoding: "utf8" });
  for (const book of books) {
    stream.write(`${JSON.stringify(book)}\n`);
  }
 
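  // Register both listeners before ending the stream so a write error rejects
  // the promise instead of being reported through the end() callback.
  await new Promise((resolve, reject) => {
    stream.on("error", reject);
    stream.on("finish", resolve);
    stream.end();
  });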
}

For a production database, keep the same shape: parse a page, validate records, then upsert by a stable key such as canonical URL or product ID. Whether you use Postgres, SQLite, BigQuery, or a warehouse, avoid holding every record in memory when the crawl might grow to millions of rows.
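
Upserting by a stable key only works if the same page always produces the same key. The sketch below normalizes URLs before using them as keys, with a Map standing in for the database. The tracking-parameter list and the names canonicalKey and upsertAll are assumptions for illustration; adjust them to the target site.

```javascript
// Hypothetical list of tracking parameters to drop; adjust for the target site.
const TRACKING_PARAMS = ["utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"];

// Produces a stable key so the same page is not stored twice under different URLs.
function canonicalKey(rawUrl) {
  const url = new URL(rawUrl);
  url.hash = ""; // Fragments never reach the server.
  for (const name of TRACKING_PARAMS) {
    url.searchParams.delete(name);
  }
  url.searchParams.sort(); // Parameter order should not create a new key.
  return url.href;
}

// In-memory stand-in for a database upsert: the latest record per key wins.
function upsertAll(store, records) {
  for (const record of records) {
    store.set(canonicalKey(record.url), record);
  }
  return store;
}

const store = upsertAll(new Map(), [
  { url: "https://example.com/p/1?utm_source=mail&ref=a#reviews", priceCents: 1000 },
  { url: "https://example.com/p/1?ref=a", priceCents: 1200 },
  { url: "https://example.com/p/2", priceCents: 500 },
]);

console.log([...store.keys()]);
// [ 'https://example.com/p/1?ref=a', 'https://example.com/p/2' ]
```

In a real database the same key would back a unique constraint, so the upsert happens page by page instead of after the whole crawl.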

Test parsers with saved fixtures

The fetcher talks to the network. The parser should not need to. Save representative HTML files and test extraction against those fixtures. This catches selector drift, protects you from accidental parser regressions, and lets you work locally without hitting the target site on every test run.

Node ships with a built-in test runner:

import test from "node:test";
import assert from "node:assert/strict";
 
test("parseBooks extracts a complete product card", () => {
  const fixture = `
    <html>
      <body>
        <article class="product_pod">
          <p class="star-rating Three"></p>
          <h3><a href="https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light</a></h3>
          <p class="price_color">$51.77</p>
        </article>
      </body>
    </html>
  `;
 
  const parsed = parseBooks(fixture, "https://books.toscrape.com/catalogue/page-1.html");
 
  assert.equal(parsed.length, 1);
  assert.equal(parsed[0].title, "A Light in the Attic");
  assert.equal(parsed[0].priceCents, 5177);
  assert.equal(parsed[0].rating, 3);
  assert.ok(parsed[0].url.startsWith("https://books.toscrape.com/"));
});

The most valuable fixtures are not perfect pages. Keep examples of edge cases:

  • A normal listing page
  • A page with missing prices or empty fields
  • A page with a changed card layout
  • A page that returned a soft error inside a 200 response
  • A page with no results
  • A page with unusual characters, currencies, or encodings

When a production run fails, save the raw HTML and add it as a fixture before changing the parser. That gives you a regression test for the exact breakage.

You can also test failure behavior directly:

test("parseBooks skips incomplete cards", () => {
  const fixture = `
    <html>
      <body>
        <article class="product_pod">
          <h3><a href="https://example.com" title="Broken book"></a></h3>
        </article>
      </body>
    </html>
  `;
 
  const parsed = parseBooks(fixture, "https://example.com/catalogue/page-1.html");
  assert.deepEqual(parsed, []);
});

That test is small, but it documents an important rule: incomplete cards should not crash the run.

Cache pages and crawl incrementally

Caching makes scrapers cheaper, faster, and easier to debug. If a page has not changed since your last run, you may not need to fetch it again. At minimum, keep a local record of visited URLs, last fetch time, status code, content hash, and extracted record count.

import { createHash } from "node:crypto";
 
function hashHtml(html) {
  return createHash("sha256").update(html, "utf8").digest("hex");
}
 
function snapshotPage({ url, status, html, recordCount }) {
  return {
    url,
    status,
    contentHash: hashHtml(html),
    recordCount,
    scrapedAt: new Date().toISOString(),
  };
}

For sites that support HTTP caching, preserve ETag and Last-Modified headers. On the next run, send If-None-Match or If-Modified-Since. A 304 Not Modified response means the server is telling you the page has not changed, so you can reuse your previous parse result.

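async function fetchIfChanged(url, previousEtag) {
  // Only send If-None-Match when a previous ETag exists; an undefined header
  // value would otherwise be serialized as the literal string "undefined".
  const headers = previousEtag
    ? { ...HEADERS, "if-none-match": previousEtag }
    : HEADERS;

  const response = await fetch(url, {
    headers,
    signal: AbortSignal.timeout(15_000),
  });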
 
  if (response.status === 304) {
    return null;
  }
 
  if (!response.ok) {
    throw new Error(`Request failed: ${response.status}`);
  }
 
  return {
    html: await response.text(),
    etag: response.headers.get("etag"),
  };
}

Incremental crawling is the same idea at the URL level. Instead of crawling every page every time, prioritize pages that are new, recently changed, important to the business, or historically unstable. For ecommerce sites, category pages may need frequent refreshes, while old product detail pages can be checked less often. For documentation sites, sitemap lastmod values can help you choose what to revisit first.

Keep incremental logic simple until the data proves it needs to be fancy:

  • New URLs go to the front of the queue.
  • Failed URLs get a limited number of retries.
  • Recently changed URLs are revisited sooner.
  • Stable URLs are revisited later.
  • Removed pages are marked inactive instead of immediately deleted.

This approach also helps when a run is interrupted. If every page has a status in your database, the next run can resume from unfinished URLs instead of starting over.
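
The "revisited sooner" and "revisited later" rules can be expressed as an adaptive interval. The sketch below is one simple policy, not a recommendation for every site: the halve-or-double rule, the one-hour and one-week bounds, and the names nextIntervalMs and scheduleNextVisit are all assumptions chosen for illustration.

```javascript
const HOUR_MS = 60 * 60 * 1000;

// Hypothetical policy: halve the interval when a page changed, double it when it
// did not, and keep the result between one hour and one week.
function nextIntervalMs(previousIntervalMs, changed, min = HOUR_MS, max = 7 * 24 * HOUR_MS) {
  const proposed = changed ? previousIntervalMs / 2 : previousIntervalMs * 2;
  return Math.min(max, Math.max(min, proposed));
}

// Compares the stored content hash with the new one and plans the next visit.
function scheduleNextVisit(page, newHash, now = Date.now()) {
  const changed = page.contentHash !== newHash;
  const intervalMs = nextIntervalMs(page.intervalMs, changed);
  return { ...page, contentHash: newHash, intervalMs, nextVisitAt: now + intervalMs };
}

const page = { url: "https://example.com/category", contentHash: "abc", intervalMs: 4 * HOUR_MS };
const unchanged = scheduleNextVisit(page, "abc", 0);
const changed = scheduleNextVisit(page, "def", 0);

console.log(unchanged.intervalMs / HOUR_MS); // 8
console.log(changed.intervalMs / HOUR_MS);   // 2
```

Storing nextVisitAt next to each URL lets an interrupted run resume by selecting the rows whose time has passed.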

Make failures observable

A scraper without logs is a guessing game. At minimum, log the URL, status code, elapsed time, content type, byte count, record count, and parser errors.

import { performance } from "node:perf_hooks";
 
async function fetchWithLogging(url) {
  const started = performance.now();
  const response = await fetch(url, {
    headers: HEADERS,
    signal: AbortSignal.timeout(15_000),
  });
  const elapsedMs = Math.round(performance.now() - started);
  const body = await response.text();
 
  console.info("fetched", {
    url,
    status: response.status,
    elapsedMs,
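    // String length counts UTF-16 code units, so measure the encoded size instead.
    bytes: Buffer.byteLength(body, "utf8"),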
    contentType: response.headers.get("content-type"),
  });
 
  if (!response.ok) {
    throw new Error(`Request failed: ${response.status}`);
  }
 
  return body;
}

Track these metrics over time:

  • Fetch success rate
  • Parser success rate
  • Records extracted per page
  • Duplicate rate
  • Median and p95 fetch latency
  • Count of 403, 404, 429, and 5xx responses
  • Number of pages skipped by robots or policy rules

The most useful alert is often "records per page dropped to zero." It catches empty HTML, selector drift, bot pages, and broken JavaScript rendering in one signal.
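
A records-per-page alert does not need a monitoring stack to get started. The sketch below compares the current count with the median of recent runs. recordDropAlert, the minimum baseline of 5, and the 50% drop ratio are hypothetical choices for illustration.

```javascript
// Hypothetical check: flags a page whose record count collapsed relative to recent runs.
// Returns a message when an alert should fire, or null when the count looks normal.
function recordDropAlert(history, current, { minBaseline = 5, dropRatio = 0.5 } = {}) {
  if (history.length === 0) return null;

  // Middle element of the sorted history (the upper median for even lengths).
  const sorted = [...history].sort((a, b) => a - b);
  const median = sorted[Math.floor(sorted.length / 2)];

  if (median < minBaseline) return null; // Baseline too small for a drop to mean much.
  if (current === 0) return `records dropped to zero (median was ${median})`;
  if (current < median * dropRatio) return `records fell to ${current} (median was ${median})`;
  return null;
}

console.log(recordDropAlert([20, 20, 19, 20], 0));  // records dropped to zero (median was 20)
console.log(recordDropAlert([20, 20, 19, 20], 18)); // null
```

Using the median rather than the last run keeps one unusual page from resetting the baseline.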

Always save raw HTML for failed pages:

import { mkdir, writeFile } from "node:fs/promises";
import { createHash } from "node:crypto";
 
async function saveDebugHtml(url, html) {
  const digest = createHash("sha256").update(url).digest("hex").slice(0, 12);
  const path = `data/raw/${digest}.html`;
  await mkdir("data/raw", { recursive: true });
  await writeFile(path, html, "utf8");
  return path;
}

That file is the fastest way to answer, "Did fetching fail, or did parsing fail?"

Debug common scraping failures

When a scraper breaks, do not immediately add proxies or a headless browser. First, identify the failure mode.

| Symptom | Likely cause | What to check |
| --- | --- | --- |
| 403 Forbidden | Access denied, policy block, missing session, or bot protection | Compare headers, cookies, robots, Terms, and browser behavior |
| 429 Too Many Requests | Rate limit | Read Retry-After, reduce concurrency, add backoff |
| 200 with no records | Selector drift or rendered content | Save HTML, inspect it, compare with browser DOM |
| 200 with CAPTCHA content | Automated access blocked | Stop or use an approved access path |
| Timeout | Slow server, network issue, heavy page | Increase read timeout, retry gently, reduce concurrency |
| Garbled text | Wrong encoding | Check content type, response bytes, and parser output |
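
For the 429 row, reading Retry-After takes a little care because the header can hold either a number of seconds or an HTTP date. The following is a minimal sketch; retryAfterMs and nextDelayMs are hypothetical helper names, and the backoff defaults are assumptions rather than recommendations.

```javascript
// Sketch: turn a Retry-After header into a wait time in milliseconds.
// The value is either delay-seconds ("120") or an HTTP date.
function retryAfterMs(headerValue, now = Date.now()) {
  if (!headerValue) return null;
  const trimmed = headerValue.trim();
  if (/^\d+$/.test(trimmed)) {
    return Number(trimmed) * 1000;
  }
  const dateMs = Date.parse(trimmed);
  if (Number.isNaN(dateMs)) return null;
  // A date in the past means "retry now", never a negative delay.
  return Math.max(0, dateMs - now);
}

// Hypothetical policy: honor the server's hint, otherwise use capped exponential backoff.
function nextDelayMs(response, attempt, { baseMs = 1_000, capMs = 60_000 } = {}) {
  const hinted = retryAfterMs(response.headers.get("retry-after"));
  if (hinted !== null) return hinted;
  return Math.min(capMs, baseMs * 2 ** attempt);
}
```

If the server asks for a longer wait than your job can tolerate, it is usually better to park that host than to retry early.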

Start with evidence:

const response = await fetch("https://example.com/products", {
  headers: HEADERS,
  signal: AbortSignal.timeout(15_000),
});
 
const body = await response.text();
 
console.log("status:", response.status);
console.log("server:", response.headers.get("server"));
console.log("content-type:", response.headers.get("content-type"));
console.log("body preview:", body.slice(0, 600));

A status code alone rarely tells you enough. The body usually fills in the rest: a Cloudflare challenge, a CAPTCHA page, an application error, a login wall, or a normal page with changed markup.
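
One way to make that inspection repeatable is to classify each response by status and body together. The sketch below is illustrative only: classifyResponse is a hypothetical helper, and the marker strings are assumptions that need tuning per target site, since a legitimate page can also mention a CAPTCHA widget.

```javascript
// Hypothetical heuristics for spotting soft errors hidden inside a response body.
const BODY_SIGNALS = [
  { kind: "captcha", pattern: /captcha|are you a robot|verify you are human/i },
  { kind: "challenge", pattern: /just a moment|checking your browser|attention required/i },
  { kind: "login_wall", pattern: /sign in to continue|log in to continue|please log in/i },
  { kind: "app_error", pattern: /something went wrong|internal server error/i },
];

function classifyResponse(status, body) {
  if (status === 429) return "rate_limited";
  if (status >= 500) return "server_error";
  // Body markers are checked before 4xx codes because block pages
  // arrive with 200 as well as 403.
  for (const { kind, pattern } of BODY_SIGNALS) {
    if (pattern.test(body)) return kind;
  }
  if (status === 403) return "forbidden";
  if (status >= 400) return "client_error";
  if (body.trim().length === 0) return "empty_body";
  return "ok";
}
```

Logging the returned label next to the status code makes it easier to see whether a spike is rate limiting, blocking, or plain selector drift.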

When to consider Crawlee or a queue

This guide focuses on built-in fetch, Cheerio, and Playwright because they are easy to understand one piece at a time. Crawlee, BullMQ, Temporal, or a custom queue can be worth evaluating when your project starts needing crawler features:

  • URL scheduling and deduplication
  • Per-domain concurrency limits
  • Persistent queues
  • Built-in retry and redirect handling
  • Browser pool management
  • Item pipelines for validation and storage
  • Incremental crawls
  • Crawl depth controls
  • Large crawl observability

A framework has a steeper learning curve than a small script, but it pays off when the crawl itself becomes a long-lived system. A common path is to prototype extraction with Cheerio, then move the project into a crawler framework or queue once you know the target pages, item schema, and crawl rules.
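
To make the first few items in that list concrete, here is a minimal in-memory sketch of URL scheduling, deduplication, and a depth cap. Frontier and normalizeUrl are hypothetical names, and the normalization rules (drop the fragment, sort query parameters) are assumptions that will not suit every site. Crawler frameworks and queue systems typically persist this state and add retries on top.

```javascript
// Sketch: normalize URLs so trivially different forms dedupe to one entry.
function normalizeUrl(raw, base) {
  const url = new URL(raw, base);
  url.hash = ""; // fragments do not change the fetched document
  url.searchParams.sort(); // treat ?a=1&b=2 and ?b=2&a=1 as the same page
  return url.href;
}

class Frontier {
  constructor({ maxDepth = 2 } = {}) {
    this.maxDepth = maxDepth;
    this.seen = new Set();
    this.queue = [];
  }

  // Returns true when the URL was newly scheduled, false when skipped.
  add(raw, { base, depth = 0 } = {}) {
    if (depth > this.maxDepth) return false;
    const url = normalizeUrl(raw, base);
    if (this.seen.has(url)) return false;
    this.seen.add(url);
    this.queue.push({ url, depth });
    return true;
  }

  // Returns the next { url, depth } in FIFO order, or null when empty.
  next() {
    return this.queue.shift() ?? null;
  }
}

const frontier = new Frontier({ maxDepth: 1 });
frontier.add("https://example.com/products?b=2&a=1");
frontier.add("https://example.com/products?a=1&b=2#reviews"); // skipped as a duplicate
```

Once this object needs to survive a restart or be shared between workers, that is a reasonable signal to move to a persistent queue.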

When to use a scraping API

At some point, building scraper infrastructure becomes the job. Browser pools, proxy procurement, CAPTCHA handling, fingerprint consistency, retries, monitoring, and selector maintenance can consume more time than the data product you meant to build.

Build your own scraper when:

  • The target pages are simple and stable
  • You have permission or own the source
  • The extraction logic is highly custom
  • Cost per page matters enough to justify ongoing engineering work
  • Your team is comfortable operating crawlers

Use a managed scraping API when:

  • You need clean Markdown, HTML, screenshots, or structured output quickly
  • You do not want to manage browser infrastructure
  • The target sites change frequently
  • Scraping is an input to your product, not the product itself
  • Reliability matters more than owning every low-level detail

Context.dev provides Web APIs that return clean Markdown, raw rendered HTML, screenshots, search results, sitemap crawls, image manifests, and structured extraction results from public URLs. That is useful when the output is going into an LLM, a RAG index, an enrichment workflow, or an agent that needs current web context. Instead of fetching raw HTML, removing navigation, stripping cookie banners, rendering JavaScript, and compressing the result yourself, you can ask for Markdown that is already shaped for model input.

Useful endpoints from the current docs:

| Endpoint | Use it when you need |
| --- | --- |
| GET /web/scrape/markdown | Clean, LLM-ready Markdown from one URL |
| GET /web/scrape/html | Raw rendered HTML from one URL |
| GET /web/scrape/images | An image manifest from one URL |
| GET /web/scrape/sitemap | URLs discovered from a domain sitemap |
| POST /web/crawl | Markdown for multiple pages starting from one URL |
| POST /web/search | Web search results, optionally scraped to Markdown in the same call |
| GET /web/screenshot | A fresh viewport or full-page screenshot |
| POST /web/extract | Structured JSON from a website using your schema |
| GET /brand/retrieve | Brand profile data such as name, logos, colors, fonts, socials, and company metadata |

Here is the raw fetch version of a token-conscious Markdown scrape. useMainContentOnly removes page chrome where detectable, includeImages=false avoids image references, and shortenBase64Images=true prevents inline image payloads from dominating the response.

const apiKey = process.env.CONTEXT_DEV_API_KEY;
if (!apiKey) {
  throw new Error("Set CONTEXT_DEV_API_KEY before calling Context.dev");
}
 
const params = new URLSearchParams({
  url: "https://example.com",
  useMainContentOnly: "true",
  includeLinks: "true",
  includeImages: "false",
  shortenBase64Images: "true",
});
 
const response = await fetch(`https://api.context.dev/v1/web/scrape/markdown?${params}`, {
  headers: { authorization: `Bearer ${apiKey}` },
  signal: AbortSignal.timeout(30_000),
});
 
if (!response.ok) {
  throw new Error(`Context.dev request failed: ${response.status}`);
}
 
const data = await response.json();
console.log(data.markdown);

If you prefer the official SDK, install context.dev:

npm install context.dev

Then call the Web API from Node:

import ContextDev from "context.dev";
 
const client = new ContextDev({
  apiKey: process.env.CONTEXT_DEV_API_KEY,
});
 
const response = await client.web.webScrapeMd({
  url: "https://example.com",
  useMainContentOnly: true,
});
 
console.log(response.markdown);

For rendered HTML, switch to the HTML endpoint:

const apiKey = process.env.CONTEXT_DEV_API_KEY;
if (!apiKey) {
  throw new Error("Set CONTEXT_DEV_API_KEY before calling Context.dev");
}
 
const params = new URLSearchParams({
  url: "https://example.com",
  useMainContentOnly: "true",
});
 
const response = await fetch(`https://api.context.dev/v1/web/scrape/html?${params}`, {
  headers: { authorization: `Bearer ${apiKey}` },
  signal: AbortSignal.timeout(30_000),
});
 
if (!response.ok) {
  throw new Error(`Context.dev request failed: ${response.status}`);
}
 
const data = await response.json();
console.log(data.html.slice(0, 500));

For a small crawl, use POST /web/crawl and cap the job. The endpoint returns an array of pages with Markdown and metadata, which is the shape you usually want before embedding content for search or RAG.

const apiKey = process.env.CONTEXT_DEV_API_KEY;
if (!apiKey) {
  throw new Error("Set CONTEXT_DEV_API_KEY before calling Context.dev");
}
 
const response = await fetch("https://api.context.dev/v1/web/crawl", {
  method: "POST",
  headers: {
    authorization: `Bearer ${apiKey}`,
    "content-type": "application/json",
  },
  body: JSON.stringify({
    url: "https://example.com",
    maxPages: 3,
    maxDepth: 1,
    useMainContentOnly: true,
  }),
  signal: AbortSignal.timeout(90_000),
});
 
if (!response.ok) {
  throw new Error(`Context.dev crawl failed: ${response.status}`);
}
 
const data = await response.json();
for (const page of data.results) {
  console.log(page.metadata.url, page.markdown.length);
}

For agents, POST /web/search can search the web and scrape each result to Markdown in one round-trip:

const apiKey = process.env.CONTEXT_DEV_API_KEY;
if (!apiKey) {
  throw new Error("Set CONTEXT_DEV_API_KEY before calling Context.dev");
}
 
const response = await fetch("https://api.context.dev/v1/web/search", {
  method: "POST",
  headers: {
    authorization: `Bearer ${apiKey}`,
    "content-type": "application/json",
  },
  body: JSON.stringify({
    query: "Node.js web scraping robots.txt best practices",
    freshness: "last_year",
    markdownOptions: {
      enabled: true,
      useMainContentOnly: true,
      includeImages: false,
    },
  }),
  signal: AbortSignal.timeout(90_000),
});
 
if (!response.ok) {
  throw new Error(`Context.dev search failed: ${response.status}`);
}
 
const data = await response.json();
for (const result of data.results) {
  console.log(result.title, result.url, result.markdown.code);
}

You can also ask Context.dev to crawl a site and extract structured data into your schema:

const apiKey = process.env.CONTEXT_DEV_API_KEY;
if (!apiKey) {
  throw new Error("Set CONTEXT_DEV_API_KEY before calling Context.dev");
}
 
const response = await fetch("https://api.context.dev/v1/web/extract", {
  method: "POST",
  headers: {
    authorization: `Bearer ${apiKey}`,
    "content-type": "application/json",
  },
  body: JSON.stringify({
    url: "https://example.com",
    maxPages: 3,
    schema: {
      type: "object",
      properties: {
        company_name: { type: "string" },
        summary: { type: "string" },
        notable_links: {
          type: "array",
          items: {
            type: "object",
            properties: {
              title: { type: "string" },
              url: { type: "string" },
            },
            required: ["title", "url"],
            additionalProperties: false,
          },
        },
      },
      required: ["company_name", "summary", "notable_links"],
      additionalProperties: false,
    },
  }),
  signal: AbortSignal.timeout(90_000),
});
 
if (!response.ok) {
  throw new Error(`Context.dev extraction failed: ${response.status}`);
}
 
const data = await response.json();
console.log(data);
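
Even with a schema in the request, it may be worth checking the extracted object before it reaches storage. The sketch below mirrors the fields from the schema above. validateExtraction is a hypothetical helper, and because the exact response envelope is not shown here, it takes the extracted object directly rather than assuming where it sits inside data.

```javascript
// Relative URLs such as "/docs" fail this check because new URL() needs a base.
function isAbsoluteUrl(value) {
  try {
    new URL(value);
    return true;
  } catch {
    return false;
  }
}

// Hypothetical validator: returns a list of problems, empty when the object looks usable.
function validateExtraction(value) {
  if (value === null || typeof value !== "object") {
    return ["extraction is not an object"];
  }
  const errors = [];
  const isNonEmptyString = (v) => typeof v === "string" && v.trim().length > 0;

  if (!isNonEmptyString(value.company_name)) errors.push("company_name is missing or empty");
  if (!isNonEmptyString(value.summary)) errors.push("summary is missing or empty");

  if (!Array.isArray(value.notable_links)) {
    errors.push("notable_links is not an array");
  } else {
    value.notable_links.forEach((link, i) => {
      if (!isNonEmptyString(link?.title)) errors.push(`notable_links[${i}].title is missing`);
      if (!isAbsoluteUrl(link?.url)) errors.push(`notable_links[${i}].url is not an absolute URL`);
    });
  }
  return errors;
}
```

Returning a list of problems instead of throwing on the first one makes it easier to log every issue for a page in a single line.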

The pricing model lets you test before you pay: the free tier includes API credits and a Logo Link quota, so you can validate the workflow before committing to a paid plan. In production, the biggest advantage is not that one endpoint is magical. It is that Markdown conversion, browser rendering, caching, proxy escalation, screenshots, search, structured extraction, and brand data all sit behind the same authentication model and operational surface.

You can read the official Context.dev docs for scraping websites to Markdown, the Markdown endpoint, the HTML endpoint, web search, and structured website extraction.

A production checklist

Before you call a scraper production-ready, run through this checklist:

  • The target site's Terms, robots.txt, and data sensitivity have been reviewed.
  • Every HTTP request has a timeout.
  • Retries and backoff are configured intentionally.
  • Concurrency is bounded per host.
  • Raw HTML is saved for failed pages.
  • Parsers are tested against saved fixtures.
  • Extracted records have validation.
  • Storage is incremental and deduplicated.
  • Logs include URL, status, latency, byte count, and record count.
  • Alerts catch zero-record pages and spikes in 403, 429, or parser failures.
  • The scraper can resume after interruption.
  • Fetching, parsing, validation, and storage live in separate modules.
  • There is a clear owner for selector maintenance.

This checklist is intentionally boring. It prevents the common failure where a scraper works during a demo, then silently produces bad data for three weeks.
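
The "bounded per host" item is easy to state and easy to get subtly wrong, so here is one possible shape for it. This is a sketch under assumptions: createHostLimiter is a hypothetical helper, state lives in memory for a single process, and the default limit of 2 is arbitrary.

```javascript
// Sketch of a per-host limiter: at most `limit` tasks run at once for any one hostname.
function createHostLimiter(limit = 2) {
  const active = new Map(); // hostname -> number of running tasks
  const waiting = new Map(); // hostname -> resolvers waiting for a slot

  return async function runForHost(url, task) {
    const host = new URL(url).hostname;
    if ((active.get(host) ?? 0) >= limit) {
      // Park until a finishing task hands its slot over.
      await new Promise((resolve) => {
        if (!waiting.has(host)) waiting.set(host, []);
        waiting.get(host).push(resolve);
      });
    } else {
      active.set(host, (active.get(host) ?? 0) + 1);
    }
    try {
      return await task();
    } finally {
      const next = waiting.get(host)?.shift();
      if (next) {
        next(); // the slot passes straight to the next waiter, so the count is unchanged
      } else {
        active.set(host, active.get(host) - 1);
      }
    }
  };
}
```

A call such as runForHost(url, () => fetch(url)) then queues automatically when a host is busy, while requests to other hosts proceed independently.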

Frequently asked questions

What is the best Node library for web scraping?

For static HTML, start with built-in fetch and Cheerio. For many concurrent static pages, use fetch with a bounded worker queue. For JavaScript-rendered pages, use Playwright. For large crawls with queues, retries, browser pools, and persistence, evaluate Crawlee or a dedicated job queue. There is no single best library; there is a best fit for the page and operational needs.

Should I use Cheerio or JSDOM?

Cheerio is usually the better default for scraping static HTML because it is fast, small, and gives you a familiar selector API. JSDOM is useful when you need a more browser-like DOM implementation, but it still does not fully behave like a real browser. If the page needs JavaScript execution, use Playwright instead of trying to make a parser behave like Chromium.

How do I scrape a site that uses JavaScript?

First, inspect the Network tab for a JSON endpoint. If the data is available through a stable request that you are allowed to use, call that endpoint directly. If the page must run JavaScript or requires interaction, use Playwright, wait for a specific selector, then parse page.content() with Cheerio.

How do I avoid getting blocked?

The most reliable answer is to scrape politely and within the site's rules: reduce request volume, use reasonable delays, respect rate limits, cache pages, and prefer official APIs. If you receive a CAPTCHA, login wall, or explicit block, treat it as a stop signal unless you have permission and an approved access path.

Is Node faster than Python for scraping?

It depends on the bottleneck. For many HTTP-heavy jobs, Node's async model is ergonomic and efficient. For browser-heavy jobs, Chromium dominates the cost regardless of language. For parsing-heavy jobs, benchmark on real pages. The bigger reliability gains usually come from better crawl design, validation, caching, and observability rather than language choice.

Should I store scraped data as CSV, JSON, JSONL, or a database?

Use CSV for quick analysis and spreadsheet workflows. Use JSON for nested data and API handoff. Use JSONL for large append-only extraction jobs. Use a database when you need deduplication, incremental updates, resume behavior, multiple writers, or querying by downstream services.
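
JSONL is simple enough to sketch in a few lines, which is part of why it suits append-only jobs. The helper names below are hypothetical, and the example works on strings only so it stays independent of where you write the file.

```javascript
// One JSON document per line, so new records can be appended without rewriting earlier ones.
function toJsonlLine(record) {
  // JSON.stringify escapes newlines inside strings, so each record stays on one line.
  return JSON.stringify(record) + "\n";
}

// Parse JSONL text back into records, ignoring blank lines.
function parseJsonl(text) {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line));
}

const lines =
  toJsonlLine({ name: "Widget", price: 19.99 }) +
  toJsonlLine({ name: "Gadget", note: "ships in\ntwo boxes" });

console.log(parseJsonl(lines).length);
```

In a scraper, each line would typically be written with appendFile from node:fs/promises as soon as the record passes validation, so an interrupted run loses at most the record in flight.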

When should I stop maintaining my own scraper?

When maintenance becomes a regular tax. If you spend more time fixing browser infrastructure, rate limits, selectors, and retries than using the data, a managed API is probably cheaper. The same is true when scraping is only one input to your product and reliability matters more than owning the mechanics.

Wrapping up

Good Node web scraping is disciplined data engineering. Fetch slowly and explicitly, parse defensively, validate records, store incrementally, and make failures visible. Built-in fetch plus Cheerio can take you far when pages are simple. A bounded worker queue gives you controlled concurrency. Playwright handles pages that genuinely need a browser. Crawlee or a job queue becomes useful when the crawl itself turns into a system.

The real skill is knowing when not to add more machinery. Start with the simplest scraper that returns correct data. Add retries because you have measured temporary failures. Add concurrency because the job is network-bound. Add Playwright because the content is rendered client-side. Use a managed API when infrastructure is pulling attention away from the product you actually want to build.

That is where Context.dev fits cleanly. If your goal is to feed a model, agent, onboarding flow, company intelligence feature, or research pipeline, you usually do not need to own every browser process and parser. You need dependable web context in a format your application can use. Context.dev gives you token-efficient Markdown, raw rendered HTML, crawl results, web search with optional Markdown scraping, screenshots, structured extraction, and brand data through one API, with a free tier for testing before you scale.
