What is Puppeteer?

A Node.js library from the Chrome team that drives Chromium over the DevTools Protocol, used for scraping, screenshots, PDF generation, and headless testing.

Puppeteer launches a Chrome or Chromium instance (headless by default, optionally headed) and gives JavaScript code direct access to the same control surface Chrome's DevTools uses. You can navigate to URLs, evaluate scripts inside the page, intercept network requests, capture screenshots, generate PDFs, and pull the rendered DOM out as HTML. Because it speaks the DevTools Protocol directly instead of going through a separate WebDriver server, Puppeteer typically has less per-command overhead and exposes lower-level hooks (request interception, tracing, code coverage) than classic WebDriver-based tools such as Selenium.
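A minimal sketch of those calls in one place, assuming the puppeteer package is installed. The function name snapshot, the shouldBlock helper, and the blocked-host list are illustrative choices, not part of Puppeteer's API.

```javascript
// Hosts whose requests are dropped. Illustrative list, not a Puppeteer default.
const BLOCKED_HOSTS = new Set([
  "www.google-analytics.com",
  "www.googletagmanager.com",
]);

// Pure helper: decide whether a request URL should be aborted.
function shouldBlock(requestUrl) {
  return BLOCKED_HOSTS.has(new URL(requestUrl).hostname);
}

// Hypothetical helper: loads `url`, writes a screenshot and a PDF,
// and returns the page title plus the rendered HTML.
async function snapshot(url, outPrefix = "page") {
  // Imported lazily so shouldBlock() can be reused without launching a browser.
  const { default: puppeteer } = await import("puppeteer");
  const browser = await puppeteer.launch(); // headless unless { headless: false }
  try {
    const page = await browser.newPage();

    // Intercept network requests: each one must be aborted or continued.
    await page.setRequestInterception(true);
    page.on("request", (request) => {
      if (shouldBlock(request.url())) request.abort();
      else request.continue();
    });

    // Navigate and wait until the network is nearly idle.
    await page.goto(url, { waitUntil: "networkidle2" });

    // Evaluate a script inside the page.
    const title = await page.evaluate(() => document.title);

    // Capture a viewport screenshot and a PDF of the page.
    await page.screenshot({ path: `${outPrefix}.png` });
    await page.pdf({ path: `${outPrefix}.pdf`, format: "A4" });

    // Pull the rendered DOM out as HTML.
    const html = await page.content();
    return { title, html };
  } finally {
    await browser.close();
  }
}
```

Calling snapshot("https://example.com", "example") would write example.png and example.pdf and resolve with the title and HTML. The try/finally keeps a failed navigation from leaving a browser process behind.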

For brand-data use cases, Puppeteer is a workhorse. Rendering a page above the fold for a screenshot, waiting for a logo image to load before extracting its src, or grabbing the post-hydration HTML of a Next.js app are all one or two API calls. Stealth plugins (puppeteer-extra-plugin-stealth) patch over the obvious automation tells, which buys headroom against anti-bot vendors before you have to escalate to residential proxies.
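As a rough illustration of those brand-data calls, the sketch below waits for a logo image to finish loading, reads its src, takes an above-the-fold screenshot, and returns the post-hydration HTML. The selector "header img", the viewport size, the output file name, and the function names are assumptions made for the example.

```javascript
// Runs inside the page (Puppeteer serializes the function),
// so it may only rely on browser globals such as `document`.
function logoHasLoaded(selector) {
  const img = document.querySelector(selector);
  return Boolean(img && img.complete && img.naturalWidth > 0);
}

// Hypothetical helper; "header img" is a guess at where a logo usually lives.
async function captureBrandAssets(url, logoSelector = "header img") {
  const { default: puppeteer } = await import("puppeteer");
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // The viewport size defines what counts as "above the fold".
    await page.setViewport({ width: 1280, height: 800 });
    await page.goto(url, { waitUntil: "networkidle2" });

    // Block until the logo has actually decoded, or time out after 10 s.
    await page.waitForFunction(logoHasLoaded, { timeout: 10000 }, logoSelector);
    const logoSrc = await page.$eval(
      logoSelector,
      (img) => img.currentSrc || img.src
    );

    // Without fullPage, the screenshot covers only the viewport.
    await page.screenshot({ path: "above-the-fold.png" });

    // HTML after client-side rendering and hydration have run.
    const html = await page.content();
    return { logoSrc, html };
  } finally {
    await browser.close();
  }
}
```

Checking naturalWidth as well as complete matters because a broken image also reports complete as true.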

The main tradeoffs versus Playwright are scope and ergonomics. Puppeteer targets Chrome and Chromium first; Firefox is supported through WebDriver BiDi (officially since Puppeteer 23), but WebKit is not, and Puppeteer has fewer built-in waits, so you write more glue code. Playwright covers Chromium, Firefox, and WebKit with a cleaner API. If you only need Chrome and the team is already on Node, Puppeteer is the lighter dependency.

In the wild

→Generating PDF invoices from an HTML template at request time
→Capturing a full-page screenshot of every URL in a sitemap
→Scraping product detail pages that lazy-load images on scroll
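Two of the cases above, sitemap screenshots and lazy-loaded images, can be combined in one sketch. It assumes Node 18 or later for the global fetch and a flat sitemap with plain loc entries; all function names and the file-naming scheme are made up for illustration.

```javascript
// A regex is enough for a sketch. Real sitemaps may need an XML parser
// (escaped entities, sitemap index files).
function extractSitemapUrls(xml) {
  return [...xml.matchAll(/<loc>\s*([^<\s]+)\s*<\/loc>/g)].map((m) => m[1]);
}

// Turn a URL into a safe file name, e.g. example-com-pricing.png.
function fileNameFor(url) {
  const { hostname, pathname } = new URL(url);
  const slug = (hostname + pathname)
    .replace(/[^a-z0-9]+/gi, "-")
    .replace(/^-+|-+$/g, "");
  return `${slug}.png`;
}

// Scroll one viewport at a time so lazy-loaded images get requested.
// The step cap guards against infinitely growing pages.
async function scrollToBottom(page, maxSteps = 50) {
  await page.evaluate(async (limit) => {
    const step = window.innerHeight;
    let y = 0;
    for (let i = 0; i < limit && y < document.body.scrollHeight; i++) {
      window.scrollTo(0, y);
      await new Promise((resolve) => setTimeout(resolve, 200));
      y += step;
    }
  }, maxSteps);
}

// Hypothetical driver: one full-page screenshot per sitemap URL.
async function screenshotSitemap(sitemapUrl, outDir = ".") {
  const { default: puppeteer } = await import("puppeteer");
  const xml = await (await fetch(sitemapUrl)).text();
  const urls = extractSitemapUrls(xml);
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    for (const url of urls) {
      await page.goto(url, { waitUntil: "networkidle2" });
      await scrollToBottom(page);
      await page.screenshot({
        path: `${outDir}/${fileNameFor(url)}`,
        fullPage: true,
      });
    }
  } finally {
    await browser.close();
  }
  return urls.length;
}
```

Reusing a single page keeps the example short; a larger crawl would likely run several pages in parallel and add per-URL error handling.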

How Brand.dev uses Puppeteer

Endpoints in the Brand.dev API where this concept comes up directly.

→Web Scrape HTML API
→Screenshot API
→Markdown Scrape API

FAQ

Puppeteer vs Playwright?

Puppeteer is Chrome-focused and lighter; Playwright covers Chromium, Firefox, and WebKit, has better auto-wait semantics, and ships official bindings for more languages (Python, Java, and .NET in addition to JavaScript). Many new scraping projects choose Playwright unless they specifically need a smaller surface area.

Can Puppeteer scrape sites that block bots?

Out of the box, Puppeteer is detectable: navigator.webdriver is true, and headless mode leaks further signals. With puppeteer-extra-plugin-stealth and rotating residential proxies you can get past many anti-bot stacks; bypassing Cloudflare Turnstile or PerimeterX is harder.
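To see what a site can observe, the sketch below reads two common signals from inside the page, with or without the stealth plugin. It assumes puppeteer, puppeteer-extra, and puppeteer-extra-plugin-stealth are installed; automationTells, loadPuppeteer, and probe are hypothetical names, and the two checks are only a small sample of what real anti-bot vendors inspect.

```javascript
// Pure helper: list the automation tells present in a fingerprint.
// Headless Chrome has historically advertised "HeadlessChrome" in its
// user agent; treat this check as illustrative, not exhaustive.
function automationTells({ webdriver, userAgent }) {
  const tells = [];
  if (webdriver === true) tells.push("navigator.webdriver is true");
  if (/HeadlessChrome/.test(userAgent)) {
    tells.push("user agent contains HeadlessChrome");
  }
  return tells;
}

// Load plain Puppeteer, or puppeteer-extra with the stealth plugin applied.
async function loadPuppeteer(stealth) {
  if (!stealth) return (await import("puppeteer")).default;
  const puppeteerExtra = (await import("puppeteer-extra")).default;
  const StealthPlugin = (await import("puppeteer-extra-plugin-stealth")).default;
  puppeteerExtra.use(StealthPlugin());
  return puppeteerExtra;
}

// Hypothetical probe: open `url` and report which tells are visible to it.
async function probe(url, { stealth = false } = {}) {
  const puppeteer = await loadPuppeteer(stealth);
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);
    const fingerprint = await page.evaluate(() => ({
      webdriver: navigator.webdriver,
      userAgent: navigator.userAgent,
    }));
    return automationTells(fingerprint);
  } finally {
    await browser.close();
  }
}
```

Running probe(url) and probe(url, { stealth: true }) against the same page and comparing the two lists gives a quick view of what the plugin patches in your environment.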

Does Puppeteer support Firefox?

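Yes. Firefox support was experimental for years, but since Puppeteer 23 it is officially supported through the WebDriver BiDi protocol. Some Chrome-specific features that depend on the DevTools Protocol are not available on Firefox, so check the APIs you rely on. If you also need WebKit, use Playwright.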

Related terms

Playwright

A Microsoft-maintained library for driving Chrome, Firefox, and WebKit headlessly with a unified API.

Headless Browser

A real browser engine running without a visible UI, controlled programmatically through an automation API.

Web Scraping

Programmatically extracting structured data from websites that were designed to be read by humans.

Web Crawler

A program that systematically follows links between web pages to discover and index content at scale.

DOM

The Document Object Model, a tree of objects that represents an HTML document in memory and lets JavaScript manipulate it.
