Web Scraping with Python: The Complete 2026 Guide

Python is still the most practical language for web scraping in 2026. The ecosystem is mature, the libraries are stable, and you can move from a five-line script to a production crawler without changing languages. A good Python scraper can fetch simple HTML with requests, parse messy markup with BeautifulSoup, run concurrent jobs with httpx, render JavaScript pages with Playwright, validate extracted records, and store the result in a format your product can actually use.

The hard part is not making one request. The hard part is building a scraper that keeps working after the target site changes its HTML, slows you down, returns a CAPTCHA page, moves content into an API response, or ships a new React bundle. This guide focuses on the whole workflow, not just the first happy path.

By the end, you will have a practical mental model for:

  • Choosing the right Python scraping tool for the target page
  • Fetching pages safely with timeouts, headers, sessions, and retries
  • Parsing HTML with selectors that are readable and resilient
  • Handling pagination without duplicate requests
  • Using httpx for bounded async concurrency
  • Rendering JavaScript-heavy pages with Playwright when a browser is genuinely needed
  • Storing results in CSV, JSON, or SQLite
  • Debugging common failures like 403, 429, empty HTML, and selector drift
  • Deciding when a managed scraping API is a better use of engineering time

Start with the legal and ethical baseline

Web scraping can be legitimate, but "publicly visible" does not mean "free to collect however you want." Before you build anything production-facing, check the target site's Terms of Service, its robots.txt, the sensitivity of the data, and the amount of load your scraper will create. If you are collecting personal data, regulated data, or data behind authentication, involve legal review instead of treating scraping as a purely technical problem.

A practical baseline:

  • Prefer official APIs when they exist. They are usually more stable, documented, and contractually clear.
  • Read robots.txt. It is available at paths like https://example.com/robots.txt. It is not a complete legal framework, but it is a clear operational signal from the site owner.
  • Identify your client honestly when appropriate. For internal crawlers, a descriptive user agent and contact URL can make abuse reports easier to resolve.
  • Throttle requests. A slow scraper that runs reliably is better than an aggressive scraper that trips protection systems and creates load.
  • Avoid authentication walls and paywalls unless you have permission. Do not treat CAPTCHA pages, login requirements, or explicit blocks as puzzles to defeat.
  • Minimize data collection. Collect the fields you need, not every page and every attribute just because it is there.

The safest technical decision is often the simplest one: scrape less, request slowly, cache aggressively, and use an official API when the site provides one.

Choose the simplest tool that works

Most Python scraping projects get overcomplicated early. Start with the cheapest tool that can return the data correctly.

Target page | Recommended tool | Why
Static HTML page | requests plus BeautifulSoup | Fast, simple, cheap, easy to debug
Static pages at moderate volume | httpx plus BeautifulSoup | Async I/O and connection pooling
Site with crawl rules, queues, item pipelines | Scrapy | Built-in scheduler, retries, pipelines, and middleware
JavaScript-rendered page | Playwright | Runs a real browser and returns the rendered DOM
Heavily protected or frequently changing sites | Managed scraping API | Offloads browsers, proxies, retries, and extraction infrastructure

Use browser automation only when the page truly requires it. A headless browser is powerful, but it is also slower, heavier, and more expensive to run than a normal HTTP request. Before reaching for Playwright, open DevTools, check the Network tab, and see whether the page is loading the data from a stable JSON endpoint. Calling that endpoint directly is usually faster and easier to maintain than parsing a rendered DOM.

Set up a clean Python project

Use Python 3.12 or newer for new scraping projects. Python 3.14 is current in 2026, but 3.12 remains a comfortable baseline for teams that care about library compatibility and deployment targets.

Create a fresh virtual environment:

python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip

On Windows PowerShell, activate the environment with:

.venv\Scripts\Activate.ps1

Install the core stack:

pip install requests beautifulsoup4 lxml httpx playwright pydantic
playwright install chromium

For repeatable installs, pin versions after you have tested the scraper:

pip freeze > requirements.txt

Do not copy old pinned versions from a blog post and assume they are current. Let your own environment produce the pinned requirements file, then update dependencies intentionally.

A small project structure is enough for most scrapers:

scraper/
  __init__.py
  fetch.py
  parse.py
  models.py
  storage.py
  run.py
data/
  raw/
  processed/
requirements.txt

That separation matters once the scraper grows. Fetching, parsing, validation, and storage fail in different ways, so keep them in different modules. It also makes tests far easier: save one HTML fixture, test the parser against it, and avoid hitting the live site on every test run.

Fetch HTML with requests

requests is the right default for a first scraper. It is synchronous, readable, and good enough for a surprising amount of production work.

import requests
 
URL = "https://books.toscrape.com/catalogue/category/books/mystery_3/index.html"
 
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}
 
response = requests.get(URL, headers=HEADERS, timeout=15)
response.raise_for_status()
 
html = response.text
print(response.status_code)
print(len(html))

Three habits make simple scrapers much less fragile:

  • Always set a timeout. A hanging socket should not freeze your job forever.
  • Always call raise_for_status(). Silent 404 and 500 responses create confusing parser errors later.
  • Send a realistic User-Agent. The default Python client identity is often treated as low-quality automated traffic.

Query parameters should go through params, not string concatenation:

params = {"q": "laptop stand", "page": 2}
response = requests.get("https://example.com/search", params=params, headers=HEADERS, timeout=15)

Let the HTTP client encode the URL. It prevents subtle bugs with spaces, symbols, and repeated parameters.
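
To see what that encoding involves, here is a small standard-library sketch. It uses urllib.parse.urlencode rather than requests itself, so treat it as an illustration of the escaping rather than a description of the client's internals. The search URL and the parameter names are placeholders.

```python
from urllib.parse import urlencode

# Placeholder parameters: a value with spaces and a symbol, plus a repeated key.
params = {"q": "laptop stand & riser", "page": 2, "tag": ["usb-c", "metal"]}

# doseq=True expands the list into repeated tag=... pairs instead of
# encoding the Python list's string representation.
query = urlencode(params, doseq=True)
url = f"https://example.com/search?{query}"
print(url)
```

Building the same string by hand would require remembering to escape the ampersand inside the search term, which is exactly the kind of bug that only shows up on unusual input.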

Reuse sessions and configure retries

If you make more than one request to the same host, use a Session. It keeps TCP connections open, persists cookies, and gives you one place to configure headers.

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
 
def build_session() -> requests.Session:
    retry = Retry(
        total=3,
        connect=3,
        read=3,
        status=3,
        backoff_factor=0.5,
        status_forcelist=(429, 500, 502, 503, 504),
        allowed_methods=("GET", "HEAD"),
        respect_retry_after_header=True,
    )
 
    adapter = HTTPAdapter(max_retries=retry, pool_connections=20, pool_maxsize=20)
 
    session = requests.Session()
    session.headers.update(HEADERS)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
 
session = build_session()
response = session.get(URL, timeout=15)
response.raise_for_status()

Retries should be conservative. Retrying a temporary 503 is reasonable. Retrying a 403 fifty times is not. When a site tells you to slow down with 429 Too Many Requests or a Retry-After header, slow down.
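
If you ever handle retries by hand instead of through the Retry object, the same rule can be written as a small decision function. This is a minimal sketch: the name should_retry, the status set, and the default of three attempts are illustrative choices, not part of requests or urllib3.

```python
# Statuses that usually indicate a temporary condition worth retrying.
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}

def should_retry(status_code: int, attempts_made: int, max_attempts: int = 3) -> bool:
    """Return True only for transient statuses while attempts remain."""
    if attempts_made >= max_attempts:
        return False
    return status_code in RETRYABLE_STATUSES

print(should_retry(503, attempts_made=1))  # temporary server error
print(should_retry(403, attempts_made=0))  # access denied: retrying will not help
print(should_retry(503, attempts_made=3))  # attempt budget exhausted
```

Keeping 403 out of the retryable set encodes the point above: a denial is a signal to stop and investigate, not to try again.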

Check robots.txt before crawling

For a one-off page fetch, manually reading the site's policy may be enough. For a crawler, automate the check. Python includes urllib.robotparser, which can answer whether a user agent is allowed to fetch a URL according to the site's robots.txt.

from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser
 
def can_fetch(url: str, user_agent: str = "*") -> bool:
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
 
    parser = RobotFileParser()
    parser.set_url(robots_url)
    parser.read()
    return parser.can_fetch(user_agent, url)
 
target = "https://books.toscrape.com/catalogue/page-1.html"
if not can_fetch(target):
    raise RuntimeError(f"robots.txt disallows scraping {target}")

Treat this as one input, not a complete decision engine. robots.txt can be missing, stale, or broad. Terms of Service, rate limits, account agreements, privacy rules, and common sense still matter.
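
For tests, it helps to exercise the same parser without touching the network. RobotFileParser.parse() accepts the file's lines directly, so a sketch like the following can run offline. The rules below are made up for illustration; a real file comes from the target site.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, used only to show how the parser answers.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

allowed_listing = parser.can_fetch("*", "https://example.com/catalogue/page-1.html")
allowed_private = parser.can_fetch("*", "https://example.com/private/report.html")
delay = parser.crawl_delay("*")

print(allowed_listing, allowed_private, delay)
```

The crawl_delay() value, when a site declares one, is a reasonable lower bound for the pause between requests.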

Parse HTML with BeautifulSoup

Once you have HTML, BeautifulSoup gives you a forgiving tree API. Use the lxml parser for speed and tolerance of imperfect markup.

from bs4 import BeautifulSoup
 
soup = BeautifulSoup(html, "lxml")
 
title = soup.find("h1")
print(title.get_text(strip=True) if title else "No title")
 
links = soup.find_all("a", href=True)
for link in links[:5]:
    print(link["href"], link.get_text(" ", strip=True))
 
price = soup.select_one("p.price_color")
print(price.get_text(strip=True) if price else "No price")

For production scrapers, prefer small helper functions over inline parsing everywhere. They make missing elements explicit and keep parser code readable.

from bs4 import Tag
 
def text_or_none(node: Tag | None) -> str | None:
    if node is None:
        return None
    text = node.get_text(" ", strip=True)
    return text or None
 
def attr_or_none(node: Tag | None, name: str) -> str | None:
    if node is None:
        return None
    value = node.get(name)
    return value if isinstance(value, str) and value else None

Now missing fields are a normal case, not a surprise AttributeError.

Build a realistic product scraper

The sandbox site books.toscrape.com is useful because it behaves like a small ecommerce listing without creating load on a real retailer. Here is a parser that extracts typed records from one listing page.

import re
from dataclasses import dataclass
from decimal import Decimal
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://books.toscrape.com/catalogue/"
RATING_MAP = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

@dataclass(frozen=True)
class Book:
    title: str
    price: Decimal
    rating: int
    url: str

def parse_price(raw: str) -> Decimal:
    # Keep only digits and the decimal point, so a currency symbol (or a
    # mis-decoded one such as "Â£") cannot break the Decimal conversion.
    cleaned = re.sub(r"[^\d.]", "", raw)
    return Decimal(cleaned)
 
def parse_books(html: str, page_url: str) -> list[Book]:
    soup = BeautifulSoup(html, "lxml")
    books: list[Book] = []
 
    for article in soup.select("article.product_pod"):
        title_node = article.select_one("h3 a")
        price_node = article.select_one("p.price_color")
        rating_node = article.select_one("p.star-rating")
 
        if not title_node or not price_node or not rating_node:
            continue
 
        title = title_node.get("title", "").strip()
        href = title_node.get("href", "")
        rating_classes = rating_node.get("class", [])
        rating_name = next((c for c in rating_classes if c in RATING_MAP), None)
 
        if not title or not href or rating_name is None:
            continue
 
        books.append(
            Book(
                title=title,
                price=parse_price(price_node.get_text(strip=True)),
                rating=RATING_MAP[rating_name],
                url=urljoin(page_url, href),
            )
        )
 
    return books

There are a few intentional choices here:

  • Decimal is better than float for prices.
  • urljoin handles relative links correctly.
  • Missing nodes cause the item to be skipped instead of crashing the whole job.
  • The parser accepts html and page_url, which makes it easy to test with saved fixtures.

For stricter data quality, you can collect skipped records and log why they were skipped. In production, a sudden jump in skipped items is often the first sign that the target site's HTML changed.
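
One way to make that visible is to count skip reasons per page and compare the skip rate against a threshold. This is a minimal sketch: the reason labels, the skip_rate helper, and the 10% threshold are assumptions for illustration, not something the parser above already produces.

```python
from collections import Counter

def skip_rate(parsed_count: int, skipped: Counter) -> float:
    """Fraction of cards on a page that were skipped."""
    skipped_count = sum(skipped.values())
    total = parsed_count + skipped_count
    return skipped_count / total if total else 0.0

# Hypothetical tallies a parser could record while walking one listing page.
skipped = Counter()
skipped["missing_price"] += 3
skipped["missing_title"] += 1

rate = skip_rate(parsed_count=16, skipped=skipped)
top_reason, top_count = skipped.most_common(1)[0]

if rate > 0.10:
    print(f"skip rate {rate:.0%}, most common reason: {top_reason} ({top_count})")
```

Logging the most common reason alongside the rate usually points straight at the selector that drifted.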

Handle pagination without duplicate fetches

Many tutorials accidentally fetch the same page twice: once to parse items and once to find the next link. Do both from the same HTML response.

import time
from collections.abc import Iterator
from urllib.parse import urljoin
 
def fetch_html(session: requests.Session, url: str) -> str:
    response = session.get(url, timeout=15)
    response.raise_for_status()
    return response.text
 
def find_next_page(html: str, page_url: str) -> str | None:
    soup = BeautifulSoup(html, "lxml")
    next_link = soup.select_one("li.next a[href]")
    if not next_link:
        return None
    return urljoin(page_url, next_link["href"])
 
def crawl_books(start_url: str, delay_seconds: float = 1.0) -> Iterator[Book]:
    session = build_session()
    page_url: str | None = start_url
 
    while page_url:
        html = fetch_html(session, page_url)
        yield from parse_books(html, page_url)
        page_url = find_next_page(html, page_url)
        time.sleep(delay_seconds)
 
books = list(crawl_books("https://books.toscrape.com/catalogue/page-1.html"))
print(f"Scraped {len(books)} books")

For numbered pagination, a range loop is fine when you know the bounds:

for page_number in range(1, 51):
    url = f"https://example.com/products?page={page_number}"
    html = fetch_html(session, url)
    items = parse_products(html, url)

When you do not know the last page, follow the next link until it disappears, or parse the last page number from the pagination controls on the first page.

Use httpx for bounded async scraping

Async scraping is useful when you have many independent URLs and each request spends most of its time waiting on the network. The key word is bounded. asyncio.gather() over 10,000 URLs can overwhelm your machine and the target site. Use a semaphore or a worker queue to keep concurrency under control.

import asyncio
import httpx
 
CONCURRENCY = 8
 
async def fetch_one(client: httpx.AsyncClient, url: str, semaphore: asyncio.Semaphore) -> str:
    async with semaphore:
        response = await client.get(url, follow_redirects=True)
        response.raise_for_status()
        return response.text
 
async def fetch_many(urls: list[str]) -> list[str]:
    timeout = httpx.Timeout(connect=5.0, read=20.0, write=5.0, pool=5.0)
    limits = httpx.Limits(max_connections=20, max_keepalive_connections=10)
    semaphore = asyncio.Semaphore(CONCURRENCY)
 
    async with httpx.AsyncClient(headers=HEADERS, timeout=timeout, limits=limits) as client:
        tasks = [fetch_one(client, url, semaphore) for url in urls]
        return await asyncio.gather(*tasks)
 
urls = [f"https://books.toscrape.com/catalogue/page-{i}.html" for i in range(1, 4)]
html_pages = asyncio.run(fetch_many(urls))

Start with low concurrency, measure success rate, then increase slowly. If throughput rises but error rates climb, you are not making the scraper better. You are just making it louder.
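
The worker-queue alternative to a semaphore can be sketched with the standard library alone. In this version a fixed number of workers pull URLs from an asyncio.Queue, so concurrency is capped by the worker count. The fake_fetch coroutine is a stand-in so the example runs offline; in a real scraper it would be an httpx call.

```python
import asyncio

async def fake_fetch(url: str) -> str:
    # Stand-in for a real HTTP request so the sketch needs no network.
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"

async def worker(queue: asyncio.Queue, results: dict[str, str]) -> None:
    while True:
        url = await queue.get()
        try:
            results[url] = await fake_fetch(url)
        finally:
            # Always mark the item done so queue.join() can finish.
            queue.task_done()

async def crawl(urls: list[str], workers: int = 4) -> dict[str, str]:
    queue: asyncio.Queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)

    results: dict[str, str] = {}
    tasks = [asyncio.create_task(worker(queue, results)) for _ in range(workers)]

    await queue.join()  # wait until every queued URL has been processed
    for task in tasks:
        task.cancel()   # workers loop forever, so stop them explicitly
    await asyncio.gather(*tasks, return_exceptions=True)
    return results

pages = asyncio.run(crawl([f"https://example.com/page-{i}" for i in range(10)]))
print(len(pages))
```

A queue also makes it easy to add newly discovered URLs while the crawl is running, which a fixed list passed to gather() does not.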

For serious async pipelines, return structured results instead of raw strings:

from dataclasses import dataclass
 
@dataclass
class FetchResult:
    url: str
    status_code: int | None
    html: str | None
    error: str | None
 
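async def fetch_result(client: httpx.AsyncClient, url: str, semaphore: asyncio.Semaphore) -> FetchResult:
    async with semaphore:
        try:
            response = await client.get(url, follow_redirects=True)
            response.raise_for_status()
            return FetchResult(url=url, status_code=response.status_code, html=response.text, error=None)
        except httpx.HTTPStatusError as exc:
            # The server answered, so keep its status code for later triage.
            return FetchResult(url=url, status_code=exc.response.status_code, html=None, error=repr(exc))
        except Exception as exc:
            # Timeouts, connection errors, and anything else with no response.
            return FetchResult(url=url, status_code=None, html=None, error=repr(exc))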

That shape lets the rest of the pipeline continue even when a few URLs fail.

Use Playwright for JavaScript-rendered pages

If requests.get(url).text returns a mostly empty shell but your browser shows real content, the page may be rendering data with JavaScript. Playwright runs Chromium, Firefox, or WebKit programmatically, so you can wait for the rendered DOM and then parse it.

from playwright.sync_api import sync_playwright
 
def render_html(url: str, selector: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(
            user_agent=HEADERS["User-Agent"],
            viewport={"width": 1280, "height": 800},
        )
        page = context.new_page()
        page.goto(url, wait_until="domcontentloaded", timeout=30_000)
        page.wait_for_selector(selector, timeout=15_000)
        html = page.content()
        browser.close()
        return html
 
html = render_html("https://example.com/products", ".product-card")

Prefer waiting for a specific selector over waiting for vague page states. The selector says, "the data I need is present." A generic network idle wait can be flaky on pages with analytics, ads, live chat, or long-polling requests.

Playwright is also useful when you need interaction:

from playwright.sync_api import Page
 
def load_more_products(page: Page) -> None:
    while True:
        button = page.get_by_role("button", name="Load more")
        if not button.count():
            return
        button.click()
        page.wait_for_timeout(750)

Use browser automation sparingly. Running hundreds of browser contexts costs memory and CPU. If you only need one JSON response that the page fetches after load, call that endpoint directly instead.

Inspect the Network tab before parsing the DOM

Many "JavaScript scraping" tasks are really API discovery tasks. Open DevTools, refresh the page, and filter Network requests by Fetch/XHR. Look for JSON responses that contain the data you need. If you find one, you can often replace a fragile browser scraper with a simple HTTP request.

import requests
 
api_url = "https://example.com/api/products"
params = {"page": 1, "category": "chairs"}
 
response = requests.get(api_url, params=params, headers=HEADERS, timeout=15)
response.raise_for_status()
data = response.json()
 
for item in data["products"]:
    print(item["name"], item["price"])

Be careful here. An endpoint being visible in DevTools does not mean it is open for unrestricted use. Check whether the request depends on authentication, CSRF tokens, signed URLs, or Terms of Service restrictions. The right conclusion might be "ask for API access", not "copy every private request header."

Validate extracted data

Scrapers fail quietly when you let every field be a string. A price becomes "Sold out", a date becomes "Coming soon", or a selector starts returning a promotional badge instead of a product title. Add validation close to the parser.

pydantic is useful when records will move into an application or data pipeline:

from decimal import Decimal
from pydantic import BaseModel, HttpUrl, field_validator
 
class ProductRecord(BaseModel):
    title: str
    price: Decimal
    rating: int
    url: HttpUrl
 
    @field_validator("title")
    @classmethod
    def title_must_not_be_blank(cls, value: str) -> str:
        value = value.strip()
        if not value:
            raise ValueError("title is blank")
        return value
 
    @field_validator("rating")
    @classmethod
    def rating_must_be_valid(cls, value: int) -> int:
        if value < 0 or value > 5:
            raise ValueError("rating must be between 0 and 5")
        return value

Validation should not make your scraper brittle. It should make failures visible. Store invalid records separately with the source URL, raw value, and error message so you can fix the parser without guessing.

Store results in the right format

CSV is still the easiest format for spreadsheets and quick inspection:

import csv
from dataclasses import asdict
from pathlib import Path
 
def save_books_csv(books: list[Book], path: str) -> None:
    output = Path(path)
    output.parent.mkdir(parents=True, exist_ok=True)
 
    with output.open("w", newline="", encoding="utf-8") as file:
        writer = csv.DictWriter(file, fieldnames=["title", "price", "rating", "url"])
        writer.writeheader()
        for book in books:
            row = asdict(book)
            row["price"] = str(book.price)
            writer.writerow(row)

JSON is better for nested records and API handoff:

import json
 
def save_books_json(books: list[Book], path: str) -> None:
    rows = []
    for book in books:
        row = asdict(book)
        row["price"] = str(book.price)
        rows.append(row)
 
    with open(path, "w", encoding="utf-8") as file:
        json.dump(rows, file, ensure_ascii=False, indent=2)

SQLite is the sweet spot when you need deduplication, incremental updates, and simple querying without running a database server:

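import sqlite3
from pathlib import Path

def init_db(path: str = "data/books.sqlite") -> sqlite3.Connection:
    # Create the parent directory first; sqlite3.connect() does not create it.
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(path)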
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS books (
            url TEXT PRIMARY KEY,
            title TEXT NOT NULL,
            price TEXT NOT NULL,
            rating INTEGER NOT NULL,
            scraped_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        )
        """
    )
    return conn
 
def upsert_books(conn: sqlite3.Connection, books: list[Book]) -> None:
    conn.executemany(
        """
        INSERT INTO books (url, title, price, rating)
        VALUES (?, ?, ?, ?)
        ON CONFLICT(url) DO UPDATE SET
            title = excluded.title,
            price = excluded.price,
            rating = excluded.rating,
            scraped_at = CURRENT_TIMESTAMP
        """,
        [(book.url, book.title, str(book.price), book.rating) for book in books],
    )
    conn.commit()

For large jobs, write incrementally. Holding every record in memory is fine for 500 rows and foolish for 5 million. Stream pages through parse, validate, and store steps as soon as each page finishes.
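
A minimal sketch of incremental writing, assuming records arrive from a generator: rows are grouped into fixed-size batches and each batch is committed before the next one is read. The items table, the batched helper, and the batch size are illustrative, and an in-memory database keeps the example self-contained.

```python
import sqlite3
from collections.abc import Iterable, Iterator
from itertools import islice

def batched(rows: Iterable[tuple], size: int) -> Iterator[list[tuple]]:
    """Yield lists of at most `size` rows without loading everything at once."""
    iterator = iter(rows)
    while batch := list(islice(iterator, size)):
        yield batch

def store_incrementally(conn: sqlite3.Connection, rows: Iterable[tuple], batch_size: int = 500) -> int:
    conn.execute("CREATE TABLE IF NOT EXISTS items (url TEXT PRIMARY KEY, title TEXT NOT NULL)")
    written = 0
    for batch in batched(rows, batch_size):
        conn.executemany("INSERT OR REPLACE INTO items (url, title) VALUES (?, ?)", batch)
        conn.commit()  # each batch is durable before the next one is read
        written += len(batch)
    return written

# A generator stands in for records streaming out of the parser.
rows = ((f"https://example.com/item/{i}", f"Item {i}") for i in range(1200))
conn = sqlite3.connect(":memory:")
written = store_incrementally(conn, rows, batch_size=500)
count = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
print(written, count)
```

If the job is interrupted, everything up to the last committed batch is already stored, so memory use and lost work both stay bounded by the batch size.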

Test parsers with saved fixtures

The fetcher talks to the network. The parser should not need to. Save representative HTML files and test extraction against those fixtures. This catches selector drift, protects you from accidental parser regressions, and lets you work locally without hammering the target site.

A simple fixture workflow:

from pathlib import Path
 
def load_fixture(name: str) -> str:
    return Path("tests/fixtures").joinpath(name).read_text(encoding="utf-8")
 
def test_parse_books_page() -> None:
    html = load_fixture("books-page-1.html")
    books = parse_books(html, "https://books.toscrape.com/catalogue/page-1.html")
 
    assert len(books) == 20
    assert books[0].title
    assert books[0].url.startswith("https://books.toscrape.com/")
    assert 0 <= books[0].rating <= 5

The most valuable fixtures are not perfect pages. Keep examples of edge cases:

  • A normal listing page
  • A page with missing prices or empty fields
  • A page with a changed card layout
  • A page that returned a soft error inside a 200 response
  • A page with no results
  • A page with unusual characters, currencies, or encodings

When a production run fails, save the raw HTML and add it as a fixture before changing the parser. That gives you a regression test for the exact breakage. Over time, your fixture folder becomes a map of the target site's weirdness.

You can also test failure behavior directly:

def test_parse_books_skips_incomplete_cards() -> None:
    html = """
    <html>
      <body>
        <article class="product_pod">
          <h3><a href="book.html" title="Broken book"></a></h3>
        </article>
      </body>
    </html>
    """
 
    books = parse_books(html, "https://example.com/catalogue/page-1.html")
    assert books == []

That test is small, but it documents an important rule: incomplete cards should not crash the run.

Cache pages and crawl incrementally

Caching makes scrapers cheaper, faster, and easier to debug. If a page has not changed since your last run, you may not need to fetch it again. At minimum, keep a local record of visited URLs, last fetch time, status code, content hash, and extracted record count.

import hashlib
from dataclasses import dataclass
 
@dataclass
class PageSnapshot:
    url: str
    status_code: int
    content_hash: str
    record_count: int
 
def hash_html(html: str) -> str:
    return hashlib.sha256(html.encode("utf-8")).hexdigest()
 
def snapshot_page(url: str, status_code: int, html: str, record_count: int) -> PageSnapshot:
    return PageSnapshot(
        url=url,
        status_code=status_code,
        content_hash=hash_html(html),
        record_count=record_count,
    )

For sites that support HTTP caching, preserve ETag and Last-Modified headers. On the next run, send If-None-Match or If-Modified-Since. A 304 Not Modified response means the server is telling you the page has not changed, so you can reuse your previous parse result.

headers = dict(HEADERS)
headers["If-None-Match"] = previous_etag
 
response = session.get(url, headers=headers, timeout=15)
if response.status_code == 304:
    print("Page unchanged, reusing cached result")
else:
    response.raise_for_status()

Incremental crawling is the same idea at the URL level. Instead of crawling every page every time, prioritize pages that are new, recently changed, important to the business, or historically unstable. For ecommerce sites, category pages may need frequent refreshes, while old product detail pages can be checked less often. For documentation sites, sitemap lastmod values can help you choose what to revisit first.

Keep incremental logic simple until the data proves it needs to be fancy:

  • New URLs go to the front of the queue.
  • Failed URLs get a limited number of retries.
  • Recently changed URLs are revisited sooner.
  • Stable URLs are revisited later.
  • Removed pages are marked inactive instead of immediately deleted.

This approach also helps when a run is interrupted. If every page has a status in your database, the next run can resume from unfinished URLs instead of starting over.
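
A resumable queue along those lines can be sketched with one SQLite table. The frontier table, its status values, and the helper names are assumptions chosen for illustration; an in-memory database is used so the example runs anywhere.

```python
import sqlite3

def init_frontier(conn: sqlite3.Connection) -> None:
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS frontier (
            url TEXT PRIMARY KEY,
            status TEXT NOT NULL DEFAULT 'pending',
            attempts INTEGER NOT NULL DEFAULT 0
        )
        """
    )

def add_urls(conn: sqlite3.Connection, urls: list[str]) -> None:
    # INSERT OR IGNORE keeps the existing status of URLs seen before.
    conn.executemany("INSERT OR IGNORE INTO frontier (url) VALUES (?)", [(u,) for u in urls])
    conn.commit()

def mark(conn: sqlite3.Connection, url: str, ok: bool) -> None:
    status = "done" if ok else "failed"
    conn.execute(
        "UPDATE frontier SET status = ?, attempts = attempts + 1 WHERE url = ?",
        (status, url),
    )
    conn.commit()

def unfinished(conn: sqlite3.Connection, max_attempts: int = 3) -> list[str]:
    """URLs never fetched, plus failed ones that still have retries left."""
    rows = conn.execute(
        """
        SELECT url FROM frontier
        WHERE status = 'pending' OR (status = 'failed' AND attempts < ?)
        ORDER BY attempts, url
        """,
        (max_attempts,),
    )
    return [row[0] for row in rows]

conn = sqlite3.connect(":memory:")
init_frontier(conn)
add_urls(conn, ["https://example.com/a", "https://example.com/b", "https://example.com/c"])
mark(conn, "https://example.com/a", ok=True)
mark(conn, "https://example.com/b", ok=False)
todo = unfinished(conn)
print(todo)
```

Ordering by attempts puts never-fetched URLs ahead of retries, which matches the "new URLs first, limited retries" rules above.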

Make failures observable

A scraper without logs is a guessing game. At minimum, log the URL, status code, elapsed time, content type, record count, and parser errors.

import logging
import time
 
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
 
def fetch_with_logging(session: requests.Session, url: str) -> str:
    started = time.perf_counter()
    response = session.get(url, timeout=15)
    elapsed_ms = int((time.perf_counter() - started) * 1000)
 
    logging.info(
        "fetched url=%s status=%s elapsed_ms=%s content_type=%s bytes=%s",
        url,
        response.status_code,
        elapsed_ms,
        response.headers.get("Content-Type"),
        len(response.content),
    )
 
    response.raise_for_status()
    return response.text

Track these metrics over time:

  • Fetch success rate
  • Parser success rate
  • Records extracted per page
  • Duplicate rate
  • Median and p95 fetch latency
  • Count of 403, 404, 429, and 5xx responses
  • Number of pages skipped by robots or policy rules

The most useful alert is often "records per page dropped to zero." It catches empty HTML, selector drift, bot pages, and broken JavaScript rendering in one signal.
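
A few of those metrics can be computed from per-page records with the standard library. This is a sketch under assumptions: PageStat and summarize are hypothetical names, the sample numbers are invented, and the function expects at least two pages because statistics.quantiles needs two data points.

```python
import statistics
from dataclasses import dataclass

@dataclass
class PageStat:
    url: str
    status_code: int
    elapsed_ms: float
    record_count: int

def summarize(stats: list[PageStat]) -> dict[str, float]:
    ok = [s for s in stats if 200 <= s.status_code < 300]
    latencies = [s.elapsed_ms for s in stats]
    return {
        "fetch_success_rate": len(ok) / len(stats),
        "median_ms": statistics.median(latencies),
        # n=20 gives 5% steps, so the last cut point is the 95th percentile.
        # method="inclusive" avoids extrapolating past the slowest sample.
        "p95_ms": statistics.quantiles(latencies, n=20, method="inclusive")[-1],
        # Successful fetches that produced nothing are the interesting ones.
        "zero_record_pages": sum(1 for s in ok if s.record_count == 0),
    }

# Invented sample data for one small run.
stats = [
    PageStat("https://example.com/p1", 200, 100.0, 20),
    PageStat("https://example.com/p2", 200, 120.0, 0),
    PageStat("https://example.com/p3", 200, 140.0, 20),
    PageStat("https://example.com/p4", 429, 900.0, 0),
]
summary = summarize(stats)
print(summary)
```

Counting zero-record pages only among successful responses separates "the fetch failed" from "the fetch worked but the parser found nothing".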

Debug common scraping failures

When a scraper breaks, do not immediately add proxies or a headless browser. First, identify the failure mode.

Symptom | Likely cause | What to check
403 Forbidden | Access denied, policy block, missing session, or bot protection | Compare headers, cookies, robots, Terms, and browser behavior
429 Too Many Requests | Rate limit | Read Retry-After, reduce concurrency, add backoff
200 with no records | Selector drift or rendered content | Save HTML, inspect it, compare with browser DOM
200 with CAPTCHA content | Automated access blocked | Stop or use an approved access path
Timeout | Slow server, network issue, heavy page | Increase read timeout, retry gently, reduce concurrency
Garbled text | Wrong encoding | Check response.encoding, content type, and parser output
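
The "garbled text" case is easier to recognize once you have seen it reproduced. The sketch below simulates a UTF-8 page being decoded as ISO-8859-1, which is one common way this happens when a server does not declare a charset. The price string is just sample data.

```python
original = "£51.77"

# What the server sends: UTF-8 bytes.
wire_bytes = original.encode("utf-8")

# What you see if the client guesses ISO-8859-1 instead.
garbled = wire_bytes.decode("iso-8859-1")

# Reversing the wrong guess recovers the text, which confirms the diagnosis.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(repr(garbled), repr(repaired))
```

With requests, the usual remedy is to set response.encoding to the correct charset before reading response.text, or to hand response.content to the parser and let it detect the encoding.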

Always save the raw HTML for failed pages:

from pathlib import Path
import hashlib
 
def save_debug_html(url: str, html: str) -> Path:
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:12]
    path = Path("data/raw") / f"{digest}.html"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return path

That file is the fastest way to answer, "Did fetching fail, or did parsing fail?"

Rate limiting and backoff

Polite scraping is not just ethics. It improves reliability. A simple delay is better than nothing, but production scrapers should respond to server signals.

import email.utils
import random
import time
from datetime import datetime, timezone
 
def retry_after_seconds(value: str | None) -> float | None:
    if not value:
        return None
 
    if value.isdigit():
        return float(value)
 
    try:
        retry_at = email.utils.parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
 
    if retry_at.tzinfo is None:
        retry_at = retry_at.replace(tzinfo=timezone.utc)
 
    return max(0.0, (retry_at - datetime.now(timezone.utc)).total_seconds())
 
def polite_sleep(base: float = 1.0, jitter: float = 1.5) -> None:
    time.sleep(base + random.random() * jitter)

Use exponential backoff for temporary failures:

def backoff_delay(attempt: int, base: float = 0.75, cap: float = 30.0) -> float:
    delay = min(cap, base * (2 ** attempt))
    return delay + random.uniform(0, delay * 0.25)

Backoff is not a way to force your way through a block. It is a way to be less noisy when a server is busy or rate limiting you.

When to consider Scrapy

This guide focuses on requests, BeautifulSoup, httpx, and Playwright because they are easy to understand one piece at a time. Scrapy is worth evaluating when your project starts needing framework features:

  • URL scheduling and deduplication
  • Per-domain concurrency limits
  • Built-in retry and redirect middleware
  • Item pipelines for validation and storage
  • Incremental crawls
  • Crawl depth controls
  • Large crawl observability

Scrapy has a steeper learning curve than a small script, but it pays off when the crawler itself becomes a long-lived system. A common path is to prototype extraction with BeautifulSoup, then move the project into Scrapy once you know the target pages, item schema, and crawl rules.

When to use a scraping API

At some point, building scraper infrastructure becomes the job. Browser pools, proxy procurement, CAPTCHA handling, fingerprint consistency, retries, monitoring, and selector maintenance can consume more time than the data product you meant to build.

Build your own scraper when:

  • The target pages are simple and stable
  • You have permission or own the source
  • The extraction logic is highly custom
  • Cost per page matters enough to justify ongoing engineering work
  • Your team is comfortable operating crawlers

Use a managed scraping API when:

  • You need clean Markdown, HTML, screenshots, or structured output quickly
  • You do not want to manage browser infrastructure
  • The target sites change frequently
  • Scraping is an input to your product, not the product itself
  • Reliability matters more than owning every low-level detail

Context.dev provides web scraping endpoints that return clean Markdown or raw HTML from a URL, so you can focus on the extraction and application layer.

import requests
 
response = requests.get(
    "https://api.context.dev/v1/scrape/markdown",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    params={"url": "https://example.com/products"},
    timeout=30,
)
response.raise_for_status()
 
data = response.json()
print(data["markdown"])

You can also use Context.dev for crawling an entire site, extracting raw HTML, and scraping any URL to clean Markdown.

For a deeper walkthrough of extracting HTML through an API, read how to extract raw HTML from any URL with a single API call. If you work in several languages, the guide to web scraping in any language covers Node.js, Go, Ruby, and more.

A production checklist

Before you call a scraper production-ready, run through this checklist:

  • The target site's Terms, robots.txt, and data sensitivity have been reviewed.
  • Every HTTP request has a timeout.
  • Sessions, retries, and backoff are configured intentionally.
  • Concurrency is bounded per host.
  • Raw HTML is saved for failed pages.
  • Parsers are tested against saved fixtures.
  • Extracted records have validation.
  • Storage is incremental and deduplicated.
  • Logs include URL, status, latency, and record count.
  • Alerts catch zero-record pages and spikes in 403, 429, or parser failures.
  • The scraper can resume after interruption.
  • The code separates fetching, parsing, validation, and storage.
  • There is a clear owner for selector maintenance.

This checklist is boring in the best way. It prevents the common failure where a scraper works during a demo, then silently produces bad data for three weeks.
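
Two of the checklist items, structured logs and zero-record alerts, are small enough to sketch with the standard library. The `PageResult` dataclass and the `needs_alert` rule below are assumptions made for illustration; a real alerting rule would more likely look at rates over a time window than at single pages.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("scraper")


@dataclass
class PageResult:
    """Hypothetical per-page summary recorded after each fetch-and-parse step."""
    url: str
    status: int
    latency_ms: float
    record_count: int


def needs_alert(result: PageResult) -> bool:
    """Flag pages that were blocked, throttled, or parsed to zero records."""
    if result.status in (403, 429):
        return True
    # A 200 response with no records often means the selectors have drifted.
    return result.status == 200 and result.record_count == 0


def log_result(result: PageResult) -> None:
    """Log URL, status, latency, and record count; warn when the page looks wrong."""
    level = logging.WARNING if needs_alert(result) else logging.INFO
    logger.log(
        level,
        "url=%s status=%d latency_ms=%.0f records=%d",
        result.url, result.status, result.latency_ms, result.record_count,
    )
```

Calling `log_result` once per page gives you a consistent line to search and count, which is what makes silent zero-record failures visible.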

Frequently asked questions

What is the best Python library for web scraping?

For static HTML, start with requests and BeautifulSoup. For many concurrent static pages, use httpx with bounded concurrency. For JavaScript-rendered pages, use Playwright. For large crawls with scheduling, retry policies, and pipelines, evaluate Scrapy. There is no single best library; there is a best fit for each target page and for your operational needs.

Should I use BeautifulSoup or lxml directly?

BeautifulSoup is easier to read and teach. Using it with the lxml parser gives you a good mix of ergonomics and speed. Direct lxml can be faster and gives strong XPath support, but it is less approachable for many teams. If parser speed is your bottleneck, benchmark both on real pages before switching.

How do I scrape a site that uses JavaScript?

First, inspect the Network tab for a JSON endpoint. If the data is available through a stable request that you are allowed to use, call that endpoint directly. If the page must run JavaScript or requires interaction, use Playwright, wait for a specific selector, then parse page.content() with your normal HTML parser.

How do I avoid getting blocked?

The most reliable answer is to scrape politely and within the site's rules: reduce request volume, use reasonable delays, respect rate limits, cache pages, and prefer official APIs. If you receive a CAPTCHA, login wall, or explicit block, treat it as a stop signal unless you have permission and an approved access path.
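
As one way to respect rate limits in code, the sketch below computes how long to wait before retrying a throttled request. The `retry_delay` function and its parameters are hypothetical. It assumes the server sends `Retry-After` as a number of seconds (the header can also be an HTTP date, which this sketch ignores) and otherwise falls back to exponential backoff with random jitter.

```python
import random
from typing import Optional


def retry_delay(
    attempt: int,
    retry_after: Optional[str] = None,
    base: float = 1.0,
    cap: float = 60.0,
) -> float:
    """Seconds to wait before retry number `attempt` (counting from 0)."""
    # If the server gave a wait time in seconds, honor it as-is.
    if retry_after is not None and retry_after.strip().isdigit():
        return float(retry_after.strip())
    # Otherwise back off exponentially, capped, with jitter so that
    # parallel workers do not all retry at the same moment.
    backoff = min(cap, base * (2 ** attempt))
    return random.uniform(0, backoff)
```

A caller would typically pass `response.headers.get("Retry-After")` after a 429, sleep for the returned duration, and give up after a small fixed number of attempts.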

Is async scraping always faster?

No. Async helps when network waiting dominates and the target site can handle the request rate. It does not make parsing faster, and it can make blocking more likely if you raise concurrency carelessly. Start with small concurrency, measure success rate and latency, then tune.
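
A minimal sketch of "small concurrency" using only the standard library: a semaphore caps how many fetches run at once. The `fetch_all` helper is hypothetical, and `fake_fetch` is a stand-in for a real request (for example, a coroutine that awaits an httpx client call), so the example runs without touching the network.

```python
import asyncio


async def fetch_all(urls, fetch, limit=5):
    """Run `fetch(url)` for every URL with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url):
        async with semaphore:  # wait here while `limit` fetches are already running
            return await fetch(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded(url) for url in urls))


async def fake_fetch(url):
    """Placeholder for a real HTTP call; simulates a little network latency."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"
```

You would run it with something like `asyncio.run(fetch_all(urls, fake_fetch, limit=3))`, then raise `limit` gradually while watching success rate and latency.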

Should I store scraped data as CSV, JSON, or a database?

Use CSV for quick analysis and spreadsheet workflows. Use JSON for nested data and API handoff. Use SQLite when you need deduplication, incremental updates, resume behavior, and local querying. Move to Postgres or a warehouse when multiple services or analysts need the data.
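
To show what deduplication and incremental updates look like in SQLite, here is a small sketch that keys each record on its URL and updates the row when the same URL is scraped again. The `products` table, its columns, and the helper names are assumptions for illustration, and the `ON CONFLICT ... DO UPDATE` clause needs SQLite 3.24 or newer.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS products (
            url        TEXT PRIMARY KEY,   -- one row per page, so reruns cannot duplicate
            title      TEXT NOT NULL,
            price      REAL,
            scraped_at TEXT NOT NULL DEFAULT (datetime('now'))
        )
        """
    )
    return conn


def save_product(conn, url, title, price):
    """Insert a record, or refresh the existing row if the URL was seen before."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            """
            INSERT INTO products (url, title, price)
            VALUES (?, ?, ?)
            ON CONFLICT(url) DO UPDATE SET
                title = excluded.title,
                price = excluded.price,
                scraped_at = datetime('now')
            """,
            (url, title, price),
        )
```

Because the URL is the primary key, a resumed run can check which URLs already have rows and skip them, or simply save again and let the upsert refresh the data.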

When should I stop maintaining my own scraper?

When maintenance becomes a regular tax. If you spend more time fixing browser infrastructure, rate limits, selectors, and retries than using the data, a managed API is probably cheaper. The same is true when scraping is only one input to your product and reliability matters more than owning the mechanics.

Wrapping up

Good Python web scraping is disciplined data engineering. Fetch slowly and explicitly, parse defensively, validate records, store incrementally, and make failures visible. The beginner stack of requests plus BeautifulSoup can take you far when pages are simple. httpx adds controlled concurrency. Playwright handles pages that genuinely need a browser. Scrapy becomes useful when the crawl itself turns into a system.

The real skill is knowing when not to add more machinery. Start with the simplest scraper that returns correct data. Add retries because you have measured temporary failures. Add async because the job is network-bound. Add Playwright because the content is rendered client-side. Use a managed API when infrastructure is pulling attention away from the product you actually want to build.
