Python is still the most practical language for web scraping in 2026. The ecosystem is mature, the libraries are stable, and you can move from a five-line script to a production crawler without changing languages. A good Python scraper can fetch simple HTML with requests, parse messy markup with BeautifulSoup, run concurrent jobs with httpx, render JavaScript pages with Playwright, validate extracted records, and store the result in a format your product can actually use.
The hard part is not making one request. The hard part is building a scraper that keeps working after the target site changes its HTML, slows you down, returns a CAPTCHA page, moves content into an API response, or ships a new React bundle. This guide focuses on the whole workflow, not just the first happy path.
By the end, you will have a practical mental model for:
- Choosing the right Python scraping tool for the target page
- Fetching pages safely with timeouts, headers, sessions, and retries
- Parsing HTML with selectors that are readable and resilient
- Handling pagination without duplicate requests
- Using httpx for bounded async concurrency
- Rendering JavaScript-heavy pages with Playwright when a browser is genuinely needed
- Storing results in CSV, JSON, or SQLite
- Debugging common failures like 403, 429, empty HTML, and selector drift
- Deciding when a managed scraping API is a better use of engineering time
Start with the legal and ethical baseline
Web scraping can be legitimate, but "publicly visible" does not mean "free to collect however you want." Before you build anything production-facing, check the target site's Terms of Service, its robots.txt, the sensitivity of the data, and the amount of load your scraper will create. If you are collecting personal data, regulated data, or data behind authentication, involve legal review instead of treating scraping as a purely technical problem.
A practical baseline:
- Prefer official APIs when they exist. They are usually more stable, documented, and contractually clear.
- Read robots.txt. It is available at paths like https://example.com/robots.txt. It is not a complete legal framework, but it is a clear operational signal from the site owner.
- Identify your client honestly when appropriate. For internal crawlers, a descriptive user agent and contact URL can make abuse reports easier to resolve.
- Throttle requests. A slow scraper that runs reliably is better than an aggressive scraper that trips protection systems and creates load.
- Avoid authentication walls and paywalls unless you have permission. Do not treat CAPTCHA pages, login requirements, or explicit blocks as puzzles to defeat.
- Minimize data collection. Collect the fields you need, not every page and every attribute just because it is there.
The safest technical decision is often the simplest one: scrape less, request slowly, cache aggressively, and use an official API when the site provides one.
Choose the simplest tool that works
Most Python scraping projects get overcomplicated early. Start with the cheapest tool that can return the data correctly.
| Target page | Recommended tool | Why |
|---|---|---|
| Static HTML page | requests plus BeautifulSoup | Fast, simple, cheap, easy to debug |
| Static pages at moderate volume | httpx plus BeautifulSoup | Async I/O and connection pooling |
| Site with crawl rules, queues, item pipelines | Scrapy | Built-in scheduler, retries, pipelines, and middleware |
| JavaScript-rendered page | Playwright | Runs a real browser and returns the rendered DOM |
| Heavily protected or frequently changing sites | Managed scraping API | Offloads browsers, proxies, retries, and extraction infrastructure |
Use browser automation only when the page truly requires it. A headless browser is powerful, but it is also slower, heavier, and more expensive to run than a normal HTTP request. Before reaching for Playwright, open DevTools, check the Network tab, and see whether the page is loading the data from a stable JSON endpoint. Calling that endpoint directly is usually faster and easier to maintain than parsing a rendered DOM.
Set up a clean Python project
Use Python 3.12 or newer for new scraping projects. Python 3.14 is current in 2026, but 3.12 remains a comfortable baseline for teams that care about library compatibility and deployment targets.
Create a fresh virtual environment:
```shell
python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
```
On Windows PowerShell, activate the environment with:
```powershell
.venv\Scripts\Activate.ps1
```
Install the core stack:
```shell
pip install requests beautifulsoup4 lxml httpx playwright pydantic
playwright install chromium
```
For repeatable installs, pin versions after you have tested the scraper:
```shell
pip freeze > requirements.txt
```
Do not copy old pinned versions from a blog post and assume they are current. Let your own environment produce the lockfile, then update dependencies intentionally.
A small project structure is enough for most scrapers:
```text
scraper/
  __init__.py
  fetch.py
  parse.py
  models.py
  storage.py
  run.py
data/
  raw/
  processed/
requirements.txt
```
That separation matters once the scraper grows. Fetching, parsing, validation, and storage fail in different ways, so keep them in different modules. It also makes tests far easier: save one HTML fixture, test the parser against it, and avoid hitting the live site on every test run.
Fetch HTML with requests
requests is the right default for a first scraper. It is synchronous, readable, and good enough for a surprising amount of production work.
```python
import requests

URL = "https://books.toscrape.com/catalogue/category/books/mystery_3/index.html"
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

response = requests.get(URL, headers=HEADERS, timeout=15)
response.raise_for_status()
html = response.text
print(response.status_code)
print(len(html))
```
Three habits make simple scrapers much less fragile:
- Always set a timeout. A hanging socket should not freeze your job forever.
- Always call raise_for_status(). Silent 404 and 500 responses create confusing parser errors later.
- Send a realistic User-Agent. The default Python client identity is often treated as low-quality automated traffic.
Query parameters should go through params, not string concatenation:
```python
params = {"q": "laptop stand", "page": 2}
response = requests.get("https://example.com/search", params=params, headers=HEADERS, timeout=15)
```
Let the HTTP client encode the URL. It prevents subtle bugs with spaces, symbols, and repeated parameters.
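To make the encoding concrete, here is a small standard-library sketch. It uses urllib.parse.urlencode rather than requests itself, and the parameter values are made up for illustration; requests builds its query string in much the same way when you pass params.

```python
from urllib.parse import urlencode

# Hypothetical search parameters: a space, an ampersand, and a repeated key.
params = {"q": "laptop stand & riser", "tag": ["desk", "office"], "page": 2}

# doseq=True expands the list into repeated tag=... pairs.
query = urlencode(params, doseq=True)
print(query)  # q=laptop+stand+%26+riser&tag=desk&tag=office&page=2
```

Hand-built f-strings tend to get the ampersand case wrong: an unescaped & inside a value silently starts a new parameter.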
Reuse sessions and configure retries
If you make more than one request to the same host, use a Session. It keeps TCP connections open, persists cookies, and gives you one place to configure headers.
```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session() -> requests.Session:
    retry = Retry(
        total=3,
        connect=3,
        read=3,
        status=3,
        backoff_factor=0.5,
        status_forcelist=(429, 500, 502, 503, 504),
        allowed_methods=("GET", "HEAD"),
        respect_retry_after_header=True,
    )
    adapter = HTTPAdapter(max_retries=retry, pool_connections=20, pool_maxsize=20)
    session = requests.Session()
    session.headers.update(HEADERS)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


session = build_session()
response = session.get(URL, timeout=15)
response.raise_for_status()
```
Retries should be conservative. Retrying a temporary 503 is reasonable. Retrying a 403 fifty times is not. When a site tells you to slow down with 429 Too Many Requests or a Retry-After header, slow down.
Check robots.txt before crawling
For a one-off page fetch, manually reading the site's policy may be enough. For a crawler, automate the check. Python includes urllib.robotparser, which can answer whether a user agent is allowed to fetch a URL according to the site's robots.txt.
```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def can_fetch(url: str, user_agent: str = "*") -> bool:
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    parser = RobotFileParser()
    parser.set_url(robots_url)
    parser.read()
    return parser.can_fetch(user_agent, url)


target = "https://books.toscrape.com/catalogue/page-1.html"
if not can_fetch(target):
    raise RuntimeError(f"robots.txt disallows scraping {target}")
```
Treat this as one input, not a complete decision engine. robots.txt can be missing, stale, or broad. Terms of Service, rate limits, account agreements, privacy rules, and common sense still matter.
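A crawler usually wants to parse robots.txt once per host and also read any Crawl-delay it declares. The sketch below shows both with the same standard-library parser. The robots.txt content and the "my-crawler" user agent are made up for illustration, and the file is parsed from a string so the example makes no network call.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a crawler would fetch this once per host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    parser = RobotFileParser()
    # parse() accepts the file's lines, so nothing is downloaded here.
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
allowed = parser.can_fetch("my-crawler", "https://example.com/catalogue/page-1.html")
blocked = parser.can_fetch("my-crawler", "https://example.com/private/report.html")
delay = parser.crawl_delay("my-crawler")  # None when no Crawl-delay is declared
print(allowed, blocked, delay)
```

Keeping one parser per host in a dictionary avoids downloading robots.txt again for every URL, and the crawl delay can feed directly into your throttle.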
Parse HTML with BeautifulSoup
Once you have HTML, BeautifulSoup gives you a forgiving tree API. Use the lxml parser for speed and tolerance of imperfect markup.
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")
title = soup.find("h1")
print(title.get_text(strip=True) if title else "No title")

links = soup.find_all("a", href=True)
for link in links[:5]:
    print(link["href"], link.get_text(" ", strip=True))

price = soup.select_one("p.price_color")
print(price.get_text(strip=True) if price else "No price")
```
For production scrapers, prefer small helper functions over inline parsing everywhere. They make missing elements explicit and keep parser code readable.
```python
from bs4 import Tag


def text_or_none(node: Tag | None) -> str | None:
    if node is None:
        return None
    text = node.get_text(" ", strip=True)
    return text or None


def attr_or_none(node: Tag | None, name: str) -> str | None:
    if node is None:
        return None
    value = node.get(name)
    return value if isinstance(value, str) and value else None
```
Now missing fields are a normal case, not a surprise AttributeError.
Build a realistic product scraper
The sandbox site books.toscrape.com is useful because it behaves like a small ecommerce listing without creating load on a real retailer. Here is a parser that extracts typed records from one listing page.
```python
import re
from dataclasses import dataclass
from decimal import Decimal
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://books.toscrape.com/catalogue/"
RATING_MAP = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


@dataclass(frozen=True)
class Book:
    title: str
    price: Decimal
    rating: int
    url: str


def parse_price(raw: str) -> Decimal:
    # Keep only digits and the decimal point, so a currency symbol (or a
    # mis-decoded one such as "Â£") cannot break Decimal().
    cleaned = re.sub(r"[^\d.]", "", raw)
    return Decimal(cleaned)


def parse_books(html: str, page_url: str) -> list[Book]:
    soup = BeautifulSoup(html, "lxml")
    books: list[Book] = []
    for article in soup.select("article.product_pod"):
        title_node = article.select_one("h3 a")
        price_node = article.select_one("p.price_color")
        rating_node = article.select_one("p.star-rating")
        if not title_node or not price_node or not rating_node:
            continue
        title = title_node.get("title", "").strip()
        href = title_node.get("href", "")
        rating_classes = rating_node.get("class", [])
        rating_name = next((c for c in rating_classes if c in RATING_MAP), None)
        if not title or not href or rating_name is None:
            continue
        books.append(
            Book(
                title=title,
                price=parse_price(price_node.get_text(strip=True)),
                rating=RATING_MAP[rating_name],
                url=urljoin(page_url, href),
            )
        )
    return books
```
There are a few intentional choices here:
- Decimal is better than float for prices.
- urljoin handles relative links correctly.
- Missing nodes cause the item to be skipped instead of crashing the whole job.
- The parser accepts html and page_url, which makes it easy to test with saved fixtures.
For stricter data quality, you can collect skipped records and log why they were skipped. In production, a sudden jump in skipped items is often the first sign that the target site's HTML changed.
Handle pagination without duplicate fetches
Many tutorials accidentally fetch the same page twice: once to parse items and once to find the next link. Do both from the same HTML response.
```python
import time
from collections.abc import Iterator
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def fetch_html(session: requests.Session, url: str) -> str:
    response = session.get(url, timeout=15)
    response.raise_for_status()
    return response.text


def find_next_page(html: str, page_url: str) -> str | None:
    soup = BeautifulSoup(html, "lxml")
    next_link = soup.select_one("li.next a[href]")
    if not next_link:
        return None
    return urljoin(page_url, next_link["href"])


def crawl_books(start_url: str, delay_seconds: float = 1.0) -> Iterator[Book]:
    session = build_session()
    page_url: str | None = start_url
    while page_url:
        html = fetch_html(session, page_url)
        yield from parse_books(html, page_url)
        page_url = find_next_page(html, page_url)
        if page_url:
            # Pause only when another request will follow.
            time.sleep(delay_seconds)


books = list(crawl_books("https://books.toscrape.com/catalogue/page-1.html"))
print(f"Scraped {len(books)} books")
```
For numbered pagination, a range loop is fine when you know the bounds:
```python
# parse_products stands in for your own parser for the target site.
for page_number in range(1, 51):
    url = f"https://example.com/products?page={page_number}"
    html = fetch_html(session, url)
    items = parse_products(html, url)
```
When you do not know the last page, follow the next link until it disappears, or parse the last page number from the pagination controls on the first page.
Use httpx for bounded async scraping
Async scraping is useful when you have many independent URLs and each request spends most of its time waiting on the network. The key word is bounded. asyncio.gather() over 10,000 URLs can overwhelm your machine and the target site. Use a semaphore or a worker queue to keep concurrency under control.
```python
import asyncio

import httpx

CONCURRENCY = 8


async def fetch_one(client: httpx.AsyncClient, url: str, semaphore: asyncio.Semaphore) -> str:
    async with semaphore:
        response = await client.get(url, follow_redirects=True)
        response.raise_for_status()
        return response.text


async def fetch_many(urls: list[str]) -> list[str]:
    timeout = httpx.Timeout(connect=5.0, read=20.0, write=5.0, pool=5.0)
    limits = httpx.Limits(max_connections=20, max_keepalive_connections=10)
    semaphore = asyncio.Semaphore(CONCURRENCY)
    async with httpx.AsyncClient(headers=HEADERS, timeout=timeout, limits=limits) as client:
        tasks = [fetch_one(client, url, semaphore) for url in urls]
        return await asyncio.gather(*tasks)


urls = [f"https://books.toscrape.com/catalogue/page-{i}.html" for i in range(1, 4)]
html_pages = asyncio.run(fetch_many(urls))
```
Start with low concurrency, measure success rate, then increase slowly. If total run time drops but error rates climb, you are not making the scraper better. You are just making it louder.
For serious async pipelines, return structured results instead of raw strings:
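```python
import asyncio
from dataclasses import dataclass

import httpx


@dataclass
class FetchResult:
    url: str
    status_code: int | None
    html: str | None
    error: str | None


async def fetch_result(client: httpx.AsyncClient, url: str, semaphore: asyncio.Semaphore) -> FetchResult:
    async with semaphore:
        try:
            response = await client.get(url, follow_redirects=True)
            response.raise_for_status()
            return FetchResult(url=url, status_code=response.status_code, html=response.text, error=None)
        except httpx.HTTPStatusError as exc:
            # Keep the status code so 403, 429, and 5xx can be told apart later.
            return FetchResult(url=url, status_code=exc.response.status_code, html=None, error=repr(exc))
        except httpx.HTTPError as exc:
            # Timeouts and connection errors never produced a response.
            return FetchResult(url=url, status_code=None, html=None, error=repr(exc))
```
That shape lets the rest of the pipeline continue even when a few URLs fail.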
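The worker-queue alternative to a semaphore, mentioned at the start of this section, can be sketched with asyncio.Queue. A fixed number of workers pull URLs from a shared queue, so at most worker_count fetches are in flight. The fake_fetch coroutine is a stand-in for a real client call so the example runs without network access.

```python
import asyncio
from collections.abc import Awaitable, Callable


async def run_workers(
    urls: list[str],
    fetch: Callable[[str], Awaitable[str]],
    worker_count: int = 4,
) -> dict[str, str]:
    """Process URLs with a fixed pool of workers pulling from a shared queue."""
    queue: asyncio.Queue[str] = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    results: dict[str, str] = {}

    async def worker() -> None:
        while True:
            try:
                url = queue.get_nowait()
            except asyncio.QueueEmpty:
                return  # nothing left, so this worker exits
            results[url] = await fetch(url)

    await asyncio.gather(*(worker() for _ in range(worker_count)))
    return results


async def fake_fetch(url: str) -> str:
    # Stand-in for client.get(); sleeps briefly instead of touching the network.
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


pages = asyncio.run(run_workers([f"https://example.com/p/{i}" for i in range(10)], fake_fetch))
print(len(pages))
```

Unlike a semaphore around gather(), this pattern never creates one task per URL, which matters more as the URL list grows.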
Use Playwright for JavaScript-rendered pages
If requests.get(url).text returns a mostly empty shell but your browser shows real content, the page may be rendering data with JavaScript. Playwright runs Chromium, Firefox, or WebKit programmatically, so you can wait for the rendered DOM and then parse it.
```python
from playwright.sync_api import sync_playwright


def render_html(url: str, selector: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(
            user_agent=HEADERS["User-Agent"],
            viewport={"width": 1280, "height": 800},
        )
        page = context.new_page()
        page.goto(url, wait_until="domcontentloaded", timeout=30_000)
        page.wait_for_selector(selector, timeout=15_000)
        html = page.content()
        browser.close()
        return html


html = render_html("https://example.com/products", ".product-card")
```
Prefer waiting for a specific selector over waiting for vague page states. The selector says, "the data I need is present." A generic network idle wait can be flaky on pages with analytics, ads, live chat, or long-polling requests.
Playwright is also useful when you need interaction:
```python
from playwright.sync_api import Page


def load_more_products(page: Page, max_clicks: int = 50) -> None:
    # max_clicks guards against looping forever if the button never disappears.
    for _ in range(max_clicks):
        button = page.get_by_role("button", name="Load more")
        if not button.count():
            return
        button.click()
        page.wait_for_timeout(750)
```
Use browser automation sparingly. Running hundreds of browser contexts costs memory and CPU. If you only need one JSON response that the page fetches after load, call that endpoint directly instead.
Inspect the Network tab before parsing the DOM
Many "JavaScript scraping" tasks are really API discovery tasks. Open DevTools, refresh the page, and filter Network requests by Fetch/XHR. Look for JSON responses that contain the data you need. If you find one, you can often replace a fragile browser scraper with a simple HTTP request.
```python
import requests

api_url = "https://example.com/api/products"
params = {"page": 1, "category": "chairs"}
response = requests.get(api_url, params=params, headers=HEADERS, timeout=15)
response.raise_for_status()
data = response.json()
for item in data["products"]:
    print(item["name"], item["price"])
```
Be careful here. An endpoint being visible in DevTools does not mean it is open for unrestricted use. Check whether the request depends on authentication, CSRF tokens, signed URLs, or Terms of Service restrictions. The right conclusion might be "ask for API access", not "copy every private request header."
Validate extracted data
Scrapers fail quietly when you let every field be a string. A price becomes "Sold out", a date becomes "Coming soon", or a selector starts returning a promotional badge instead of a product title. Add validation close to the parser.
pydantic is useful when records will move into an application or data pipeline:
```python
from decimal import Decimal

from pydantic import BaseModel, HttpUrl, field_validator


class ProductRecord(BaseModel):
    title: str
    price: Decimal
    rating: int
    url: HttpUrl

    @field_validator("title")
    @classmethod
    def title_must_not_be_blank(cls, value: str) -> str:
        value = value.strip()
        if not value:
            raise ValueError("title is blank")
        return value

    @field_validator("rating")
    @classmethod
    def rating_must_be_valid(cls, value: int) -> int:
        if value < 0 or value > 5:
            raise ValueError("rating must be between 0 and 5")
        return value
```
Validation should not make your scraper brittle. It should make failures visible. Store invalid records separately with the source URL, raw value, and error message so you can fix the parser without guessing.
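Keeping invalid records separate can be sketched with a small partition step. To stay runnable without third-party packages, this example uses a hand-written validate_price function in place of a pydantic model; RejectedRecord, partition, and the sample rows are hypothetical names introduced for illustration.

```python
from collections.abc import Callable
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation


@dataclass
class RejectedRecord:
    source_url: str
    raw: dict[str, str]
    error: str


def validate_price(raw: dict[str, str]) -> Decimal:
    # Hypothetical validator; a model's validation call could go here instead.
    try:
        return Decimal(raw["price"])
    except (KeyError, InvalidOperation) as exc:
        raise ValueError(f"bad price: {raw.get('price')!r}") from exc


def partition(
    rows: list[dict[str, str]],
    source_url: str,
    validate: Callable[[dict[str, str]], Decimal],
) -> tuple[list[Decimal], list[RejectedRecord]]:
    """Split rows into validated values and rejects that keep their raw input."""
    valid: list[Decimal] = []
    rejected: list[RejectedRecord] = []
    for row in rows:
        try:
            valid.append(validate(row))
        except ValueError as exc:
            rejected.append(RejectedRecord(source_url, row, str(exc)))
    return valid, rejected


rows = [{"price": "19.99"}, {"price": "Sold out"}]
valid, rejected = partition(rows, "https://example.com/products", validate_price)
print(valid, rejected[0].error)
```

Because each reject carries the source URL and the raw input, you can replay it against a fixed parser later instead of re-crawling the page.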
Store results in the right format
CSV is still the easiest format for spreadsheets and quick inspection:
```python
import csv
from dataclasses import asdict
from pathlib import Path


def save_books_csv(books: list[Book], path: str) -> None:
    output = Path(path)
    output.parent.mkdir(parents=True, exist_ok=True)
    with output.open("w", newline="", encoding="utf-8") as file:
        writer = csv.DictWriter(file, fieldnames=["title", "price", "rating", "url"])
        writer.writeheader()
        for book in books:
            row = asdict(book)
            row["price"] = str(book.price)
            writer.writerow(row)
```
JSON is better for nested records and API handoff:
```python
import json
from dataclasses import asdict


def save_books_json(books: list[Book], path: str) -> None:
    rows = []
    for book in books:
        row = asdict(book)
        row["price"] = str(book.price)
        rows.append(row)
    with open(path, "w", encoding="utf-8") as file:
        json.dump(rows, file, ensure_ascii=False, indent=2)
```
SQLite is the sweet spot when you need deduplication, incremental updates, and simple querying without running a database server:
```python
import sqlite3
from pathlib import Path


def init_db(path: str = "data/books.sqlite") -> sqlite3.Connection:
    # sqlite3 does not create missing parent directories, so create them first.
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS books (
            url TEXT PRIMARY KEY,
            title TEXT NOT NULL,
            price TEXT NOT NULL,
            rating INTEGER NOT NULL,
            scraped_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        )
        """
    )
    return conn


def upsert_books(conn: sqlite3.Connection, books: list[Book]) -> None:
    conn.executemany(
        """
        INSERT INTO books (url, title, price, rating)
        VALUES (?, ?, ?, ?)
        ON CONFLICT(url) DO UPDATE SET
            title = excluded.title,
            price = excluded.price,
            rating = excluded.rating,
            scraped_at = CURRENT_TIMESTAMP
        """,
        [(book.url, book.title, str(book.price), book.rating) for book in books],
    )
    conn.commit()
```
For large jobs, write incrementally. Holding every record in memory is fine for 500 rows and foolish for 5 million. Stream pages through parse, validate, and store steps as soon as each page finishes.
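Incremental writing can be sketched with a generator and fixed-size batches. The pages table, the batch size of 500, and the generated rows are assumptions for illustration; the point is that only one batch is held in memory at a time.

```python
import sqlite3
from collections.abc import Iterable, Iterator
from itertools import islice


def batched(rows: Iterable[tuple[str, str]], size: int) -> Iterator[list[tuple[str, str]]]:
    """Yield lists of at most `size` rows without materializing the whole input."""
    iterator = iter(rows)
    while batch := list(islice(iterator, size)):
        yield batch


def stream_to_sqlite(conn: sqlite3.Connection, rows: Iterable[tuple[str, str]], size: int = 500) -> int:
    """Write rows batch by batch, committing after each batch."""
    conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT NOT NULL)")
    written = 0
    for batch in batched(rows, size):
        conn.executemany("INSERT OR REPLACE INTO pages (url, title) VALUES (?, ?)", batch)
        conn.commit()
        written += len(batch)
    return written


# Hypothetical generator standing in for a crawl that yields records lazily.
rows = ((f"https://example.com/item/{i}", f"Item {i}") for i in range(1200))
conn = sqlite3.connect(":memory:")
total = stream_to_sqlite(conn, rows, size=500)
print(total, conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0])
```

Committing per batch also means an interrupted run loses at most one batch of work.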
Test parsers with saved fixtures
The fetcher talks to the network. The parser should not need to. Save representative HTML files and test extraction against those fixtures. This catches selector drift, protects you from accidental parser regressions, and lets you work locally without hammering the target site.
A simple fixture workflow:
```python
from pathlib import Path


def load_fixture(name: str) -> str:
    return Path("tests/fixtures").joinpath(name).read_text(encoding="utf-8")


def test_parse_books_page() -> None:
    html = load_fixture("books-page-1.html")
    books = parse_books(html, "https://books.toscrape.com/catalogue/page-1.html")
    assert len(books) == 20
    assert books[0].title
    assert books[0].url.startswith("https://books.toscrape.com/")
    assert 0 <= books[0].rating <= 5
```
The most valuable fixtures are not perfect pages. Keep examples of edge cases:
- A normal listing page
- A page with missing prices or empty fields
- A page with a changed card layout
- A page that returned a soft error inside a 200 response
- A page with no results
- A page with unusual characters, currencies, or encodings
When a production run fails, save the raw HTML and add it as a fixture before changing the parser. That gives you a regression test for the exact breakage. Over time, your fixture folder becomes a map of the target site's weirdness.
You can also test failure behavior directly:
```python
def test_parse_books_skips_incomplete_cards() -> None:
    html = """
    <html>
      <body>
        <article class="product_pod">
          <h3><a href="book.html" title="Broken book"></a></h3>
        </article>
      </body>
    </html>
    """
    books = parse_books(html, "https://example.com/catalogue/page-1.html")
    assert books == []
```
That test is small, but it documents an important rule: incomplete cards should not crash the run.
Cache pages and crawl incrementally
Caching makes scrapers cheaper, faster, and easier to debug. If a page has not changed since your last run, you may not need to fetch it again. At minimum, keep a local record of visited URLs, last fetch time, status code, content hash, and extracted record count.
```python
import hashlib
from dataclasses import dataclass


@dataclass
class PageSnapshot:
    url: str
    status_code: int
    content_hash: str
    record_count: int


def hash_html(html: str) -> str:
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def snapshot_page(url: str, status_code: int, html: str, record_count: int) -> PageSnapshot:
    return PageSnapshot(
        url=url,
        status_code=status_code,
        content_hash=hash_html(html),
        record_count=record_count,
    )
```
For sites that support HTTP caching, preserve ETag and Last-Modified headers. On the next run, send If-None-Match or If-Modified-Since. A 304 Not Modified response means the server is telling you the page has not changed, so you can reuse your previous parse result.
```python
# previous_etag is the ETag value saved from the last successful fetch of this URL.
headers = dict(HEADERS)
headers["If-None-Match"] = previous_etag
response = session.get(url, headers=headers, timeout=15)
if response.status_code == 304:
    print("Page unchanged, reusing cached result")
else:
    response.raise_for_status()
```
Incremental crawling is the same idea at the URL level. Instead of crawling every page every time, prioritize pages that are new, recently changed, important to the business, or historically unstable. For ecommerce sites, category pages may need frequent refreshes, while old product detail pages can be checked less often. For documentation sites, sitemap lastmod values can help you choose what to revisit first.
Keep incremental logic simple until the data proves it needs to be fancy:
- New URLs go to the front of the queue.
- Failed URLs get a limited number of retries.
- Recently changed URLs are revisited sooner.
- Stable URLs are revisited later.
- Removed pages are marked inactive instead of immediately deleted.
This approach also helps when a run is interrupted. If every page has a status in your database, the next run can resume from unfinished URLs instead of starting over.
Make failures observable
A scraper without logs is a guessing game. At minimum, log the URL, status code, elapsed time, content type, record count, and parser errors.
```python
import logging
import time

import requests

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)


def fetch_with_logging(session: requests.Session, url: str) -> str:
    started = time.perf_counter()
    response = session.get(url, timeout=15)
    elapsed_ms = int((time.perf_counter() - started) * 1000)
    logging.info(
        "fetched url=%s status=%s elapsed_ms=%s bytes=%s",
        url,
        response.status_code,
        elapsed_ms,
        len(response.content),
    )
    response.raise_for_status()
    return response.text
```
Track these metrics over time:
- Fetch success rate
- Parser success rate
- Records extracted per page
- Duplicate rate
- Median and p95 fetch latency
- Count of 403, 404, 429, and 5xx responses
- Number of pages skipped by robots or policy rules
The most useful alert is often "records per page dropped to zero." It catches empty HTML, selector drift, bot pages, and broken JavaScript rendering in one signal.
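A few of these metrics, plus the zero-record alert, can be computed from per-page log entries with the standard library alone. PageMetric and the sample values below are assumptions for illustration, and summarize expects at least two pages because statistics.quantiles requires that.

```python
from dataclasses import dataclass
from statistics import median, quantiles


@dataclass
class PageMetric:
    url: str
    status_code: int
    elapsed_ms: int
    record_count: int


def summarize(metrics: list[PageMetric]) -> dict[str, float]:
    """Compute success rate and latency percentiles for one run."""
    latencies = [m.elapsed_ms for m in metrics]
    ok = sum(1 for m in metrics if 200 <= m.status_code < 300)
    return {
        "success_rate": ok / len(metrics),
        "median_ms": median(latencies),
        # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
        "p95_ms": quantiles(latencies, n=20)[-1],
    }


def zero_record_pages(metrics: list[PageMetric]) -> list[str]:
    """Pages that fetched successfully but produced no records."""
    return [m.url for m in metrics if m.status_code == 200 and m.record_count == 0]


metrics = [
    PageMetric("https://example.com/a", 200, 120, 20),
    PageMetric("https://example.com/b", 200, 150, 20),
    PageMetric("https://example.com/c", 200, 90, 0),
    PageMetric("https://example.com/d", 429, 40, 0),
    PageMetric("https://example.com/e", 200, 300, 20),
]
summary = summarize(metrics)
print(summary["success_rate"], zero_record_pages(metrics))
```

With only a handful of pages the p95 estimate is rough, so treat it as a trend to watch over many runs.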
Debug common scraping failures
When a scraper breaks, do not immediately add proxies or a headless browser. First, identify the failure mode.
| Symptom | Likely cause | What to check |
|---|---|---|
| 403 Forbidden | Access denied, policy block, missing session, or bot protection | Compare headers, cookies, robots, Terms, and browser behavior |
| 429 Too Many Requests | Rate limit | Read Retry-After, reduce concurrency, add backoff |
| 200 with no records | Selector drift or rendered content | Save HTML, inspect it, compare with browser DOM |
| 200 with CAPTCHA content | Automated access blocked | Stop or use an approved access path |
| Timeout | Slow server, network issue, heavy page | Increase read timeout, retry gently, reduce concurrency |
| Garbled text | Wrong encoding | Check response.encoding, content type, and parser output |
Always save the raw HTML for failed pages:
```python
import hashlib
from pathlib import Path


def save_debug_html(url: str, html: str) -> Path:
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:12]
    path = Path("data/raw") / f"{digest}.html"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return path
```
That file is the fastest way to answer, "Did fetching fail, or did parsing fail?"
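The triage table above can also be turned into a first-pass classifier so that logs carry a failure mode rather than only a status code. This is a rough heuristic sketch: the "captcha" marker string and the 2000-character threshold for an empty shell are assumptions you would tune per site.

```python
def classify_failure(status_code: int, html: str, record_count: int) -> str:
    """Map a fetch outcome to a rough failure mode. Marker strings are assumptions."""
    lowered = html.lower()
    if status_code == 403:
        return "access_denied"
    if status_code == 429:
        return "rate_limited"
    if status_code >= 500:
        return "server_error"
    if status_code >= 400:
        return "client_error"
    if status_code == 200 and "captcha" in lowered:
        return "captcha_page"
    if status_code == 200 and record_count == 0:
        # An almost empty body suggests client-side rendering; otherwise suspect selectors.
        return "empty_shell" if len(lowered) < 2000 else "selector_drift"
    return "ok"


print(classify_failure(200, "<html><div id='root'></div></html>", 0))
print(classify_failure(200, "<p>Please solve this CAPTCHA</p>", 0))
print(classify_failure(429, "", 0))
```

A classifier like this only narrows the search; the saved HTML is still what confirms the diagnosis.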
Rate limiting and backoff
Polite scraping is not just ethics. It improves reliability. A simple delay is better than nothing, but production scrapers should respond to server signals.
```python
import email.utils
import random
import time
from datetime import datetime, timezone


def retry_after_seconds(value: str | None) -> float | None:
    if not value:
        return None
    if value.isdigit():
        return float(value)
    try:
        retry_at = email.utils.parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if retry_at.tzinfo is None:
        retry_at = retry_at.replace(tzinfo=timezone.utc)
    return max(0.0, (retry_at - datetime.now(timezone.utc)).total_seconds())


def polite_sleep(base: float = 1.0, jitter: float = 1.5) -> None:
    time.sleep(base + random.random() * jitter)
```
Use exponential backoff for temporary failures:
```python
import random


def backoff_delay(attempt: int, base: float = 0.75, cap: float = 30.0) -> float:
    delay = min(cap, base * (2 ** attempt))
    return delay + random.uniform(0, delay * 0.25)
```
Backoff is not a way to force your way through a block. It is a way to be less noisy when a server is busy or rate limiting you.
When to consider Scrapy
This guide focuses on requests, BeautifulSoup, httpx, and Playwright because they are easy to understand one piece at a time. Scrapy is worth evaluating when your project starts needing framework features:
- URL scheduling and deduplication
- Per-domain concurrency limits
- Built-in retry and redirect middleware
- Item pipelines for validation and storage
- Incremental crawls
- Crawl depth controls
- Large crawl observability
Scrapy has a steeper learning curve than a small script, but it pays off when the crawler itself becomes a long-lived system. A common path is to prototype extraction with BeautifulSoup, then move the project into Scrapy once you know the target pages, item schema, and crawl rules.
When to use a scraping API
At some point, building scraper infrastructure becomes the job. Browser pools, proxy procurement, CAPTCHA handling, fingerprint consistency, retries, monitoring, and selector maintenance can consume more time than the data product you meant to build.
Build your own scraper when:
- The target pages are simple and stable
- You have permission or own the source
- The extraction logic is highly custom
- Cost per page matters enough to justify ongoing engineering work
- Your team is comfortable operating crawlers
Use a managed scraping API when:
- You need clean Markdown, HTML, screenshots, or structured output quickly
- You do not want to manage browser infrastructure
- The target sites change frequently
- Scraping is an input to your product, not the product itself
- Reliability matters more than owning every low-level detail
Context.dev provides web scraping endpoints that return clean Markdown or raw HTML from a URL, so you can focus on the extraction and application layer.
```python
import requests

response = requests.get(
    "https://api.context.dev/v1/scrape/markdown",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    params={"url": "https://example.com/products"},
    timeout=30,
)
response.raise_for_status()
data = response.json()
print(data["markdown"])
```
You can also use Context.dev for crawling an entire site, extracting raw HTML, and scraping any URL to clean Markdown.
For a deeper walkthrough of extracting HTML through an API, read how to extract raw HTML from any URL with a single API call. If you work in several languages, the guide to web scraping in any language covers Node.js, Go, Ruby, and more.
A production checklist
Before you call a scraper production-ready, run through this checklist:
- The target site's Terms, robots.txt, and data sensitivity have been reviewed.
- Every HTTP request has a timeout.
- Sessions, retries, and backoff are configured intentionally.
- Concurrency is bounded per host.
- Raw HTML is saved for failed pages.
- Parsers are tested against saved fixtures.
- Extracted records have validation.
- Storage is incremental and deduplicated.
- Logs include URL, status, latency, and record count.
- Alerts catch zero-record pages and spikes in 403, 429, or parser failures.
- The scraper can resume after interruption.
- The code separates fetching, parsing, validation, and storage.
- There is a clear owner for selector maintenance.
This checklist is boring in the best way. It prevents the common failure where a scraper works during a demo, then silently produces bad data for three weeks.
Frequently asked questions
What is the best Python library for web scraping?
For static HTML, start with requests and BeautifulSoup. For many concurrent static pages, use httpx with bounded concurrency. For JavaScript-rendered pages, use Playwright. For large crawls with scheduling, retry policies, and pipelines, evaluate Scrapy. There is no single best library; there is a best fit for the page and operational needs.
Should I use BeautifulSoup or lxml directly?
BeautifulSoup is easier to read and teach. Using it with the lxml parser gives you a good mix of ergonomics and speed. Direct lxml can be faster and gives strong XPath support, but it is less approachable for many teams. If parser speed is your bottleneck, benchmark both on real pages before switching.
How do I scrape a site that uses JavaScript?
First, inspect the Network tab for a JSON endpoint. If the data is available through a stable request that you are allowed to use, call that endpoint directly. If the page must run JavaScript or requires interaction, use Playwright, wait for a specific selector, then parse page.content() with your normal HTML parser.
How do I avoid getting blocked?
The most reliable answer is to scrape politely and within the site's rules: reduce request volume, use reasonable delays, respect rate limits, cache pages, and prefer official APIs. If you receive a CAPTCHA, login wall, or explicit block, treat it as a stop signal unless you have permission and an approved access path.
Is async scraping always faster?
No. Async helps when network waiting dominates and the target site can handle the request rate. It does not make parsing faster, and it can make blocking more likely if you raise concurrency carelessly. Start with small concurrency, measure success rate and latency, then tune.
Should I store scraped data as CSV, JSON, or a database?
Use CSV for quick analysis and spreadsheet workflows. Use JSON for nested data and API handoff. Use SQLite when you need deduplication, incremental updates, resume behavior, and local querying. Move to Postgres or a warehouse when multiple services or analysts need the data.
When should I stop maintaining my own scraper?
When maintenance becomes a regular tax. If you spend more time fixing browser infrastructure, rate limits, selectors, and retries than using the data, a managed API is probably cheaper. The same is true when scraping is only one input to your product and reliability matters more than owning the mechanics.
Wrapping up
Good Python web scraping is disciplined data engineering. Fetch slowly and explicitly, parse defensively, validate records, store incrementally, and make failures visible. The beginner stack of requests plus BeautifulSoup can take you far when pages are simple. httpx adds controlled concurrency. Playwright handles pages that genuinely need a browser. Scrapy becomes useful when the crawl itself turns into a system.
The real skill is knowing when not to add more machinery. Start with the simplest scraper that returns correct data. Add retries because you have measured temporary failures. Add async because the job is network-bound. Add Playwright because the content is rendered client-side. Use a managed API when infrastructure is pulling attention away from the product you actually want to build.