I've written a lot of scrapers, and almost all of them have failed in the same way at some point. The script runs fine on Monday. By Wednesday every request comes back as a 403 Forbidden or a 429 Too Many Requests, and I'm staring at a stack trace wondering what changed. Usually nothing changed on my end. The site just got tired of me.
That's the thing most tutorials skip: HTTP errors during web scraping aren't random noise. They're the target site talking to you. Sometimes it's a server hiccup. More often it's an anti-bot layer telling you that your requests look automated, arrive too fast, or come from an IP it doesn't trust. Once you learn to read these signals, debugging gets a lot less painful.
This guide walks through the HTTP errors that scrapers actually hit in the wild. For each one I'll explain why it fires, show how to confirm what you're dealing with, and give you code you can paste into a real project. If your end goal is feeding clean data to a model, the roundup of web scraping APIs built for AI is a useful companion. Otherwise, let's get into the errors.
A Quick Mental Model: Three Buckets
Before the code, it helps to sort every status code into one of three buckets. I do this in my head every time a request fails, because it tells me whether to retry, fix something, or give up.
- You broke the request. The server understood you fine, but your headers, body, or auth were wrong. These are on you to fix. Think 400, 401, 422.
- The site blocked you. Your request looked automated or abusive, so the anti-bot layer rejected it. This is the interesting bucket. Think 403, 429, and a lot of disguised 503 and 520 responses.
- The server had a problem. Nothing to do with you. The origin is overloaded, down, or timing out. Just retry sensibly. Think 500, 502, 521, 522, 524.
The reason this matters: retrying a request from bucket one will never work, no matter how many times you try. Retrying bucket three usually works on its own. Bucket two is where you spend your engineering effort, because the fix is almost always "look more like a real browser and slow down."
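As a rough illustration, the three buckets can be written down as a lookup table. This is a minimal sketch rather than a complete taxonomy: the names classify_status and should_retry and the bucket labels are invented for this example, and 503 and 520 are deliberately marked ambiguous because they need a look at the response body before you decide.

```python
# Hypothetical helper: maps a failed status code to one of the three buckets.
# All names and labels here are illustrative, not from any library.
FIX_REQUEST = "fix_request"        # bucket one: you broke the request
BLOCKED = "blocked"                # bucket two: the site blocked you
SERVER_PROBLEM = "server_problem"  # bucket three: the server had a problem
AMBIGUOUS = "check_body"           # could be bucket two or three

_BUCKETS = {
    400: FIX_REQUEST, 401: FIX_REQUEST, 422: FIX_REQUEST,
    403: BLOCKED, 429: BLOCKED,
    500: SERVER_PROBLEM, 502: SERVER_PROBLEM,
    521: SERVER_PROBLEM, 522: SERVER_PROBLEM, 524: SERVER_PROBLEM,
}


def classify_status(code: int) -> str:
    """Return the bucket for a failed status code, or AMBIGUOUS if unsure."""
    return _BUCKETS.get(code, AMBIGUOUS)


def should_retry(code: int) -> bool:
    """Only bucket three is worth retrying without changing anything."""
    return classify_status(code) == SERVER_PROBLEM


for code in (401, 429, 502, 503):
    print(code, classify_status(code), should_retry(code))
```

Anything that comes back as check_body is a cue to print the response before deciding, not a cue to retry blindly.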
HTTP Status Codes at a Glance: a Scraper's Reference
Here's the cheat sheet I keep open. Not every error means you're blocked, and treating them all the same is how people waste hours debugging the wrong thing.
| Status Code | Name | What it usually means for scrapers | Blocked? |
|---|---|---|---|
| 400 | Bad Request | Malformed headers or request body | Sometimes, check your payload |
| 401 | Unauthorized | Missing credentials or auth token | Yes, you need to authenticate first |
| 403 | Forbidden | Anti-bot block, bad UA, bad IP, geo-block | Yes |
| 404 | Not Found | Page doesn't exist or was removed | No, skip and move on |
| 407 | Proxy Auth Required | Your proxy needs credentials | Yes, fix proxy config |
| 408 | Request Timeout | Server gave up waiting for your request | No, transient, retry |
| 410 | Gone | Resource permanently deleted | No, remove from queue |
| 422 | Unprocessable Entity | Server rejects your params or body | Sometimes, check your POST body |
| 429 | Too Many Requests | Rate-limited: too many requests too fast | Yes |
| 451 | Unavailable For Legal Reasons | Content geo-blocked or removed by law | Yes, need different IP or skip |
| 500 | Internal Server Error | Server bug, retry is fine | No, transient, retry with backoff |
| 502 | Bad Gateway | Upstream server gave a bad response | No, transient, retry |
| 503 | Service Unavailable | Overloaded or anti-bot challenge | Sometimes |
| 504 | Gateway Timeout | Upstream server took too long | No, transient, retry |
| 520 | Unknown Error (Cloudflare) | Origin returned unexpected response | Yes |
| 521 | Web Server Is Down (Cloudflare) | Origin is unreachable | No, site is down, retry later |
| 522 | Connection Timed Out (Cloudflare) | TCP handshake timeout to origin | No, transient, retry |
| 524 | A Timeout Occurred (Cloudflare) | Origin took too long to respond | No, transient, retry |
The short version: 4xx codes usually mean the server understood your request and refused it, often because it decided you're a bot. 5xx codes usually mean a server-side problem, frequently transient, but occasionally an anti-bot challenge wearing a 503 costume. We'll get to that disguise later because it trips people up constantly.
How to Actually Diagnose a Block
Most people, myself included on a bad day, read the status code and immediately start guessing. Don't do that. Spend two minutes gathering evidence first. It's faster in the long run.
When a request fails, I check three things before I change a single line of scraping logic.
First, print the response body. A status code tells you almost nothing on its own. The body tells you everything. A real 403 from a CDN often includes a Cloudflare ray ID or a "Sorry, you have been blocked" page. A challenge page says "Checking your browser" or "Just a moment." A genuine application error returns a JSON message. You cannot tell these apart from the status code alone.
```python
import httpx

resp = httpx.get("https://example.com/products")
print("status:", resp.status_code)
print("server:", resp.headers.get("server"))
print("body preview:", resp.text[:600])
```

Second, look at the response headers. The Server, CF-RAY, Retry-After, and cf-mitigated headers reveal who's blocking you and why. If you see server: cloudflare, you now know which anti-bot vendor you're up against. If you see Retry-After, the server is literally telling you how long to wait.
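The header check can be sketched as a small helper that pulls out only the diagnostic headers mentioned here. The function name summarize_block_headers and the sample values are made up for illustration. It accepts any mapping of header names to values, so a plain dict works, and an httpx response's headers object should too since it behaves like a mapping.

```python
from typing import Mapping

# The four headers discussed above, lowercased for case-insensitive lookup.
DIAGNOSTIC_HEADERS = ("server", "cf-ray", "retry-after", "cf-mitigated")


def summarize_block_headers(headers: Mapping[str, str]) -> dict:
    """Return only the diagnostic headers that are present, keyed in lowercase."""
    lowered = {name.lower(): value for name, value in headers.items()}
    return {name: lowered[name] for name in DIAGNOSTIC_HEADERS if name in lowered}


# Sample values are invented for this example.
sample = {
    "Server": "cloudflare",
    "CF-RAY": "abc123-FRA",
    "Content-Type": "text/html",
}
summary = summarize_block_headers(sample)
print(summary)  # {'server': 'cloudflare', 'cf-ray': 'abc123-FRA'}
```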
Third, reproduce it in curl. This is the step people skip, and it's the most useful one. If you can reproduce the failure with a plain curl command, you can iterate quickly without rerunning your whole pipeline. Copy the request as cURL straight from your browser's DevTools Network tab, run it, and see if it succeeds where your scraper fails. The difference between those two requests is your bug.
```bash
curl -sS -D - "https://example.com/products" \
  -H "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36" \
  -o /dev/null
```

Once you've got evidence, you can match it to the sections below. Now the fun part.
403 Forbidden: You Look Like a Bot
Why You're Getting It
A 403 Forbidden during web scraping is almost never a traditional permissions error. It's an active block. The site, or more accurately its CDN or anti-bot layer, looked at your request and decided it didn't come from a real browser. Here are the usual culprits, roughly in the order I check them:
- A missing or unrealistic User-Agent. python-requests/2.31.0 is a flashing neon sign that says "I am a script." Real browsers send long, specific UA strings.
- Missing secondary headers. Browsers send Accept, Accept-Language, Accept-Encoding, the Sec-Fetch-* family, and Cache-Control. A bare requests.get() sends almost none of that, and the gap is easy to detect.
- No cookies or session state. Plenty of sites check whether you accepted a consent banner or carry a session token from a previous page. Show up cold and you look suspicious.
- A TLS fingerprint that doesn't match. Python's ssl module and Chrome produce different TLS Client Hello messages. Tools like Cloudflare fingerprint this (JA3/JA4) during the TLS handshake, before they even read your HTTP headers.
- A bad IP reputation. Data-center ranges (AWS, GCP, DigitalOcean) are heavily flagged because that's where scrapers live. Residential IPs get the benefit of the doubt.
- Geo-restrictions. Some pages only serve specific countries, and everyone else gets a 403.
I've seen all six of these in production. The good news is they stack from cheap to expensive, so you fix them in order and stop as soon as the block clears.
The Fix
Step 1: Send realistic headers. This alone clears a surprising number of 403s. Open DevTools, find the request in the Network tab, copy the headers a real browser sent, and feed them to your client.
```python
import httpx

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/125.0.0.0 Safari/537.36"
    ),
    "Accept": (
        "text/html,application/xhtml+xml,application/xml;"
        "q=0.9,image/avif,image/webp,*/*;q=0.8"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    # "br" assumes httpx's optional brotli support is installed;
    # without it, brotli-encoded bodies are not decoded
    "Accept-Encoding": "gzip, deflate, br",
    "DNT": "1",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
}

with httpx.Client(headers=HEADERS, follow_redirects=True) as client:
    resp = client.get("https://example.com/products")
    resp.raise_for_status()
    print(resp.text[:500])
```

One detail people miss: header order matters to some fingerprinting systems. Browsers send headers in a consistent order, and a few WAFs flag requests where the order looks wrong. Most of the time you won't need to worry about it, but if realistic header values aren't enough, header ordering is the next thing to investigate.
Step 2: Keep a session and seed cookies. If the target sets cookies on its homepage, visit the homepage first so your later requests carry a legitimate-looking session.
```python
import httpx

with httpx.Client(headers=HEADERS, follow_redirects=True) as client:
    # Seed cookies by visiting the home page first
    client.get("https://example.com/")
    # Now hit the target page with the seeded session
    resp = client.get("https://example.com/products")
```

Step 3: Switch to residential proxies. If headers and cookies don't help, your IP is the problem. A data-center IP is guilty until proven innocent. Route through rotating proxies backed by a residential pool, and a lot of 403s simply vanish.
```python
import httpx

proxy = "http://user:pass@residential-proxy.example.com:8080"

with httpx.Client(headers=HEADERS, proxy=proxy, follow_redirects=True) as client:
    resp = client.get("https://example.com/products")
```

Step 4: Use a headless browser for JavaScript-gated 403s. Some 403s fire because the page runs JavaScript that inspects the browser environment before serving real content. No amount of header tuning fixes that, because requests and httpx never run the JavaScript. You need Playwright or Puppeteer.
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context(
        user_agent=HEADERS["User-Agent"],
        locale="en-US",
        extra_http_headers={"Accept-Language": "en-US,en;q=0.9"},
    )
    page = context.new_page()
    page.goto("https://example.com/products")
    html = page.content()
    browser.close()
```

If you only need the fully rendered markup and don't want to babysit a browser, the guide on extracting raw HTML from any URL with a single API call covers the managed alternative to running Playwright yourself.
429 Too Many Requests: You're Being Rate-Limited
Why You're Getting It
A 429 Too Many Requests is the server politely asking you to slow down. Most production sites and APIs enforce rate limits: a cap on how many requests one IP or API key can make in a rolling window. Cross it and you get a 429. Many of these responses include a Retry-After header that tells you exactly how long to wait. Ignore it and keep hammering, and a 429 can quietly escalate into a 403 ban that sticks around for hours.
This is the single most common way amateur scrapers fall over at scale, and it's almost always the same mistake: a tight loop with no delays, hitting one IP, with a predictable URL pattern, at machine speed. No human browses like that, and the server knows it.
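One wrinkle with the Retry-After header mentioned above: the HTTP spec allows it to be either a number of seconds or an HTTP-date. The sketch below handles both forms using only the standard library. The function name retry_after_seconds and the optional now parameter (there to make the example deterministic) are assumptions for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(value: str, now: Optional[datetime] = None) -> Optional[float]:
    """Convert a Retry-After value to seconds, or None if it can't be parsed."""
    value = value.strip()
    try:
        # Form 1: delay in seconds, e.g. "120"
        return max(0.0, float(value))
    except ValueError:
        pass
    try:
        # Form 2: HTTP-date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means "no need to wait"
    return max(0.0, (when - now).total_seconds())


fixed_now = datetime(2015, 10, 21, 7, 27, 0, tzinfo=timezone.utc)
print(retry_after_seconds("120"))
print(retry_after_seconds("Wed, 21 Oct 2015 07:28:00 GMT", now=fixed_now))
print(retry_after_seconds("soon"))
```

When the function returns None, falling back to exponential backoff is a reasonable default.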
The Fix: Backoff With Jitter, and Respect Retry-After
The pattern that works is straightforward after you've written it once:
- If the response has a Retry-After header, wait that long before retrying.
- If it doesn't, use exponential backoff: wait base * 2^attempt seconds.
- Add jitter (random noise) so a fleet of workers doesn't all retry at the same instant and re-trigger the limit. This thundering-herd problem is real and I've caused it more than once.
- Cap the retries and raise an error once you hit the cap, instead of looping forever.
```python
import random
import time

import httpx


def fetch_with_backoff(
    url: str,
    *,
    headers: dict | None = None,
    max_retries: int = 6,
    base_delay: float = 1.0,
    max_delay: float = 120.0,
    session: httpx.Client | None = None,
) -> httpx.Response:
    """
    Fetch a URL with automatic retry on 429 and 503.
    Respects Retry-After, otherwise uses exponential backoff with full jitter.
    """
    _client = session or httpx.Client(follow_redirects=True)
    try:
        for attempt in range(max_retries):
            resp = _client.get(url, headers=headers)
            if resp.status_code not in (429, 503):
                resp.raise_for_status()
                return resp

            # Default: exponential backoff with full jitter
            backoff = min(base_delay * (2 ** attempt), max_delay)
            wait = random.uniform(0, backoff)

            # If the server sent Retry-After in seconds, wait at least that
            # long (capped), plus a little jitter so workers don't sync up
            retry_after = resp.headers.get("Retry-After")
            if retry_after:
                try:
                    wait = min(float(retry_after), max_delay) + random.uniform(0, 1)
                except ValueError:
                    # Could be an HTTP-date; keep the jittered backoff
                    pass

            print(f"[attempt {attempt + 1}] {resp.status_code}, sleeping {wait:.1f}s")
            time.sleep(wait)
        raise RuntimeError(f"Exceeded {max_retries} retries for {url}")
    finally:
        # Only close the client if this function created it
        if session is None:
            _client.close()


# Usage
resp = fetch_with_backoff(
    "https://example.com/api/products",
    headers=HEADERS,
    base_delay=2.0,
)
print(resp.status_code)
```

Throttle Before You Get Throttled
Backoff is reactive. It kicks in after you've already been told off. A better habit is to pace yourself so you rarely see a 429 in the first place. The cleanest way I know is a token-bucket throttle: you refill tokens at a fixed rate, and every request spends one. When the bucket is empty, you wait. It smooths your traffic into something that looks deliberate instead of frantic.
```python
import threading
import time

import httpx


class TokenBucket:
    """Allow up to `rate` requests per second, with a small burst allowance."""

    def __init__(self, rate: float, burst: int = 1):
        self.rate = rate
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def take(self) -> None:
        # Sleeping while holding the lock makes other threads queue up behind us
        with self.lock:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens < 1:
                sleep_for = (1 - self.tokens) / self.rate
                time.sleep(sleep_for)
                # Account for the time we just slept so the next call doesn't
                # recount it and let pacing drift faster than `rate`.
                self.last = time.monotonic()
                self.tokens = 0
            else:
                self.tokens -= 1


# Placeholder URL list for the example
urls = ["https://example.com/page/1", "https://example.com/page/2"]

# One request every two seconds, no bursting
bucket = TokenBucket(rate=0.5, burst=1)
for url in urls:
    bucket.take()
    resp = httpx.get(url, headers=HEADERS)
```

Concurrency Limits
When you run workers in parallel, cap concurrency so you don't accidentally fire fifty simultaneous requests from one IP. In asyncio with httpx, a semaphore does the job, and I add a small random delay on top so the requests don't land in lockstep.
```python
import asyncio
import random

import httpx

CONCURRENCY = 5  # max simultaneous in-flight requests


async def scrape_all(urls: list[str]) -> list[str]:
    sem = asyncio.Semaphore(CONCURRENCY)

    async def fetch(client: httpx.AsyncClient, url: str) -> str:
        async with sem:
            # small per-request delay in addition to the semaphore
            await asyncio.sleep(random.uniform(0.5, 1.5))
            r = await client.get(url, headers=HEADERS)
            r.raise_for_status()
            return r.text

    async with httpx.AsyncClient(follow_redirects=True) as client:
        tasks = [fetch(client, url) for url in urls]
        return await asyncio.gather(*tasks)
```

A good rule of thumb: start slow, measure, then speed up. It's much easier to dial concurrency up after you've confirmed the site tolerates it than to recover from an IP ban because you opened at full throttle.
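One way to turn "start slow, measure, then speed up" into code is a delay that shrinks a little after each success and doubles whenever the site pushes back. This is a sketch of that idea under my own assumptions: the class name AdaptiveDelay and every default value are illustrative, and real tuning depends on the target.

```python
class AdaptiveDelay:
    """Per-request delay that eases down on success and backs off on throttling."""

    def __init__(
        self,
        start: float = 2.0,    # opening delay in seconds (start slow)
        floor: float = 0.25,   # never go faster than this
        ceiling: float = 60.0, # never wait longer than this
        step: float = 0.1,     # how much to speed up after a success
        factor: float = 2.0,   # how hard to slow down after a 429/503
    ):
        self.delay = start
        self.floor = floor
        self.ceiling = ceiling
        self.step = step
        self.factor = factor

    def record(self, status_code: int) -> float:
        """Update the delay from the latest response and return the new value."""
        if status_code in (429, 503):
            self.delay = min(self.ceiling, self.delay * self.factor)
        elif 200 <= status_code < 300:
            self.delay = max(self.floor, self.delay - self.step)
        return self.delay


pacer = AdaptiveDelay(start=2.0, step=0.5)
after_success = pacer.record(200)   # speeds up slightly
after_throttle = pacer.record(429)  # slows down sharply
print(after_success, after_throttle)
```

Sleeping for pacer.delay before each request keeps the speed-up gradual while making the slow-down immediate.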
503 Service Unavailable: Overloaded, or a Challenge in Disguise
Why You're Getting It
503 Service Unavailable is the ambiguous one. It can mean the server is genuinely overloaded and wants you to back off. It can also be an anti-bot challenge page wearing a 503 status code. Some CDNs and web application firewalls return a 503 with an HTML body that says "Checking your browser before you access" or "DDoS protection by [vendor]." If you print the body and see that kind of text, it isn't a real 503 at all. It's a JavaScript challenge, and no amount of backoff will solve it.
This is exactly why the diagnosis step earlier matters. The status code lies; the body doesn't.
The Fix
For a genuine 503 (real overload): retry with exponential backoff, the same way you handle a 429. The fetch_with_backoff function above already covers it, because it treats 503 and 429 identically.
For a challenge-page 503: detect it, then switch tools. A simple keyword heuristic catches most of them.
```python
import httpx

CHALLENGE_SIGNALS = [
    "checking your browser",
    "ddos protection",
    "one more step",
    "please wait",
    "just a moment",
    "enable javascript",
]


def is_challenge_page(resp: httpx.Response) -> bool:
    if resp.status_code not in (503, 403):
        return False
    body_lower = resp.text.lower()
    return any(signal in body_lower for signal in CHALLENGE_SIGNALS)


resp = httpx.get("https://example.com", headers=HEADERS)
if is_challenge_page(resp):
    print("Challenge page detected, switch to headless browser or scraping API")
else:
    print(resp.text[:200])
```

Once you've confirmed it's a challenge, you have two real options: drive a headless browser (Playwright, as shown in the 403 section) or hand the request to a managed scraping API that solves challenges for you. Which one you pick depends on how often you hit it and how much maintenance you're willing to own, which is a tradeoff I'll come back to at the end.
520, 521, 522, 524: The Cloudflare Family
What They Mean
Cloudflare sits in front of a huge slice of the web, so you will run into its custom error codes constantly. They fire when Cloudflare can't successfully proxy your request to the origin server:
- 520 Unknown Error. The origin returned an unexpected or empty response. For scrapers this usually means Cloudflare's bot detection flagged your TLS fingerprint, behavior, or IP reputation and returned a 520 instead of a clean 403.
- 521 Web Server Is Down. Cloudflare is fine, but the origin's TCP port is closed. The site is genuinely down. Retry later.
- 522 Connection Timed Out. Cloudflare's TCP handshake to the origin timed out. Usually transient, so retry with backoff.
- 524 A Timeout Occurred. Cloudflare connected, but the origin took too long to respond. Also transient.
Three of those four (521, 522, 524) are not about you. The one that keeps scraper authors up at night is 520.
Why 520 Is the Tricky One
A 520 during scraping is almost always a fingerprinting problem rather than a real origin error. Cloudflare's Bot Management looks at signals that have nothing to do with your HTTP headers:
- TLS Client Hello (the JA3/JA4 fingerprint). Python's ssl module orders cipher suites differently than Chrome does. Cloudflare can block you on that alone, before reading a single header.
- HTTP/2 fingerprint. Chrome and httpx send HTTP/2 settings, header ordering, and pseudo-header sequences in distinct patterns. The mismatch is detectable.
- Behavioral signals. No mouse movement, no scrolling, instant loads with none of the sub-resource fetching a real browser does.
So the fix isn't "send better headers." It's "look like a real browser at the network layer."
Use curl_cffi to impersonate Chrome's TLS fingerprint. This is the cheapest fix and it solves a real chunk of 520s on its own.
```python
from curl_cffi import requests as cffi_requests

# impersonate="chrome120" patches the TLS and HTTP/2 fingerprint
resp = cffi_requests.get(
    "https://cloudflare-protected-site.com",
    impersonate="chrome120",
    headers={"Accept-Language": "en-US,en;q=0.9"},
)
print(resp.status_code, resp.text[:200])
```

If curl_cffi isn't enough, layer on stealth. Libraries like playwright-stealth mask the obvious headless tells: a missing plugin array, the wrong WebGL renderer, the navigator.webdriver flag that screams "automation." And if the site runs full Cloudflare Bot Management and you're still stuck, that's usually my signal that a managed scraping API is the better use of my time than another evening of fingerprint patching.
502 and 504: Gateway Errors
These two show up less in scraping write-ups, but you'll meet them. A 502 Bad Gateway means an upstream server returned an invalid response to the gateway. A 504 Gateway Timeout means the upstream server didn't respond in time. Neither is about your bot. Both are transient. Treat them exactly like a real 503: retry with exponential backoff and a sensible cap. If a particular URL returns 502 or 504 consistently across many retries and a long window, the origin probably has a genuine problem, so log it and move on rather than burning your whole retry budget on one dead endpoint.
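The "log it and move on" advice can be sketched as a small counter of consecutive gateway failures per URL. The class name GatewayFailureTracker and the threshold of five are assumptions for this example, and it deliberately ignores the "long window" part: it only counts consecutive failures, not elapsed time.

```python
from collections import defaultdict


class GatewayFailureTracker:
    """Counts consecutive 502/504 responses per URL (illustrative helper)."""

    def __init__(self, max_consecutive: int = 5):
        self.max_consecutive = max_consecutive
        self._failures = defaultdict(int)

    def record(self, url: str, status_code: int) -> None:
        if status_code in (502, 504):
            self._failures[url] += 1
        else:
            # Any other response resets the streak for this URL
            self._failures.pop(url, None)

    def should_skip(self, url: str) -> bool:
        """True once a URL has failed too many times in a row."""
        return self._failures.get(url, 0) >= self.max_consecutive


tracker = GatewayFailureTracker(max_consecutive=3)
dead = "https://example.com/always-502"
for _ in range(3):
    tracker.record(dead, 502)
print(tracker.should_skip(dead))
```

Checking should_skip before each retry lets the rest of the crawl keep its retry budget.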
The 200 OK That Is Actually a Block
Here's the failure mode that fooled me the longest. You get a 200 OK, your code happily moves on, and a week later you discover your database is full of garbage. The status was fine. The body was a captcha page, a login wall, or an empty shell that loads its real content over an API you never called.
These soft blocks are nastier than a clean 403 because nothing throws an error. You have to validate the content yourself. I now add a sanity check to every scraper: does the page contain a thing I actually expect? A product page should have a price. An article should have more than 500 characters of body text. If the expected marker is missing, I treat the 200 as a failure and route it through the same challenge-handling path as a 503.
```python
import httpx


def looks_like_real_content(html: str, must_contain: list[str]) -> bool:
    # Very short bodies are almost always shells, captchas, or error stubs
    if len(html) < 1000:
        return False
    lowered = html.lower()
    return all(marker.lower() in lowered for marker in must_contain)


resp = httpx.get("https://example.com/products/123", headers=HEADERS)
if resp.status_code == 200 and not looks_like_real_content(resp.text, ["add to cart", "price"]):
    print("200 OK but content is missing, likely a soft block")
```

Don't trust the status code alone. Trust the bytes you got back.
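For the "more than 500 characters of body text" style of check, raw HTML length is misleading because an empty JavaScript shell can be large. A rough sketch using only the standard library's HTMLParser counts visible text instead. The helper name visible_text_length is made up for this example, and the approach is approximate: it skips script, style, and noscript content but knows nothing about CSS-hidden elements.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, ignoring anything inside script/style/noscript."""

    SKIPPED = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_text_length(html: str) -> int:
    """Approximate count of visible characters, with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return len(" ".join(" ".join(parser.chunks).split()))


empty_shell = (
    "<html><head><script>var state = {};</script></head>"
    "<body><div id='root'></div></body></html>"
)
article = "<html><body><p>" + "word " * 200 + "</p></body></html>"
print(visible_text_length(empty_shell), visible_text_length(article) > 500)
```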
Other Status Codes You'll Run Into
401 Unauthorized and 407 Proxy Authentication Required
A 401 means the endpoint wants authentication (an API key, an OAuth token, or a session cookie) and you didn't provide it. Add the right Authorization header, or log in first to obtain a session cookie, then reuse it.
A 407 is the same idea one layer down: your proxy server itself wants credentials. Add Proxy-Authorization, or embed user:pass@host:port in the proxy URL so the credentials travel with every request.
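One practical cause of a 407 is credentials that contain characters such as @, :, or /, which corrupt the proxy URL unless they are percent-encoded. The helper below is a minimal sketch: build_proxy_url is a hypothetical name, and the host and credentials are placeholders.

```python
from urllib.parse import quote


def build_proxy_url(
    host: str,
    port: int,
    username: str,
    password: str,
    scheme: str = "http",
) -> str:
    """Build a proxy URL with percent-encoded credentials."""
    # safe="" forces encoding of every reserved character, including "/" and ":"
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    return f"{scheme}://{user}:{pwd}@{host}:{port}"


# Placeholder host and credentials for illustration
proxy_url = build_proxy_url("proxy.example.com", 8080, "user", "p@ss:word/1")
print(proxy_url)  # http://user:p%40ss%3Aword%2F1@proxy.example.com:8080
```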
404 Not Found and 410 Gone
A 404 means the URL doesn't exist. Maybe the site restructured, the product sold out, or the page was deleted. Don't retry it. Log the URL and skip it. Retrying a 404 is pure wasted budget.
A 410 Gone is a 404 with conviction. The server is explicitly saying the resource is gone for good, so remove it from your crawl queue entirely and don't revisit.
451 Unavailable For Legal Reasons
A 451 means content was withheld because of a legal demand: a DMCA takedown, a court order, a regional regulation. The only real fix is an IP in a jurisdiction where the content isn't restricted. Often the honest answer is that the content is genuinely unavailable to you, so skip it.
422 Unprocessable Entity
A 422 usually fires on POST requests where your body is structurally valid JSON but fails the server's business rules. Check that your fields, required parameters, and data types match what the API expects. The good news is that most 422 responses include a machine-readable error in the body explaining what went wrong, so print resp.text and read it. The answer is almost always right there.
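Reading that error body can be automated a little. The sketch below tries to parse the body as JSON and pull out common error fields. Error formats vary by API, so the key names it looks for (detail, errors, message, error) are guesses for illustration rather than a standard, and explain_validation_error is a hypothetical helper.

```python
import json


def explain_validation_error(body_text: str) -> list:
    """Extract human-readable messages from a 422 body, best effort."""
    try:
        payload = json.loads(body_text)
    except json.JSONDecodeError:
        # Not JSON: return a trimmed preview of the raw body instead
        return [body_text.strip()[:300]]
    if not isinstance(payload, dict):
        return [str(payload)[:300]]

    messages = []
    # Assumed key names; adjust to whatever the target API actually returns
    for key in ("detail", "errors", "message", "error"):
        value = payload.get(key)
        if value is None:
            continue
        items = value if isinstance(value, list) else [value]
        messages.extend(f"{key}: {item}" for item in items)
    return messages or [json.dumps(payload)[:300]]


sample_body = '{"errors": ["price must be a number", "sku is required"]}'
parsed = explain_validation_error(sample_body)
for line in parsed:
    print(line)
```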
Being a Polite Scraper: robots.txt and Per-Domain Pacing
A lot of blocks are self-inflicted, and the cure is courtesy. Sites are far less likely to fight you when your traffic looks reasonable, so a little restraint goes a long way toward keeping your IPs clean.
Start by reading robots.txt. It's not legally binding everywhere, but it tells you which paths the site would rather you leave alone, and some robots.txt files specify a Crawl-delay that hints at how often you can request without annoying anyone. Python's standard library can parse it for you.
```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("MyScraper/1.0", "https://example.com/products"):
    print("Allowed by robots.txt")
```

Beyond that, pace yourself per domain, not globally. If you're crawling twenty sites at once, a single global rate limit either crawls everyone too slowly or hits one small site too hard. Track a separate throttle per hostname so a fragile site gets gentle treatment while a robust one gets the throughput it can handle. The TokenBucket from the 429 section works nicely here: keep a dictionary of buckets keyed by hostname and call take() on the right one before each request.
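Here is a rough sketch of per-host pacing. To keep the block self-contained it swaps the token bucket for a simpler fixed minimum interval per hostname, so treat it as an illustration of the dictionary-keyed-by-hostname idea rather than a drop-in replacement. The class name PerHostPacer, the hostnames, and the intervals are all assumptions, the clock and sleep parameters exist only so the example can run without real waiting, and the class is not thread-safe.

```python
import time
from urllib.parse import urlsplit


class PerHostPacer:
    """Enforces a minimum interval between requests to the same hostname."""

    def __init__(self, default_interval=2.0, overrides=None,
                 clock=time.monotonic, sleep=time.sleep):
        self.default_interval = default_interval
        self.overrides = dict(overrides or {})  # hostname -> interval in seconds
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = {}  # hostname -> earliest time of next request

    def wait(self, url: str) -> float:
        """Block until this URL's host may be hit again; return seconds waited."""
        host = urlsplit(url).hostname or ""
        interval = self.overrides.get(host, self.default_interval)
        now = self._clock()
        ready_at = self._next_allowed.get(host, now)
        delay = max(0.0, ready_at - now)
        if delay:
            self._sleep(delay)
        self._next_allowed[host] = max(now, ready_at) + interval
        return delay


# Demo with a frozen clock and a recording sleep, so nothing actually waits
slept = []
pacer = PerHostPacer(
    default_interval=2.0,
    overrides={"fragile.example.com": 10.0},
    clock=lambda: 100.0,
    sleep=slept.append,
)
pacer.wait("https://example.com/a")          # first hit: no wait
pacer.wait("https://example.com/b")          # same host: waits 2s
pacer.wait("https://fragile.example.com/x")  # different host: no wait
pacer.wait("https://fragile.example.com/y")  # fragile host: waits 10s
print(slept)
```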
Caching helps too. If your pipeline re-requests the same URL during development (and it always does), cache responses locally so you're not re-hitting the live site every time you tweak a parser. Fewer requests means fewer chances to get blocked, and your tests run faster as a bonus.
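A development cache does not need to be elaborate. The sketch below stores each response body on disk under a hash of its URL. It is a minimal illustration with no expiry or invalidation: cached_fetch is a hypothetical helper, and the fetch argument stands in for whatever function actually performs the request (the demo uses a fake one so it runs offline).

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


def cached_fetch(url: str, fetch: Callable[[str], str], cache_dir: Path) -> str:
    """Return the cached body for url, calling fetch(url) only on a miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so any URL maps to a safe, fixed-length filename
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.html"
    if path.exists():
        return path.read_text(encoding="utf-8")
    body = fetch(url)
    path.write_text(body, encoding="utf-8")
    return body


# Demo with a fake fetcher that records how often it is called
calls = []


def fake_fetch(url: str) -> str:
    calls.append(url)
    return "<html>ok</html>"


with tempfile.TemporaryDirectory() as tmp:
    first = cached_fetch("https://example.com/products", fake_fetch, Path(tmp))
    second = cached_fetch("https://example.com/products", fake_fetch, Path(tmp))

print(first == second, len(calls))
```

Deleting the cache directory is the whole invalidation strategy here, which is usually fine while iterating on a parser.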
A Reusable Resilient Fetch Pattern
Let's pull the useful pieces together: realistic headers, backoff with jitter, optional proxy rotation, and challenge detection in one function I can drop into a project and forget about.
```python
import random
import time
from typing import Sequence

import httpx

BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/125.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    # "br" assumes httpx's optional brotli support is installed
    "Accept-Encoding": "gzip, deflate, br",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
}

CHALLENGE_SIGNALS = [
    "checking your browser",
    "ddos protection",
    "just a moment",
    "enable javascript",
    "one more step",
]

RETRYABLE_CODES = {429, 500, 502, 503, 504, 521, 522, 524}


def resilient_fetch(
    url: str,
    proxy_list: Sequence[str] | None = None,
    max_retries: int = 5,
    base_delay: float = 1.5,
    max_delay: float = 90.0,
    extra_headers: dict | None = None,
) -> httpx.Response:
    """
    Fetch a URL with realistic browser headers, optional rotating proxies,
    exponential backoff with jitter on retryable codes, and challenge detection.
    """
    headers = {**BROWSER_HEADERS, **(extra_headers or {})}
    proxy_pool = list(proxy_list) if proxy_list else [None]

    for attempt in range(max_retries):
        # Rotate through proxy pool
        proxy_url = proxy_pool[attempt % len(proxy_pool)]
        try:
            with httpx.Client(
                headers=headers,
                proxy=proxy_url,
                follow_redirects=True,
                timeout=20.0,
            ) as client:
                resp = client.get(url)
        except (httpx.ConnectError, httpx.TimeoutException) as exc:
            print(f"[attempt {attempt + 1}] Network error: {exc}")
            _backoff(attempt, base_delay, max_delay)
            continue

        # Only treat 403/503 bodies as possible challenge pages. Scanning every
        # response (including 200) would false-positive on real HTML that happens
        # to contain phrases like "just a moment".
        if resp.status_code in (403, 503) and any(
            sig in resp.text.lower() for sig in CHALLENGE_SIGNALS
        ):
            raise RuntimeError(
                f"Challenge page on {url}, use headless browser or scraping API"
            )

        if resp.status_code not in RETRYABLE_CODES:
            resp.raise_for_status()
            return resp

        print(f"[attempt {attempt + 1}] {resp.status_code}, retrying")
        retry_after = resp.headers.get("Retry-After")
        try:
            # Honor a numeric Retry-After in full (capped at max_delay)
            time.sleep(min(float(retry_after), max_delay))
        except (TypeError, ValueError):
            # Header missing or an HTTP-date: fall back to jittered backoff
            _backoff(attempt, base_delay, max_delay)

    raise RuntimeError(f"All {max_retries} retries exhausted for {url}")


def _backoff(attempt: int, base: float, cap: float) -> None:
    wait = min(base * (2 ** attempt), cap)
    _backoff_for(wait)


def _backoff_for(max_wait: float) -> None:
    # Full jitter: sleep a random fraction of the computed wait
    time.sleep(random.uniform(0, max_wait))


# --- Example usage ---
PROXIES = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]

resp = resilient_fetch("https://example.com/products", proxy_list=PROXIES)
print(resp.status_code, len(resp.text))
```

For setting up the proxy pool itself, the guide to rotating proxies in web scraping covers sourcing and rotation strategy. If your end goal is feeding an LLM, the list of web scraping APIs built for AI is worth a look, and the writeup on extracting raw HTML from any URL shows how to skip the retry plumbing entirely.
A Real Debugging Session
Let me make this concrete with a job I actually did. I was scraping company addresses for a few thousand domains, and about a fifth of them came back 403. My first instinct was the usual one: blame the IP. So I bought residential proxies, rerouted, and the failure rate barely moved. Money down the drain.
Then I went back to basics and printed the response body. Every failing site returned server: cloudflare and a "Just a moment" page. It wasn't an IP problem at all. It was a TLS fingerprint problem. My httpx requests advertised a cipher order no real Chrome would send, and Cloudflare was bouncing them at the network layer before my fancy proxies ever mattered.
Swapping httpx for curl_cffi with impersonate="chrome120" dropped the failure rate from twenty percent to about three. The last three percent were sites running full Bot Management with behavioral checks, and at that point I made a call: those domains weren't worth a custom Playwright rig with stealth patches, so I routed just that slice through a scraping API and moved on with my life.
The lesson I keep relearning: read the body first. I wasted half a day and real money guessing, when two minutes of looking at the actual response would have pointed me straight at the fix.
When to Stop Fighting and Use a Scraping API
There's a point of diminishing returns, and recognizing it is a skill in itself. If a site runs Cloudflare Bot Management, Akamai Bot Manager, Imperva, or DataDome, you are in an arms race against full-time engineers whose entire job is catching scrapers. Maintaining your own residential proxy pool, keeping TLS fingerprint patches current, solving captchas, and chasing behavioral detection can quietly cost more than the data is worth.
The practical alternative is a scraping API: a managed service that handles proxies, headless browsers, anti-bot bypass, and retries behind a single HTTP call. Context.dev's raw HTML scraping API and its scrape to clean Markdown endpoint do exactly that. You pass a URL, and you get the content back after the service deals with the anti-bot mess for you. If you need to walk an entire site, the website crawling API handles the queue and pagination too.
```bash
curl "https://api.context.dev/v1/scrape/markdown?url=https://example.com/products" \
  -H "Authorization: Bearer YOUR_API_KEY"
```

You get clean Markdown or raw HTML back without writing a line of proxy rotation, backoff logic, or browser automation. I'm obviously biased here, but the honest version is this: if you're spending more time fighting blocks than using the data, the trade is worth it. If the sites you scrape are lightly protected, the DIY code in this post is genuinely all you need, and you should keep your money. Try the free tier and see which camp your targets fall into before committing more time. For a broader picture of where this fits, the guide to building an AI web research agent shows the API as one piece of a larger pipeline.
Frequently Asked Questions
Why do I get a 403 when scraping but not in my browser?
Your browser sends a fuller and more specific set of headers than requests or httpx send by default, including a realistic User-Agent, Accept, Accept-Language, and the Sec-Fetch-* family. Sites also check whether your IP is residential or a known data-center range. Your browser runs on a residential ISP connection; your scraper probably runs from a cloud VM with a flagged IP. Add realistic headers first, and if that isn't enough, route through rotating residential proxies.
How do I fix 429 errors when web scraping?
Read the Retry-After header, which tells you how many seconds to wait. If it's absent, use exponential backoff with jitter: wait base * 2^attempt seconds, randomized so parallel workers don't retry in sync. Lower your concurrency and add random delays between requests. The fetch_with_backoff function above implements all of it, and the TokenBucket throttle helps you avoid the 429 in the first place.
What is a 520 error in web scraping?
A 520 is a Cloudflare-specific code meaning the origin returned an unexpected response. For scrapers it almost always means Cloudflare's bot detection blocked you on your TLS fingerprint, IP reputation, or behavior, then returned a 520 instead of a 403. Fix it with curl_cffi to mimic Chrome's TLS fingerprint, or route through a scraping API that keeps clean, trusted IPs.
How do I know if a 503 is a real outage or an anti-bot challenge?
Print the response body. A real 503 is usually short and generic. A challenge page contains text like "Checking your browser" or "Just a moment," and the response often carries a cf-mitigated header. The is_challenge_page helper above checks the body for these phrases. Real outages get exponential backoff; challenge pages need a headless browser or a scraping API.
Is it legal to bypass these errors?
Bypassing anti-bot measures sits in a genuine gray area that depends on the site's terms of service, the data you're collecting, and your jurisdiction. Public data is generally treated more permissively than data behind a login, and personal data carries extra obligations under laws like GDPR. I'm a developer, not a lawyer, so the responsible answer is to review the target's terms and get legal advice before scraping anything sensitive or at scale.
How do I avoid getting blocked entirely?
The main levers are: look like a browser (realistic User-Agent, full header set, session cookies), pace yourself (rate-limit, add random delays, cap concurrency), use residential IPs through a rotating proxy service, and handle JavaScript-gated sites with a headless browser or a managed API. Do all four and you'll dodge the large majority of blocks. The guide to rotating proxies in web scraping is a good next read for the proxy side of that setup.
Wrapping Up
HTTP errors in web scraping aren't random, and once you stop treating them that way they get a lot easier to handle. A 403 means you need better headers or a cleaner IP. A 429 means you need backoff and pacing. A 503 might be a transient overload or a JavaScript challenge in disguise, so check the body. A 520 is Cloudflare fingerprinting you at the TLS layer. And a 200 isn't always a win, so validate the content you actually got.
The single most useful habit I can leave you with is the boring one: print the response body before you change anything. Most of the time, the site is telling you exactly what's wrong, and the fix is right there in the bytes. When the anti-bot stack is too aggressive to fight cost-effectively, Context.dev's scraping API is a clean escape hatch with no proxy management and no browser maintenance. Try it free and spend your time on the data instead of the delivery.