How to Fix HTTP Errors When Web Scraping (403, 429, 503, 520)

I've written a lot of scrapers, and almost all of them have failed in the same way at some point. The script runs fine on Monday. By Wednesday every request comes back as a 403 Forbidden or a 429 Too Many Requests, and I'm staring at a stack trace wondering what changed. Usually nothing changed on my end. The site just got tired of me.

That's the thing most tutorials skip: HTTP errors when web scraping aren't random noise. They're the target site talking to you. Sometimes it's a server hiccup. More often it's an anti-bot layer telling you that your requests look automated, arrive too fast, or come from an IP it doesn't trust. Once you learn to read these signals, debugging gets a lot less painful.

This guide walks through the HTTP errors that web scraping practitioners actually hit in the wild. For each one I'll explain why it fires, show how to confirm what you're dealing with, and give you code you can paste into a real project. If your end goal is feeding clean data to a model, the roundup of web scraping APIs built for AI is a useful companion. Otherwise, let's get into the errors.

A Quick Mental Model: Three Buckets

Before the code, it helps to sort every status code into one of three buckets. I do this in my head every time a request fails, because it tells me whether to retry, fix something, or give up.

  1. You broke the request. The server received it fine, but your headers, body, or auth were wrong. These are on you to fix. Think 400, 401, 422.
  2. The site blocked you. Your request looked automated or abusive, so the anti-bot layer rejected it. This is the interesting bucket. Think 403, 429, and a lot of disguised 503 and 520 responses.
  3. The server had a problem. Nothing to do with you. The origin is overloaded, down, or timing out. Just retry sensibly. Think 500, 502, 521, 522, 524.

The reason this matters: retrying a request from bucket one will never work, no matter how many times you try. Retrying bucket three usually works on its own. Bucket two is where you spend your engineering effort, because the fix is almost always "look more like a real browser and slow down."
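
Here's a minimal sketch of that triage as code. The function name bucket_for, the label strings, and the exact code sets are my own assumptions for illustration, not an official taxonomy. I give 404 and 410 a separate "skip" label because neither retrying nor fixing helps, and 503 gets "inspect-body" because it can belong to bucket two or bucket three.

```python
# Hypothetical triage helper: the sets below are one reading of the three
# buckets, so adjust them to what your targets actually return.
FIX_REQUEST = {400, 401, 407, 422}          # bucket one: fix the request
BLOCKED = {403, 429, 451, 520}              # bucket two: the site blocked you
SERVER_SIDE = {408, 500, 502, 504, 521, 522, 524}  # bucket three: retry
SKIP = {404, 410}                           # nothing to fetch, move on


def bucket_for(status: int) -> str:
    """Return a coarse next action for an HTTP status code."""
    if 200 <= status < 300:
        return "ok"
    if status in SKIP:
        return "skip"
    if status in FIX_REQUEST:
        return "fix-request"
    if status in BLOCKED:
        return "blocked"
    if status == 503:
        # Could be real overload or an anti-bot challenge; read the body
        return "inspect-body"
    if status in SERVER_SIDE or 500 <= status < 600:
        return "retry"
    return "unknown"


for code in (403, 422, 502, 503):
    print(code, "->", bucket_for(code))
```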

HTTP Status Codes at a Glance: a Scraper's Reference

Here's the cheat sheet I keep open. Not every error means you're blocked, and treating them all the same is how people waste hours debugging the wrong thing.

| Status Code | Name | What it usually means for scrapers | Blocked? |
|---|---|---|---|
| 400 | Bad Request | Malformed headers or request body | Sometimes, check your payload |
| 401 | Unauthorized | Missing credentials or auth token | Yes, you need to authenticate first |
| 403 | Forbidden | Anti-bot block, bad UA, bad IP, geo-block | Yes |
| 404 | Not Found | Page doesn't exist or was removed | No, skip and move on |
| 407 | Proxy Auth Required | Your proxy needs credentials | Yes, fix proxy config |
| 408 | Request Timeout | Server gave up waiting for your request | No, transient, retry |
| 410 | Gone | Resource permanently deleted | No, remove from queue |
| 422 | Unprocessable Entity | Server rejects your params or body | Sometimes, check your POST body |
| 429 | Too Many Requests | Rate-limited: too many requests too fast | Yes |
| 451 | Unavailable For Legal Reasons | Content geo-blocked or removed by law | Yes, need different IP or skip |
| 500 | Internal Server Error | Server bug, retry is fine | No, transient, retry with backoff |
| 502 | Bad Gateway | Upstream server gave a bad response | No, transient, retry |
| 503 | Service Unavailable | Overloaded or anti-bot challenge | Sometimes |
| 504 | Gateway Timeout | Upstream server took too long | No, transient, retry |
| 520 | Unknown Error (Cloudflare) | Origin returned unexpected response | Yes |
| 521 | Web Server Is Down (Cloudflare) | Origin is unreachable | No, site is down, retry later |
| 522 | Connection Timed Out (Cloudflare) | TCP handshake timeout to origin | No, transient, retry |
| 524 | A Timeout Occurred (Cloudflare) | Origin took too long to respond | No, transient, retry |

The short version: 4xx codes usually mean the server understood your request and refused it, often because it decided you're a bot. 5xx codes usually mean a server-side problem, frequently transient, but occasionally an anti-bot challenge wearing a 503 costume. We'll get to that disguise later because it trips people up constantly.

How to Actually Diagnose a Block

Most people, myself included on a bad day, read the status code and immediately start guessing. Don't do that. Spend two minutes gathering evidence first. It's faster in the long run.

When a request fails, I check three things before I change a single line of scraping logic.

First, print the response body. A status code tells you almost nothing on its own. The body tells you everything. A real 403 from a CDN often includes a Cloudflare ray ID or a "Sorry, you have been blocked" page. A challenge page says "Checking your browser" or "Just a moment." A genuine application error returns a JSON message. You cannot tell these apart from the status code alone.

import httpx
 
resp = httpx.get("https://example.com/products")
print("status:", resp.status_code)
print("server:", resp.headers.get("server"))
print("body preview:", resp.text[:600])

Second, look at the response headers. The Server, CF-RAY, Retry-After, and cf-mitigated headers reveal who's blocking you and why. If you see server: cloudflare, you now know which anti-bot vendor you're up against. If you see Retry-After, the server is literally telling you how long to wait.
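
A small helper makes that header check repeatable. Treat this as a sketch under assumptions: summarize_block_signals is a hypothetical name, and it only pulls out the four headers mentioned above from any mapping of header names to values.

```python
from typing import Dict, Mapping, Optional


def summarize_block_signals(headers: Mapping[str, str]) -> Dict[str, Optional[str]]:
    """Collect the response headers that hint at who blocked a request and why."""
    # Lowercase the names so the lookup works for plain dicts as well as
    # case-insensitive header objects.
    lowered = {name.lower(): value for name, value in headers.items()}
    return {
        "server": lowered.get("server"),
        "cf_ray": lowered.get("cf-ray"),
        "cf_mitigated": lowered.get("cf-mitigated"),
        "retry_after": lowered.get("retry-after"),
    }


# Example with a hand-written header set (values are made up)
sample = {"Server": "cloudflare", "CF-RAY": "abc123-LHR", "Retry-After": "30"}
for name, value in summarize_block_signals(sample).items():
    print(f"{name}: {value}")
```

In a real run you would pass resp.headers instead of the sample dict. A server value of cloudflare or a non-empty cf_ray tells you Cloudflare is in the path, and retry_after tells you how long the server wants you to wait.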

Third, reproduce it in curl. This is the step people skip, and it's the most useful one. If you can reproduce the failure with a plain curl command, you can iterate quickly without rerunning your whole pipeline. Copy the request as cURL straight from your browser's DevTools Network tab, run it, and see if it succeeds where your scraper fails. The difference between those two requests is your bug.

curl -sS -D - "https://example.com/products" \
  -H "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36" \
  -o /dev/null

Once you've got evidence, you can match it to the sections below. Now the fun part.

403 Forbidden: You Look Like a Bot

Why You're Getting It

A 403 Forbidden during web scraping is almost never a traditional permissions error. It's an active block. The site, or more accurately its CDN or anti-bot layer, looked at your request and decided it didn't come from a real browser. Here are the usual culprits, roughly in the order I check them:

  • A missing or unrealistic User-Agent. python-requests/2.31.0 is a flashing neon sign that says "I am a script." Real browsers send long, specific UA strings.
  • Missing secondary headers. Browsers send Accept, Accept-Language, Accept-Encoding, the Sec-Fetch-* family, and Cache-Control. A bare requests.get() sends almost none of that, and the gap is easy to detect.
  • No cookies or session state. Plenty of sites check whether you accepted a consent banner or carry a session token from a previous page. Show up cold and you look suspicious.
  • A TLS fingerprint that doesn't match. Python's ssl library and Chrome produce different TLS Client Hello messages. Anti-bot services like Cloudflare fingerprint this (JA3/JA4) during the TLS handshake, before they even read your HTTP headers.
  • A bad IP reputation. Data-center ranges (AWS, GCP, DigitalOcean) are heavily flagged because that's where scrapers live. Residential IPs get the benefit of the doubt.
  • Geo-restrictions. Some pages only serve specific countries, and everyone else gets a 403.

I've seen all six of these in production. The good news is they stack from cheap to expensive, so you fix them in order and stop as soon as the block clears.

The Fix

Step 1: Send realistic headers. This alone clears a surprising number of 403s. Open DevTools, find the request in the Network tab, copy the headers a real browser sent, and feed them to your client.

import httpx
 
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/125.0.0.0 Safari/537.36"
    ),
    "Accept": (
        "text/html,application/xhtml+xml,application/xml;"
        "q=0.9,image/avif,image/webp,*/*;q=0.8"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    # httpx only decodes "br" if the brotli (or brotlicffi) package is
    # installed; drop "br" here if it isn't, or resp.text may be unreadable.
    "Accept-Encoding": "gzip, deflate, br",
    "DNT": "1",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
}
 
with httpx.Client(headers=HEADERS, follow_redirects=True) as client:
    resp = client.get("https://example.com/products")
    resp.raise_for_status()
    print(resp.text[:500])

One detail people miss: header order matters to some fingerprinting systems. Browsers send headers in a consistent order, and a few WAFs flag requests where the order looks wrong. Most of the time you won't need to worry about it, but if realistic header values aren't enough, header ordering is the next thing to investigate.

Step 2: Keep a session and seed cookies. If the target sets cookies on its homepage, visit the homepage first so your later requests carry a legitimate-looking session.

import httpx
 
with httpx.Client(headers=HEADERS, follow_redirects=True) as client:
    # Seed cookies by visiting the home page first
    client.get("https://example.com/")
    # Now hit the target page with the seeded session
    resp = client.get("https://example.com/products")

Step 3: Switch to residential proxies. If headers and cookies don't help, your IP is the problem. A data-center IP is guilty until proven innocent. Route through rotating proxies backed by a residential pool, and a lot of 403s simply vanish.

import httpx
 
proxy = "http://user:pass@residential-proxy.example.com:8080"
 
with httpx.Client(headers=HEADERS, proxy=proxy, follow_redirects=True) as client:
    resp = client.get("https://example.com/products")

Step 4: Use a headless browser for JavaScript-gated 403s. Some 403s fire because the page runs JavaScript that inspects the browser environment before serving real content. No amount of header tuning fixes that, because requests and httpx never run the JavaScript. You need Playwright or Puppeteer.

from playwright.sync_api import sync_playwright
 
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context(
        user_agent=HEADERS["User-Agent"],
        locale="en-US",
        extra_http_headers={"Accept-Language": "en-US,en;q=0.9"},
    )
    page = context.new_page()
    page.goto("https://example.com/products")
    html = page.content()
    browser.close()

If you only need the fully rendered markup and don't want to babysit a browser, the guide on extracting raw HTML from any URL with a single API call covers the managed alternative to running Playwright yourself.

429 Too Many Requests: You're Being Rate-Limited

Why You're Getting It

A 429 Too Many Requests is the server politely asking you to slow down. Most production sites and APIs enforce rate limits: a cap on how many requests one IP or API key can make in a rolling window. Cross it and you get a 429. Many of these responses include a Retry-After header that tells you exactly how long to wait. Ignore it and keep hammering, and a 429 can quietly escalate into a 403 ban that sticks around for hours.

This is the single most common way amateur scrapers fall over at scale, and it's almost always the same mistake: a tight loop with no delays, hitting one IP, with a predictable URL pattern, at machine speed. No human browses like that, and the server knows it.

The Fix: Backoff With Jitter, and Respect Retry-After

The pattern that works is straightforward once you've written it once:

  1. If the response has a Retry-After header, wait that long before retrying.
  2. If it doesn't, use exponential backoff: wait base * 2^attempt seconds.
  3. Add jitter (random noise) so a fleet of workers doesn't all retry at the same instant and re-trigger the limit. This thundering-herd problem is real and I've caused it more than once.
  4. Cap the retries and raise an error once you hit the cap, instead of looping forever.

import time
import random
import httpx
 
def fetch_with_backoff(
    url: str,
    *,
    headers: dict | None = None,
    max_retries: int = 6,
    base_delay: float = 1.0,
    max_delay: float = 120.0,
    session: httpx.Client | None = None,
) -> httpx.Response:
    """
    Fetch a URL with automatic retry on 429 and 503.
    Respects Retry-After, uses exponential backoff with full jitter.
    """
    _client = session or httpx.Client(follow_redirects=True)
    try:
        for attempt in range(max_retries):
            resp = _client.get(url, headers=headers)
 
            if resp.status_code not in (429, 503):
                resp.raise_for_status()
                return resp
 
            # Default: exponential backoff with full jitter
            backoff = min(base_delay * (2 ** attempt), max_delay)
            sleep_for = random.uniform(0, backoff)
 
            # A numeric Retry-After is a floor: wait at least that long, plus a
            # little jitter so parallel workers don't all wake at once.
            retry_after = resp.headers.get("Retry-After")
            if retry_after:
                try:
                    floor = max(0.0, min(float(retry_after), max_delay))
                    sleep_for = floor + random.uniform(0, base_delay)
                except ValueError:
                    pass  # HTTP-date form; keep the jittered backoff
 
            print(f"[attempt {attempt + 1}] {resp.status_code}, sleeping {sleep_for:.1f}s")
            time.sleep(sleep_for)
 
        raise RuntimeError(f"Exceeded {max_retries} retries for {url}")
    finally:
        if session is None:
            _client.close()
 
 
# Usage
resp = fetch_with_backoff(
    "https://example.com/api/products",
    headers=HEADERS,
    base_delay=2.0,
)
print(resp.status_code)
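
The function above falls back to plain backoff when Retry-After arrives as an HTTP-date instead of a number of seconds. If you'd rather honor the date, a sketch like this converts either form to seconds. parse_retry_after is a hypothetical helper, not part of httpx, and the now parameter exists only so the function can be tested with a fixed clock.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: Optional[str], now: Optional[datetime] = None) -> Optional[float]:
    """Return seconds to wait, or None if the header is missing or unparseable."""
    if not value:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)  # delta-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        # Treat a date without an offset as UTC
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means "retry now", never a negative sleep
    return max(0.0, (when - now).total_seconds())


fixed_now = datetime(2015, 10, 21, 7, 27, 0, tzinfo=timezone.utc)
print(parse_retry_after("120"))
print(parse_retry_after("Wed, 21 Oct 2015 07:28:00 GMT", now=fixed_now))
```

Returning None for anything unparseable lets the caller keep its existing exponential backoff as the fallback.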

Throttle Before You Get Throttled

Backoff is reactive. It kicks in after you've already been told off. A better habit is to pace yourself so you rarely see a 429 in the first place. The cleanest way I know is a token-bucket throttle: you refill tokens at a fixed rate, and every request spends one. When the bucket is empty, you wait. It smooths your traffic into something that looks deliberate instead of frantic.

import time
import threading
import httpx
 
class TokenBucket:
    """Allow up to `rate` requests per second, with a small burst allowance."""
 
    def __init__(self, rate: float, burst: int = 1):
        self.rate = rate
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()
        self.lock = threading.Lock()
 
    def take(self) -> None:
        with self.lock:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens < 1:
                sleep_for = (1 - self.tokens) / self.rate
                time.sleep(sleep_for)
                # Account for the time we just slept so the next call doesn't
                # recount it and let pacing drift faster than `rate`.
                self.last = time.monotonic()
                self.tokens = 0
            else:
                self.tokens -= 1
 
 
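# One request every two seconds, no bursting
bucket = TokenBucket(rate=0.5, burst=1)
 
# Placeholder list; substitute the URLs you actually need to fetch
urls = [
    "https://example.com/products?page=1",
    "https://example.com/products?page=2",
]
 
for url in urls:
    bucket.take()
    resp = httpx.get(url, headers=HEADERS)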

Concurrency Limits

When you run workers in parallel, cap concurrency so you don't accidentally fire fifty simultaneous requests from one IP. In asyncio with httpx, a semaphore does the job, and I add a small random delay on top so the requests don't land in lockstep.

import asyncio
import random
import httpx
 
CONCURRENCY = 5  # max simultaneous in-flight requests
 
async def scrape_all(urls: list[str]) -> list[str]:
    sem = asyncio.Semaphore(CONCURRENCY)
 
    async def fetch(client: httpx.AsyncClient, url: str) -> str:
        async with sem:
            # small per-request delay in addition to the semaphore
            await asyncio.sleep(random.uniform(0.5, 1.5))
            r = await client.get(url, headers=HEADERS)
            r.raise_for_status()
            return r.text
 
    async with httpx.AsyncClient(follow_redirects=True) as client:
        tasks = [fetch(client, url) for url in urls]
        return await asyncio.gather(*tasks)

A good rule of thumb: start slow, measure, then speed up. It's much easier to dial concurrency up after you've confirmed the site tolerates it than to recover from an IP ban because you opened at full throttle.

503 Service Unavailable: Overloaded, or a Challenge in Disguise

Why You're Getting It

503 Service Unavailable is the ambiguous one. It can mean the server is genuinely overloaded and wants you to back off. It can also be an anti-bot challenge page wearing a 503 status code. Some CDNs and web application firewalls return a 503 with an HTML body that says "Checking your browser before you access" or "DDoS protection by [vendor]." If you print the body and see that kind of text, it isn't a real 503 at all. It's a JavaScript challenge, and no amount of backoff will solve it.

This is exactly why the diagnosis step earlier matters. The status code lies; the body doesn't.

The Fix

For a genuine 503 (real overload): retry with exponential backoff, the same way you handle a 429. The fetch_with_backoff function above already covers it, because it treats 503 and 429 identically.

For a challenge-page 503: detect it, then switch tools. A simple keyword heuristic catches most of them.

import httpx
 
CHALLENGE_SIGNALS = [
    "checking your browser",
    "ddos protection",
    "one more step",
    "please wait",
    "just a moment",
    "enable javascript",
]
 
def is_challenge_page(resp: httpx.Response) -> bool:
    if resp.status_code not in (503, 403):
        return False
    body_lower = resp.text.lower()
    return any(signal in body_lower for signal in CHALLENGE_SIGNALS)
 
 
resp = httpx.get("https://example.com", headers=HEADERS)
if is_challenge_page(resp):
    print("Challenge page detected, switch to headless browser or scraping API")
else:
    print(resp.text[:200])

Once you've confirmed it's a challenge, you have two real options: drive a headless browser (Playwright, as shown in the 403 section) or hand the request to a managed scraping API that solves challenges for you. Which one you pick depends on how often you hit it and how much maintenance you're willing to own, which is a tradeoff I'll come back to at the end.

520, 521, 522, 524: The Cloudflare Family

What They Mean

Cloudflare sits in front of a huge slice of the web, so you will run into its custom error codes constantly. They fire when Cloudflare can't successfully proxy your request to the origin server:

  • 520 Unknown Error. The origin returned an unexpected or empty response. For scrapers this usually means Cloudflare's bot detection flagged your TLS fingerprint, behavior, or IP reputation and returned a 520 instead of a clean 403.
  • 521 Web Server Is Down. Cloudflare is fine, but the origin's TCP port is closed. The site is genuinely down. Retry later.
  • 522 Connection Timed Out. Cloudflare's TCP handshake to the origin timed out. Usually transient, so retry with backoff.
  • 524 A Timeout Occurred. Cloudflare connected, but the origin took too long to respond. Also transient.

Three of those four (521, 522, 524) are not about you. The one that keeps scraper authors up at night is 520.

Why 520 Is the Tricky One

A 520 during scraping is almost always a fingerprinting problem rather than a real origin error. Cloudflare's Bot Management looks at signals that have nothing to do with your HTTP headers:

  1. TLS Client Hello (the JA3/JA4 fingerprint). Python's ssl module orders cipher suites differently than Chrome does. Cloudflare can block you on that alone, before reading a single header.
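  2. HTTP/2 fingerprint. Chrome sends its HTTP/2 settings, header ordering, and pseudo-header sequence in a distinctive pattern. httpx speaks HTTP/1.1 unless you opt in with http2=True, and even then its HTTP/2 traffic doesn't match Chrome's. Either mismatch is detectable.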
  3. Behavioral signals. No mouse movement, no scrolling, instant loads with none of the sub-resource fetching a real browser does.

So the fix isn't "send better headers." It's "look like a real browser at the network layer."

Use curl_cffi to impersonate Chrome's TLS fingerprint. This is the cheapest fix and it solves a real chunk of 520s on its own.

from curl_cffi import requests as cffi_requests
 
# impersonate="chrome120" patches the TLS and HTTP/2 fingerprint
resp = cffi_requests.get(
    "https://cloudflare-protected-site.com",
    impersonate="chrome120",
    headers={"Accept-Language": "en-US,en;q=0.9"},
)
print(resp.status_code, resp.text[:200])

If curl_cffi isn't enough, layer on stealth. Libraries like playwright-stealth mask the obvious headless tells: a missing plugin array, the wrong WebGL renderer, the navigator.webdriver flag that screams "automation." And if the site runs full Cloudflare Bot Management and you're still stuck, that's usually my signal that a managed scraping API is the better use of my time than another evening of fingerprint patching.

502 and 504: Gateway Errors

These two show up less in scraping write-ups, but you'll meet them. A 502 Bad Gateway means an upstream server returned an invalid response to the gateway. A 504 Gateway Timeout means the upstream server didn't respond in time. Neither is about your bot. Both are transient. Treat them exactly like a real 503: retry with exponential backoff and a sensible cap. If a particular URL returns 502 or 504 consistently across many retries and a long window, the origin probably has a genuine problem, so log it and move on rather than burning your whole retry budget on one dead endpoint.

The 200 OK That Is Actually a Block

Here's the failure mode that fooled me the longest. You get a 200 OK, your code happily moves on, and a week later you discover your database is full of garbage. The status was fine. The body was a captcha page, a login wall, or an empty shell that loads its real content over an API you never called.

These soft blocks are nastier than a clean 403 because nothing throws an error. You have to validate the content yourself. I now add a sanity check to every scraper: does the page contain a thing I actually expect? A product page should have a price. An article should have more than 500 characters of body text. If the expected marker is missing, I treat the 200 as a failure and route it through the same challenge-handling path as a 503.

import httpx
 
 
def looks_like_real_content(html: str, must_contain: list[str]) -> bool:
    lowered = html.lower()
    if len(html) < 1000:
        return False
    return all(marker.lower() in lowered for marker in must_contain)
 
 
resp = httpx.get("https://example.com/products/123", headers=HEADERS)
if resp.status_code == 200 and not looks_like_real_content(resp.text, ["add to cart", "price"]):
    print("200 OK but content is missing, likely a soft block")

Don't trust the status code alone. Trust the bytes you got back.

Other Status Codes You'll Run Into

401 Unauthorized and 407 Proxy Authentication Required

A 401 means the endpoint wants authentication (an API key, an OAuth token, or a session cookie) and you didn't provide it. Add the right Authorization header, or log in first to obtain a session cookie, then reuse it.

A 407 is the same idea one layer down: your proxy server itself wants credentials. Add Proxy-Authorization, or embed user:pass@host:port in the proxy URL so the credentials travel with every request.

404 Not Found and 410 Gone

A 404 means the URL doesn't exist. Maybe the site restructured, the product sold out, or the page was deleted. Don't retry it. Log the URL and skip it. Retrying a 404 is pure wasted budget.

A 410 Gone is a 404 with conviction. The server is explicitly saying the resource is gone for good, so remove it from your crawl queue entirely and don't revisit.

451 Unavailable For Legal Reasons

A 451 means content was withheld because of a legal demand: a DMCA takedown, a court order, a regional regulation. The only real fix is an IP in a jurisdiction where the content isn't restricted. Often the honest answer is that the content is genuinely unavailable to you, so skip it.

422 Unprocessable Entity

A 422 usually fires on POST requests where your body is structurally valid JSON but fails the server's business rules. Check that your fields, required parameters, and data types match what the API expects. The good news is that most 422 responses include a machine-readable error in the body explaining what went wrong, so print resp.text and read it. The answer is almost always right there.

Being a Polite Scraper: robots.txt and Per-Domain Pacing

A lot of blocks are self-inflicted, and the cure is courtesy. Sites are far less likely to fight you when your traffic looks reasonable, so a little restraint goes a long way toward keeping your IPs clean.

Start by reading robots.txt. It's not legally binding everywhere, but it tells you which paths the site would rather you leave alone, and some robots.txt files specify a Crawl-delay that hints at how often you can request without annoying anyone. Python's standard library can parse it for you.

import urllib.robotparser
 
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()
 
if rp.can_fetch("MyScraper/1.0", "https://example.com/products"):
    print("Allowed by robots.txt")
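
If the file declares a Crawl-delay, the same parser exposes it through crawl_delay(). The sketch below parses an inline robots.txt string so it runs without touching the network. The sample rules and the 1-second fallback are assumptions for illustration.

```python
import urllib.robotparser

# Made-up robots.txt content; a real scraper would call set_url() and read()
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /admin/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

agent = "MyScraper/1.0"

# crawl_delay() returns None when no Crawl-delay applies to this agent,
# so fall back to a conservative default (1 second is an assumption)
delay = rp.crawl_delay(agent) or 1.0

print("delay between requests:", delay)
print("products allowed:", rp.can_fetch(agent, "https://example.com/products"))
print("admin allowed:", rp.can_fetch(agent, "https://example.com/admin/users"))
```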

Beyond that, pace yourself per domain, not globally. If you're crawling twenty sites at once, a single global rate limit either crawls everyone too slowly or hits one small site too hard. Track a separate throttle per hostname so a fragile site gets gentle treatment while a robust one gets the throughput it can handle. The TokenBucket from the 429 section works nicely here: keep a dictionary of buckets keyed by hostname and call take() on the right one before each request.
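
Here's a rough sketch of per-hostname pacing. To keep it self-contained I've swapped the token bucket for a simpler minimum-interval limiter, so this is a stand-in for the idea and not the TokenBucket itself. HostThrottle, its parameters, and the example hostnames are all hypothetical, and the class is not thread-safe.

```python
import time
from typing import Dict, Optional
from urllib.parse import urlsplit


class HostThrottle:
    """Enforce a minimum gap between requests to the same hostname."""

    def __init__(
        self,
        default_interval: float = 2.0,
        overrides: Optional[Dict[str, float]] = None,
    ) -> None:
        self.default_interval = default_interval
        self.overrides = overrides or {}  # per-host intervals, in seconds
        self._ready_at: Dict[str, float] = {}  # hostname -> earliest next request

    def wait(self, url: str) -> float:
        """Sleep until the URL's host may be hit again; return seconds slept."""
        host = urlsplit(url).hostname or ""
        interval = self.overrides.get(host, self.default_interval)
        now = time.monotonic()
        delay = max(0.0, self._ready_at.get(host, now) - now)
        if delay > 0:
            time.sleep(delay)
        # Schedule the next slot relative to when this request goes out
        self._ready_at[host] = now + delay + interval
        return delay


# A fragile site gets a longer gap than everything else
throttle = HostThrottle(default_interval=0.2, overrides={"fragile.example": 0.5})
for target in ("https://robust.example/a", "https://fragile.example/a", "https://robust.example/b"):
    slept = throttle.wait(target)
    print(f"{target}: waited {slept:.2f}s")
```

Only the second request to robust.example waits, because each hostname keeps its own clock. Call wait() right before each request.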

Caching helps too. If your pipeline re-requests the same URL during development (and it always does), cache responses locally so you're not re-hitting the live site every time you tweak a parser. Fewer requests means fewer chances to get blocked, and your tests run faster as a bonus.
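
A development cache can be a few lines. This is a minimal sketch, assuming you're happy to key files by a hash of the URL and never expire them. cached_fetch and its fetch callback are hypothetical names; in a real scraper the callback would wrap your HTTP call.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


def cached_fetch(url: str, fetch: Callable[[str], str], cache_dir: Path) -> str:
    """Return the cached body for `url`, calling `fetch` only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so query strings and slashes can't produce invalid filenames
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.html"
    if path.exists():
        return path.read_text(encoding="utf-8")
    body = fetch(url)
    path.write_text(body, encoding="utf-8")
    return body


# Demo with a fake fetcher so nothing touches the network
def fake_fetch(url: str) -> str:
    print("fetching", url)
    return "<html>demo</html>"


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "cache"
    cached_fetch("https://example.com/products", fake_fetch, cache)  # miss
    cached_fetch("https://example.com/products", fake_fetch, cache)  # hit
```

The second call reads from disk, so fake_fetch only runs once. Delete the cache directory when you want fresh pages.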

A Reusable Resilient Fetch Pattern

Let's pull the useful pieces together: realistic headers, backoff with jitter, optional proxy rotation, and challenge detection in one function I can drop into a project and forget about.

import time
import random
import httpx
from typing import Sequence
 
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/125.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    # "br" needs the brotli (or brotlicffi) package installed for httpx to decode it
    "Accept-Encoding": "gzip, deflate, br",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
}
 
CHALLENGE_SIGNALS = [
    "checking your browser",
    "ddos protection",
    "just a moment",
    "enable javascript",
    "one more step",
]
 
RETRYABLE_CODES = {429, 500, 502, 503, 504, 521, 522, 524}
 
 
def resilient_fetch(
    url: str,
    proxy_list: Sequence[str] | None = None,
    max_retries: int = 5,
    base_delay: float = 1.5,
    max_delay: float = 90.0,
    extra_headers: dict | None = None,
) -> httpx.Response:
    """
    Fetch a URL with realistic browser headers, optional rotating proxies,
    exponential backoff with jitter on retryable codes, and challenge detection.
    """
    headers = {**BROWSER_HEADERS, **(extra_headers or {})}
    proxy_pool = list(proxy_list) if proxy_list else [None]
 
    for attempt in range(max_retries):
        # Rotate through proxy pool
        proxy_url = proxy_pool[attempt % len(proxy_pool)]
 
        try:
            with httpx.Client(
                headers=headers,
                proxy=proxy_url,
                follow_redirects=True,
                timeout=20.0,
            ) as client:
                resp = client.get(url)
        except (httpx.ConnectError, httpx.TimeoutException) as exc:
            print(f"[attempt {attempt + 1}] Network error: {exc}")
            _backoff(attempt, base_delay, max_delay)
            continue
 
        # Only treat 403/503 bodies as possible challenge pages. Scanning every
        # response (including 200) would false-positive on real HTML that happens
        # to contain phrases like "just a moment".
        if resp.status_code in (403, 503) and any(
            sig in resp.text.lower() for sig in CHALLENGE_SIGNALS
        ):
            raise RuntimeError(
                f"Challenge page on {url}, use headless browser or scraping API"
            )
 
        if resp.status_code not in RETRYABLE_CODES:
            resp.raise_for_status()
            return resp
 
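        print(f"[attempt {attempt + 1}] {resp.status_code}, retrying")
 
        # A numeric Retry-After is a floor: wait at least that long, then add a
        # little jitter. Without one, fall back to jittered exponential backoff.
        retry_after = resp.headers.get("Retry-After")
        try:
            floor = max(0.0, min(float(retry_after), max_delay)) if retry_after else None
        except ValueError:
            floor = None  # HTTP-date form; use backoff instead
        if floor is not None:
            time.sleep(floor)
            _backoff_for(base_delay)
        else:
            _backoff(attempt, base_delay, max_delay)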
 
    raise RuntimeError(f"All {max_retries} retries exhausted for {url}")
 
 
def _backoff(attempt: int, base: float, cap: float) -> None:
    wait = min(base * (2 ** attempt), cap)
    _backoff_for(wait)
 
 
def _backoff_for(max_wait: float) -> None:
    time.sleep(random.uniform(0, max_wait))
 
 
# --- Example usage ---
PROXIES = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]
 
resp = resilient_fetch("https://example.com/products", proxy_list=PROXIES)
print(resp.status_code, len(resp.text))

For setting up the proxy pool itself, the guide to rotating proxies in web scraping covers sourcing and rotation strategy. If your end goal is feeding an LLM, the list of web scraping APIs built for AI is worth a look, and the writeup on extracting raw HTML from any URL shows how to skip the retry plumbing entirely.

A Real Debugging Session

Let me make this concrete with a job I actually did. I was scraping a company's address from the web for a few thousand domains, and about a fifth of them came back 403. My first instinct was the usual one: blame the IP. So I bought residential proxies, rerouted, and the failure rate barely moved. Money down the drain.

Then I went back to basics and printed the response headers and body. Every failing site returned server: cloudflare and a "Just a moment" page. It wasn't an IP problem at all. It was a TLS fingerprint problem. My httpx requests advertised a cipher order no real Chrome would send, and Cloudflare was bouncing them at the network layer before my fancy proxies ever mattered.

Swapping httpx for curl_cffi with impersonate="chrome120" dropped the failure rate from twenty percent to about three. The last three percent were sites running full Bot Management with behavioral checks, and at that point I made a call: those domains weren't worth a custom Playwright rig with stealth patches, so I routed just that slice through a scraping API and moved on with my life.

The lesson I keep relearning: read the body first. I wasted half a day and real money guessing, when two minutes of looking at the actual response would have pointed me straight at the fix.

When to Stop Fighting and Use a Scraping API

There's a point of diminishing returns, and recognizing it is a skill in itself. If a site runs Cloudflare Bot Management, Akamai Bot Manager, Imperva, or DataDome, you are in an arms race against full-time engineers whose entire job is catching scrapers. Maintaining your own residential proxy pool, keeping TLS fingerprint patches current, solving captchas, and chasing behavioral detection can quietly cost more than the data is worth.

The practical alternative is a scraping API: a managed service that handles proxies, headless browsers, anti-bot bypass, and retries behind a single HTTP call. Context.dev's raw HTML scraping API and its scrape to clean Markdown endpoint do exactly that. You pass a URL, and you get the content back after the service deals with the anti-bot mess for you. If you need to walk an entire site, the website crawling API handles the queue and pagination too.

curl "https://api.context.dev/v1/scrape/markdown?url=https://example.com/products" \
  -H "Authorization: Bearer YOUR_API_KEY"

You get clean Markdown or raw HTML back without writing a line of proxy rotation, backoff logic, or browser automation. I'm obviously biased here, but the honest version is this: if you're spending more time fighting blocks than using the data, the trade is worth it. If the sites you scrape are lightly protected, the DIY code in this post is genuinely all you need, and you should keep your money. Try the free tier and see which camp your targets fall into before committing more time. For a broader picture of where this fits, the guide to building an AI web research agent shows the API as one piece of a larger pipeline.
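
If you'd rather make that call from Python than from curl, a standard-library sketch looks like this. The endpoint and the Authorization header are copied from the curl example; build_scrape_request and fetch_markdown are hypothetical names, and the code makes no assumption about the response beyond returning the body as text. The one real difference from the curl line is that the target URL is percent-encoded, which matters once it carries its own query string.

```python
import urllib.parse
import urllib.request

# Endpoint taken from the curl example in this post.
API_BASE = "https://api.context.dev/v1/scrape/markdown"


def build_scrape_request(target_url, api_key):
    """Build the same request the curl example sends, with the target URL encoded."""
    query = urllib.parse.urlencode({"url": target_url})
    return urllib.request.Request(
        f"{API_BASE}?{query}",
        headers={"Authorization": f"Bearer {api_key}"},
    )


def fetch_markdown(target_url, api_key, timeout=30):
    """Send the request and return the raw response body as text.

    Assumption: the body is returned as-is; check the API docs for its exact shape.
    """
    request = build_scrape_request(target_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling fetch_markdown("https://example.com/products", "YOUR_API_KEY") mirrors the curl command above.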

Frequently Asked Questions

Why do I get a 403 when scraping but not in my browser?

Your browser sends a fuller, more realistic header set than requests or httpx do by default: a real browser User-Agent, Accept, Accept-Language, and the Sec-Fetch-* family. Sites also check whether your IP is residential or a known data-center range, and some fingerprint the TLS handshake itself. Your browser runs on a residential ISP connection; your scraper probably runs from a cloud VM with a flagged IP. Add realistic headers first. If that isn't enough, try a client that mimics a browser's TLS fingerprint, such as curl_cffi, before you pay for rotating residential proxies.

How do I fix 429 errors when web scraping?

Read the Retry-After header, which tells you how long to wait, either as a number of seconds or as an HTTP date. If it's absent, use exponential backoff with jitter: wait base * 2^attempt seconds, randomized so parallel workers don't retry in sync. Lower your concurrency and add random delays between requests. The fetch_with_backoff function above implements all of it, and the TokenBucket throttle helps you avoid the 429 in the first place.
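
The wait calculation on its own fits in a few lines. This is a sketch rather than a replacement for fetch_with_backoff: retry_after_seconds and wait_time are hypothetical names, and the randomization shown is the "full jitter" variant (a uniform draw between zero and the exponential ceiling), which is one reasonable choice among several. It accepts both forms of Retry-After, a number of seconds or an HTTP date.

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None):
    """Parse a Retry-After value (seconds or HTTP date). Returns None if unusable."""
    if not value:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        # Treat a date without zone information as UTC.
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means "retry now", not a negative sleep.
    return max(0.0, (when - now).total_seconds())


def wait_time(attempt, retry_after=None, base=1.0, cap=60.0, rng=random):
    """Seconds to sleep before retry number `attempt` (0-based)."""
    server_hint = retry_after_seconds(retry_after)
    if server_hint is not None:
        # The server said how long to wait, so honor it as given.
        return server_hint
    # Full jitter: uniform between 0 and min(cap, base * 2^attempt).
    ceiling = min(cap, base * 2 ** attempt)
    return rng.uniform(0, ceiling)
```

In a retry loop you would call time.sleep(wait_time(attempt, response.headers.get("Retry-After"))) and stop after a fixed number of attempts.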

What is a 520 error in web scraping?

A 520 is a Cloudflare-specific code meaning the origin returned an empty, unknown, or unexpected response. For scrapers it frequently shows up where you'd expect a 403: the request was rejected over its TLS fingerprint, IP reputation, or behavior, and what reaches you is a 520. Fix it with curl_cffi to mimic Chrome's TLS fingerprint, or route through a scraping API that keeps clean, trusted IPs.

How do I know if a 503 is a real outage or an anti-bot challenge?

Print the response body and headers. A real 503 is usually short and generic. A challenge page contains text like "Checking your browser" or "Just a moment," and the response often carries a cf-mitigated header. The is_challenge_page helper above checks for these signals. Real outages get exponential backoff; challenge pages need a headless browser or a scraping API.

Is it legal to bypass these errors?

Bypassing anti-bot measures sits in a genuine gray area that depends on the site's terms of service, the data you're collecting, and your jurisdiction. Public data is generally treated more permissively than data behind a login, and personal data carries extra obligations under laws like GDPR. I'm a developer, not a lawyer, so the responsible answer is to review the target's terms and get legal advice before scraping anything sensitive or at scale.

How do I avoid getting blocked entirely?

The main levers are: look like a browser (realistic User-Agent, full header set, session cookies), pace yourself (rate-limit, add random delays, cap concurrency), use residential IPs through a rotating proxy service, and handle JavaScript-gated sites with a headless browser or a managed API. Do all four and you'll dodge the large majority of blocks. The guide to rotating proxies in web scraping is a good next read for the proxy side of that setup.

Wrapping Up

HTTP errors in web scraping aren't random, and once you stop treating them that way they get a lot easier to handle. A 403 means you need better headers, a cleaner IP, or a browser-like TLS fingerprint. A 429 means you need backoff and pacing. A 503 might be a transient overload or a JavaScript challenge in disguise, so check the body. A 520 usually means Cloudflare flagged your TLS fingerprint. And a 200 isn't always a win, so validate the content you actually got.

The single most useful habit I can leave you with is the boring one: print the response body before you change anything. Most of the time, the site is telling you exactly what's wrong, and the fix is right there in the bytes. When the anti-bot stack is too aggressive to fight cost-effectively, Context.dev's scraping API is a clean escape hatch with no proxy management and no browser maintenance. Try it free and spend your time on the data instead of the delivery.
