What is BeautifulSoup?

A Python library for parsing HTML and XML and extracting data from them through a friendly, forgiving API.

Also known as: bs4

BeautifulSoup wraps an underlying parser (lxml, html.parser, or html5lib) and exposes the resulting tree through a Pythonic interface: soup.find("h1"), soup.select(".price"), tag["href"]. The API is gentle on real-world HTML, which is to say it does not throw a fit when a page is missing closing tags or has nested forms, both of which are common in the wild.
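A minimal sketch of the three access patterns named above, run against deliberately sloppy markup. The HTML snippet, the class name "price", and the URLs are invented for illustration; html.parser is used only because it ships with Python.

```python
from bs4 import BeautifulSoup

# Deliberately sloppy markup: the <p> and <li> tags are never closed.
html = """
<html><body>
  <h1>Widget Store</h1>
  <p>Today's picks
  <ul>
    <li><a href="/widgets/1">Blue widget</a> <span class="price">$4.00</span>
    <li><a href="/widgets/2">Red widget</a> <span class="price">$5.50</span>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag (or None if there is no match).
heading = soup.find("h1").get_text(strip=True)

# select() takes a CSS selector and returns a list of matching tags.
prices = [tag.get_text(strip=True) for tag in soup.select(".price")]

# Attributes are read with dictionary-style indexing on the tag.
links = [a["href"] for a in soup.find_all("a")]

print(heading)
print(prices)
print(links)
```

The missing closing tags change how the tree is nested (and that nesting can differ between parsers), but the lookups still find every heading, price, and link.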

For most scraping work, BeautifulSoup pairs naturally with the requests library: fetch the HTML, hand the bytes to BeautifulSoup, walk the soup. It is the lingua franca of Python tutorials on scraping for a reason. Its performance ceiling is lower than that of lxml used directly or of selectolax, so heavy crawlers eventually outgrow it; for individual pages or notebook work, the speed difference rarely matters.
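The fetch, parse, walk flow might look roughly like the sketch below. The function names extract_links and fetch_links are hypothetical, and the 10-second timeout is an arbitrary choice. Parsing is kept separate from fetching so the parsing half can be exercised without network access.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return (text, absolute_url) pairs for every <a> that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a.get_text(strip=True), urljoin(base_url, a["href"]))
        for a in soup.find_all("a", href=True)
    ]


def fetch_links(url):
    """Fetch a page with requests, then extract its links."""
    # Imported lazily so extract_links() works even without requests installed.
    import requests

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    # response.content is raw bytes; BeautifulSoup works out the encoding itself.
    return extract_links(response.content, url)
```

Calling fetch_links with a real URL returns the page's links as absolute URLs; extract_links can also be fed HTML from a file or a cache.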

BeautifulSoup does not execute JavaScript and does not fetch pages itself. If the target is a single-page application (SPA), pair it with a headless-browser renderer (Playwright, Puppeteer) and parse the rendered HTML; for async fetching at scale, pair it with httpx or aiohttp.
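Because BeautifulSoup only parses, the fetching layer can be swapped freely. The sketch below assumes an async callable named fetch that returns HTML for a URL; scrape_titles, scrape_with_httpx, and the concurrency limit of 5 are all hypothetical choices. The httpx-based fetcher is one option, and a browser-backed fetcher returning rendered HTML could be passed in the same way.

```python
import asyncio

from bs4 import BeautifulSoup


def page_title(html):
    """Return the <title> text, or None if the page has no title."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.get_text(strip=True) if soup.title else None


async def scrape_titles(urls, fetch, max_concurrency=5):
    """Fetch every URL with the async callable `fetch` and parse its title."""
    limit = asyncio.Semaphore(max_concurrency)

    async def one(url):
        # The semaphore caps how many fetches are in flight at once.
        async with limit:
            html = await fetch(url)
        return url, page_title(html)

    return dict(await asyncio.gather(*(one(u) for u in urls)))


async def scrape_with_httpx(urls):
    """One possible fetcher, assuming httpx is installed."""
    import httpx  # lazy import: only needed when fetching over the network

    async with httpx.AsyncClient(timeout=10, follow_redirects=True) as client:

        async def fetch(url):
            response = await client.get(url)
            response.raise_for_status()
            return response.text

        return await scrape_titles(urls, fetch)
```

Parsing itself is synchronous and CPU-bound, so for very large crawls the parse step, not the fetch, tends to become the part worth moving to a faster parser.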

In the wild

  • A one-off script pulling article titles from a news site
  • Cleaning scraped HTML before feeding it into an LLM context
  • Extracting tables from a Wikipedia dump
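As an illustration of the second use case, here is a minimal sketch of stripping a page down to readable text before passing it to an LLM. The function name html_to_text and the list of tags to drop are assumptions; real pages often need extra rules for navigation, footers, and cookie banners.

```python
from bs4 import BeautifulSoup


def html_to_text(html):
    """Drop non-content elements and return whitespace-normalized text."""
    soup = BeautifulSoup(html, "html.parser")

    # Calling the soup with a list of names is shorthand for find_all().
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()  # remove the tag and everything inside it

    # Join text nodes with spaces, then collapse runs of whitespace.
    return " ".join(soup.get_text(separator=" ").split())
```

This flattens the page into one line of text. If paragraph boundaries matter for the prompt, a newline separator with per-line cleanup is an alternative.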

FAQ

BeautifulSoup or lxml directly?

BeautifulSoup is friendlier; lxml is faster and stricter. Use lxml when you are parsing millions of documents and the API gymnastics are worth the speedup.

Does BeautifulSoup support CSS selectors?

Yes, via soup.select() and soup.select_one(), which delegate to the soupsieve library and support most CSS selectors, from Level 1 through much of the Level 4 drafts.
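A short sketch of both methods with a few selector styles. The markup, class names, and data-sku attribute are invented for the example.

```python
from bs4 import BeautifulSoup

html = """
<div id="catalog">
  <article class="product featured" data-sku="A1">
    <h2>Blue widget</h2><span class="price">$4.00</span>
  </article>
  <article class="product" data-sku="B2">
    <h2>Red widget</h2><span class="price">$5.50</span>
  </article>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# select() returns a list; ">" is the child combinator.
names = [h2.get_text() for h2 in soup.select("#catalog > article.product > h2")]

# select_one() returns the first match only.
featured_price = soup.select_one("article.featured .price").get_text()

# Attribute selectors work too: SKUs that start with "B".
skus = [a["data-sku"] for a in soup.select('article[data-sku^="B"]')]

# select_one() returns None when nothing matches, so check before using it.
missing = soup.select_one(".does-not-exist")

print(names, featured_price, skus, missing)
```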

Can BeautifulSoup parse broken HTML?

Yes. Choose html5lib for the most lenient browser-like behavior, lxml for speed on mostly-valid markup, or html.parser for zero extra dependencies.
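The parser is chosen by the second argument to the BeautifulSoup constructor. The sketch below tries parsers in a preferred order and falls back when one is not installed. The helper name make_soup and the default order are assumptions, not a recommendation from the library.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse with the first available parser; return (soup, parser_name)."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # Raised when the requested parser library is not installed.
            continue
    raise RuntimeError("none of the requested parsers is available")


# Broken markup: neither <p> nor <b> is closed.
soup, used = make_soup("<p>Unclosed <b>bold")
print(used, soup.find("b").get_text())
```

Different parsers can build different trees from the same broken markup, so it is worth pinning one parser explicitly once a scraper's selectors depend on the tree shape.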
