Web Scraping & Crawling
What is BeautifulSoup?
A Python library for parsing HTML and XML and extracting data from it using a friendly, forgiving API.
Also known as: bs4
BeautifulSoup wraps an underlying parser (lxml, html.parser, or html5lib) and exposes the resulting tree through a Pythonic interface: soup.find("h1"), soup.select(".price"), tag["href"]. The API is gentle on real-world HTML, which is to say it does not throw a fit when a page is missing closing tags or has nested forms, both of which are common in the wild.
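A minimal sketch of that interface; the HTML snippet and the names in it are invented for illustration:

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <h1>Acme Corp</h1>
  <a class="price" href="/buy">$9.99</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

heading = soup.find("h1")        # first <h1> tag, or None if absent
prices = soup.select(".price")   # all elements matching a CSS selector
link = soup.find("a")["href"]    # attribute access via subscript

print(heading.get_text())   # Acme Corp
print(prices[0].get_text()) # $9.99
print(link)                 # /buy
```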
For most scraping work, BeautifulSoup pairs naturally with the requests library: fetch the HTML, hand the bytes to BeautifulSoup, walk the soup. It is the lingua franca of Python scraping tutorials for a reason. Its performance ceiling is lower than using lxml directly or selectolax, so heavy crawlers eventually outgrow it; for individual pages or notebook work, the speed difference rarely matters.
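The fetch-then-parse pattern looks like this; the network call is shown as a comment so the sketch runs offline, with a bytes literal standing in for the response body:

```python
from bs4 import BeautifulSoup

# In a real script, the bytes come from the network:
#   import requests
#   html = requests.get("https://example.com/articles").content
# Here, a literal stands in for response.content.
html = b"""<html><body>
  <article><h2>First headline</h2></article>
  <article><h2>Second headline</h2></article>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")
titles = [h2.get_text(strip=True) for h2 in soup.select("article h2")]
print(titles)  # ['First headline', 'Second headline']
```

BeautifulSoup accepts either bytes or str; when given bytes it sniffs the encoding itself, which is one less thing to get wrong.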
BeautifulSoup does not execute JavaScript and does not fetch pages itself. Pair it with a renderer (Playwright, Puppeteer) if the target is a SPA, or with httpx/aiohttp for async fetching at scale.
In the wild
- A one-off script pulling article titles from a news site
- Cleaning scraped HTML before feeding it into an LLM context
- Extracting tables from a Wikipedia dump
How Brand.dev uses BeautifulSoup
Endpoints in the Brand.dev API where this concept comes up directly.
FAQ
BeautifulSoup or lxml directly?
BeautifulSoup is friendlier; lxml is faster and stricter. Use lxml when you are parsing millions of documents and the API gymnastics are worth the speedup.
Does BeautifulSoup support CSS selectors?
Yes, via soup.select() and soup.select_one(), which delegate to the soupsieve library and cover most modern CSS3 syntax.
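A few selector flavors soupsieve handles, against an invented fragment:

```python
from bs4 import BeautifulSoup

html = """<ul>
  <li id="first" class="item">one</li>
  <li class="item sale" data-sku="A1">two</li>
</ul>"""

soup = BeautifulSoup(html, "html.parser")

items = soup.select("li.item")                 # tag plus class
first = soup.select_one("#first")              # id; one tag or None
by_attr = soup.select("li[data-sku=A1]")       # attribute selector
second = soup.select_one("ul > li:nth-of-type(2)")  # combinator + pseudo-class

print(first.get_text())   # one
print(second.get_text())  # two
```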
Can BeautifulSoup parse broken HTML?
Yes. Choose html5lib for the most lenient browser-like behavior, lxml for speed on mostly-valid markup, or html.parser for zero extra dependencies.
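The parser is the second argument to the constructor. Only html.parser ships with Python; the commented lines assume lxml and html5lib have been installed separately:

```python
from bs4 import BeautifulSoup

broken = "<p>unclosed paragraph <b>bold text"

# Stdlib parser: no extra dependency, tolerant of missing close tags.
soup = BeautifulSoup(broken, "html.parser")
print(soup.find("b").get_text())  # bold text

# With the extra packages installed (pip install lxml html5lib),
# the same call takes a different parser name:
#   BeautifulSoup(broken, "lxml")      # fastest
#   BeautifulSoup(broken, "html5lib")  # most browser-like repair
```

The three parsers can produce slightly different trees for malformed input, so pin one explicitly rather than relying on BeautifulSoup's default pick.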
Related terms
Scrapy: A Python framework for building large-scale web crawlers, with batteries included for scheduling, retries, deduplication, and data pipelines.
Web scraping: Programmatically extracting structured data from websites that were designed to be read by humans.
CSS selector: A pattern that identifies elements in an HTML document by tag, class, id, attribute, or position, used by stylesheets and (heavily) by web scrapers.
XPath: A query language for selecting nodes in an XML or HTML document using path expressions, widely used by scrapers when CSS selectors are not expressive enough.
HTML: HyperText Markup Language, the markup standard that defines the structure and semantics of every web page.