Web Scraping & Crawling
What is BeautifulSoup?
A Python library for parsing HTML and XML and extracting data from it using a friendly, forgiving API.
Also known as: bs4
BeautifulSoup wraps an underlying parser (lxml, html.parser, or html5lib) and exposes the resulting tree through a Pythonic interface: soup.find("h1"), soup.select(".price"), tag["href"]. The API is gentle on real-world HTML, which is to say it does not throw a fit when a page is missing closing tags or has nested forms, both of which are common in the wild.
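A minimal sketch of that interface; the HTML snippet and the names in it are invented for illustration:

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <h1>Acme Corp</h1>
  <a class="price" href="/buy">$9.99</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

heading = soup.find("h1")        # first <h1> tag, or None if absent
prices = soup.select(".price")   # all elements matching a CSS selector
link = soup.find("a")["href"]    # attribute access via subscript

print(heading.get_text())   # Acme Corp
print(prices[0].get_text()) # $9.99
print(link)                 # /buy
```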
For most scraping work, BeautifulSoup pairs naturally with the requests library: fetch the HTML, hand the bytes to BeautifulSoup, walk the soup. It is the lingua franca of Python scraping tutorials for a reason. Its performance ceiling is lower than using lxml directly or selectolax, so heavy crawlers eventually outgrow it; for individual pages or notebook work, the speed difference rarely matters.
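The fetch-then-parse pattern looks like this; the network call is shown as a comment so the sketch runs offline, with a bytes literal standing in for the response body:

```python
from bs4 import BeautifulSoup

# In a real script, the bytes come from the network:
#   import requests
#   html = requests.get("https://example.com/articles").content
# Here, a literal stands in for response.content.
html = b"""<html><body>
  <article><h2>First headline</h2></article>
  <article><h2>Second headline</h2></article>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")
titles = [h2.get_text(strip=True) for h2 in soup.select("article h2")]
print(titles)  # ['First headline', 'Second headline']
```

BeautifulSoup accepts either bytes or str; when given bytes it sniffs the encoding itself, which is one less thing to get wrong.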
BeautifulSoup does not execute JavaScript and does not fetch pages itself. Pair it with a renderer (Playwright, Puppeteer) if the target is a SPA, or with httpx/aiohttp for async fetching at scale.
In the wild
- A one-off script pulling article titles from a news site
- Cleaning scraped HTML before feeding it into an LLM context
- Extracting tables from a Wikipedia dump
How Brand.dev uses BeautifulSoup
Endpoints in the Brand.dev API where this concept comes up directly.
FAQ
BeautifulSoup or lxml directly?
BeautifulSoup is friendlier; lxml is faster and stricter. Use lxml when you are parsing millions of documents and the API gymnastics are worth the speedup.
Does BeautifulSoup support CSS selectors?
Yes, via soup.select() and soup.select_one(), which delegate to the soupsieve library and cover most modern CSS3 syntax.
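A few selector flavors soupsieve handles, against an invented fragment:

```python
from bs4 import BeautifulSoup

html = """<ul>
  <li id="first" class="item">one</li>
  <li class="item sale" data-sku="A1">two</li>
</ul>"""

soup = BeautifulSoup(html, "html.parser")

items = soup.select("li.item")                 # tag plus class
first = soup.select_one("#first")              # id; one tag or None
by_attr = soup.select("li[data-sku=A1]")       # attribute selector
second = soup.select_one("ul > li:nth-of-type(2)")  # combinator + pseudo-class

print(first.get_text())   # one
print(second.get_text())  # two
```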
Can BeautifulSoup parse broken HTML?
Yes. Choose html5lib for the most lenient browser-like behavior, lxml for speed on mostly-valid markup, or html.parser for zero extra dependencies.
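The parser is the second argument to the constructor. Only html.parser ships with Python; the commented lines assume lxml and html5lib have been installed separately:

```python
from bs4 import BeautifulSoup

broken = "<p>unclosed paragraph <b>bold text"

# Stdlib parser: no extra dependency, tolerant of missing close tags.
soup = BeautifulSoup(broken, "html.parser")
print(soup.find("b").get_text())  # bold text

# With the extra packages installed (pip install lxml html5lib),
# the same call takes a different parser name:
#   BeautifulSoup(broken, "lxml")      # fastest
#   BeautifulSoup(broken, "html5lib")  # most browser-like repair
```

The three parsers can produce slightly different trees for malformed input, so pin one explicitly rather than relying on BeautifulSoup's default pick.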
Related terms
Scrapy: A Python framework for building large-scale web crawlers, with batteries included for scheduling, retries, deduplication, and data pipelines.
Web scraping: Programmatically extracting structured data from websites that were designed to be read by humans.
CSS selector: A pattern that identifies elements in an HTML document by tag, class, id, attribute, or position, used by stylesheets and (heavily) by web scrapers.
XPath: A query language for selecting nodes in an XML or HTML document using path expressions, widely used by scrapers when CSS selectors are not expressive enough.
HTML: HyperText Markup Language, the markup standard that defines the structure and semantics of every web page.