Web Scraping & Crawling
What is a CSS selector?
A pattern that identifies elements in an HTML document by tag, class, id, attribute, or position, used by stylesheets and (heavily) by web scrapers.
CSS selectors started as the targeting mechanism for stylesheets: .price styles every element with class "price". The same syntax doubles as the de facto query language for the DOM, exposed through document.querySelectorAll in browsers and through every major scraping library (BeautifulSoup, Cheerio, Scrapy, Playwright, Puppeteer).
Selectors compose well. article.featured > h2 a[href^="/blog"] reads as: an a tag whose href starts with /blog, inside an h2, that is a direct child of an article with class featured. Combinators (> child, + adjacent sibling, ~ general sibling) and attribute selectors ([type="email"], [data-id*="user"]) cover most extraction patterns.
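As a rough illustration of how that compound selector behaves, here is a minimal sketch using BeautifulSoup's select method. It assumes the beautifulsoup4 package is installed; the HTML snippet and its URLs are invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, made up for this example.
HTML = """
<article class="featured">
  <h2><a href="/blog/css-selectors">CSS selectors</a></h2>
  <div><h2><a href="/blog/nested">Nested heading</a></h2></div>
  <h2><a href="/docs/intro">Docs link</a></h2>
</article>
<article>
  <h2><a href="/blog/other">Not featured</a></h2>
</article>
"""

soup = BeautifulSoup(HTML, "html.parser")

# ">" requires the h2 to be a direct child of article.featured;
# the space before "a" allows the link at any depth inside that h2;
# [href^="/blog"] keeps only links whose href starts with /blog.
links = soup.select('article.featured > h2 a[href^="/blog"]')
hrefs = [a["href"] for a in links]

print(hrefs)  # ['/blog/css-selectors']
```

The nested heading is skipped because its h2 sits inside a div rather than directly under the article, and the docs link is skipped by the attribute prefix test.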
Where CSS falls short is positional logic and text matching. There is no "element whose text contains X" in CSS, and :nth-child works on element index rather than filtered subsets. Scrapers that need those reach for XPath or post-filter the matches in code.
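One way the post-filtering approach might look, sketched with BeautifulSoup (assumes beautifulsoup4 is installed; the list items and the search term are made up): CSS narrows the candidates by structure, then ordinary Python applies the text condition.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, made up for this example.
HTML = """
<ul>
  <li class="item">Free shipping</li>
  <li class="item">Ships in 3 days</li>
  <li class="item">In stock</li>
</ul>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Step 1: CSS selects by structure (tag and class) only.
candidates = soup.select("li.item")

# Step 2: the text condition is applied in code, case-insensitively.
matches = [
    li.get_text(strip=True)
    for li in candidates
    if "ship" in li.get_text().lower()
]

print(matches)  # ['Free shipping', 'Ships in 3 days']
```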
In the wild
- h1.product-title to grab the headline of a product page
- meta[property="og:image"] to pull the Open Graph image URL out of <head>
- a.btn-primary[href*="checkout"] to find a checkout CTA reliably across page variants
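The three selectors above could be applied to a product page roughly as follows. This is a sketch only: it assumes beautifulsoup4 is installed, and the page markup, URLs, and result keys are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical product page, made up for this example.
HTML = """
<html>
  <head>
    <meta property="og:image" content="https://example.com/img/widget.png">
  </head>
  <body>
    <h1 class="product-title">Example Widget</h1>
    <a class="btn-primary" href="/newsletter">Subscribe</a>
    <a class="btn-primary" href="/cart/checkout?step=1">Buy now</a>
  </body>
</html>
"""

soup = BeautifulSoup(HTML, "html.parser")

# select_one returns the first match or None, so guard each lookup.
title = soup.select_one("h1.product-title")
og_image = soup.select_one('meta[property="og:image"]')
checkout = soup.select_one('a.btn-primary[href*="checkout"]')

result = {
    "title": title.get_text(strip=True) if title else None,
    "og_image": og_image.get("content") if og_image else None,
    "checkout_url": checkout.get("href") if checkout else None,
}

print(result)
```

Note that the substring test on href is what separates the checkout button from the other element sharing the btn-primary class.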
How Brand.dev uses CSS selectors
Endpoints in the Brand.dev API where this concept comes up directly.
FAQ
Are CSS selectors faster than XPath?
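In browsers, usually yes: querySelectorAll is highly optimized. In server-side parsers the gap is narrower and depends on the library. lxml, for example, translates CSS selectors to XPath (through the cssselect package) before running them, so the two perform about the same there, while BeautifulSoup supports only CSS selectors and has no XPath engine to compare against.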
How specific should my selectors be?
Specific enough to disambiguate, loose enough to survive minor markup changes. Anchor on stable attributes (data-test, role, aria-label) rather than auto-generated class names from CSS-in-JS.
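To make the trade-off concrete, the sketch below runs a class-based selector and an attribute-based selector against two versions of the same button. It assumes beautifulsoup4 is installed; the class names, the data-test value, and the helper name are invented for illustration.

```python
from bs4 import BeautifulSoup

# Two hypothetical builds of the same page: the CSS-in-JS class name
# changes between deploys, the data-test attribute does not.
HTML_V1 = '<button class="css-1x2y3z" data-test="add-to-cart">Add to cart</button>'
HTML_V2 = '<button class="css-9a8b7c" data-test="add-to-cart">Add to cart</button>'

BRITTLE = "button.css-1x2y3z"              # tied to a generated class name
STABLE = 'button[data-test="add-to-cart"]'  # tied to a stable attribute


def matches(html: str, selector: str) -> bool:
    """Return True if the selector finds at least one element (hypothetical helper)."""
    return BeautifulSoup(html, "html.parser").select_one(selector) is not None


brittle_results = [matches(HTML_V1, BRITTLE), matches(HTML_V2, BRITTLE)]
stable_results = [matches(HTML_V1, STABLE), matches(HTML_V2, STABLE)]

print(brittle_results)  # [True, False]
print(stable_results)   # [True, True]
```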
Can I use pseudo-classes when scraping?
Some, like :nth-of-type or :first-child, are supported by most parsers. Browser-only pseudo-classes (:hover, :focus) are meaningless server-side.
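A short sketch of structural pseudo-classes in a server-side parser, assuming a recent beautifulsoup4 release (whose select method is backed by the soupsieve package); the pricing table is made up.

```python
from bs4 import BeautifulSoup

# Hypothetical pricing table, made up for this example.
HTML = """
<table>
  <tr><th>Plan</th><th>Price</th></tr>
  <tr><td>Basic</td><td>$5</td></tr>
  <tr><td>Pro</td><td>$20</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# :first-child matches a td that is the first element child of its row.
plans = [td.get_text() for td in soup.select("td:first-child")]

# :nth-of-type(2) matches the second td within each row.
prices = [td.get_text() for td in soup.select("td:nth-of-type(2)")]

print(plans)   # ['Basic', 'Pro']
print(prices)  # ['$5', '$20']
```

Both pseudo-classes depend only on document structure, which is why they work without a live browser; state-based ones such as :hover have nothing to evaluate against in a static parse.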
Related terms
XPath
A query language for selecting nodes in an XML or HTML document using path expressions, widely used by scrapers when CSS selectors are not expressive enough.
DOM
The Document Object Model, a tree of objects that represents an HTML document in memory and lets JavaScript manipulate it.
CSS
Cascading Style Sheets, the language browsers use to style HTML: colors, typography, layout, animation, and responsive behavior.
Web scraping
Programmatically extracting structured data from websites that were designed to be read by humans.
BeautifulSoup
A Python library for parsing HTML and XML and extracting data from it using a friendly, forgiving API.