What is XPath?

A query language for selecting nodes in an XML or HTML document using path expressions, widely used by scrapers when CSS selectors are not expressive enough.

Also known as: XML Path Language

XPath ("XML Path Language") treats a document as a tree and lets you navigate it with expressions like //div[@class="price"]/span[1]. The double slash matches any descendant; predicates in square brackets filter by attribute, position, or arbitrary boolean tests; axes (parent::, following-sibling::, ancestor::) walk the tree in directions CSS cannot.
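The pieces above (descendant search, predicates, positions, axes) can be tried out in a few lines. This is a minimal sketch using lxml, which is a third-party package; the sample markup and variable names are invented for illustration.

```python
from lxml import html  # third-party: pip install lxml

# Invented sample markup, for illustration only.
PAGE = """
<html><body>
  <div class="product">
    <h2>Widget</h2>
    <div class="price"><span>$19</span><span>per month</span></div>
    <p>Billed annually.</p>
  </div>
</body></html>
"""

tree = html.fromstring(PAGE)

# // searches descendants at any depth, [@class="price"] filters by attribute,
# and span[1] keeps the first span child (XPath positions start at 1).
first_span = tree.xpath('//div[@class="price"]/span[1]')[0]
amount = first_span.text

# parent:: walks up the tree; selecting @class returns the attribute value.
parent_class = tree.xpath('//span[1]/parent::div/@class')[0]

# following-sibling:: walks sideways to later siblings; [1] is the nearest one.
next_tag = tree.xpath('//div[@class="price"]/following-sibling::*[1]')[0].tag

print(amount, parent_class, next_tag)
```

Note that xpath() always returns a list here, so indexing with [0] assumes a match exists; real scraping code would normally check for an empty result first.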

For web scraping, XPath earns its keep on pages where the data you want is positioned by structure rather than class names: "the third td of the row whose first cell contains the word Total" is one XPath expression, and effectively impossible with CSS selectors, which cannot match on text content. Tools like lxml (Python), Selenium, Playwright, and Scrapy all accept XPath alongside CSS.
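The "Total" row example can be written out as a sketch with lxml. The table below is made up, and the expression shown is one reasonable way to phrase the query, not the only one.

```python
from lxml import html  # third-party: pip install lxml

# Invented invoice table with no ids or classes to hook onto.
TABLE = """
<table>
  <tr><td>Item</td><td>Qty</td><td>Amount</td></tr>
  <tr><td>Widget</td><td>2</td><td>$38</td></tr>
  <tr><td>Grand Total</td><td>2</td><td>$38.00</td></tr>
</table>
"""

tree = html.fromstring(TABLE)

# Keep rows whose first cell contains "Total", then take their third cell.
cells = tree.xpath('//tr[contains(td[1], "Total")]/td[3]/text()')

print(cells)
```

The predicate tests the text of the first cell and the final step picks a cell by position, so the query needs no class names at all.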

The downsides are verbosity and brittleness. XPath expressions get long fast, and those that depend on position or deep nesting break the moment a site rearranges its DOM. The pragmatic rule: prefer CSS for simple class-based selection, and reach for XPath only when you need to reason about position, text content, or relative axes.

In the wild

→ //a[contains(@href, "pricing")] to grab every link to a pricing page
→ //table[@id="prices"]//tr[td[1][text()="Pro"]]/td[2] to read the Pro tier price out of an HTML table
→ Selenium tests asserting on a button identified by visible text rather than a stable id
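The first two expressions above can be run as-is against a small page. This is a sketch with lxml; the markup, URLs, and prices are invented for illustration.

```python
from lxml import html  # third-party: pip install lxml

# Invented page containing a nav bar and a pricing table.
PAGE = """
<html><body>
  <nav>
    <a href="/pricing">Pricing</a>
    <a href="/docs">Docs</a>
    <a href="/pricing/enterprise">Enterprise</a>
  </nav>
  <table id="prices">
    <tr><td>Free</td><td>$0</td></tr>
    <tr><td>Pro</td><td>$49</td></tr>
  </table>
</body></html>
"""

tree = html.fromstring(PAGE)

# Every link whose href contains "pricing"; /@href returns the attribute values.
pricing_links = tree.xpath('//a[contains(@href, "pricing")]/@href')

# Second cell of the row whose first cell's text is exactly "Pro".
pro_price = tree.xpath(
    '//table[@id="prices"]//tr[td[1][text()="Pro"]]/td[2]/text()'
)

print(pricing_links, pro_price)
```

Because text()="Pro" is an exact comparison, a cell containing " Pro " with surrounding whitespace would not match; normalize-space() is the usual workaround in XPath 1.0.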

How Brand.dev uses XPath

Endpoints in the Brand.dev API where this concept comes up directly.

Web Scrape HTML API
Markdown Scrape API

FAQ

XPath or CSS selector?

CSS for class- or id-based selection (shorter, faster, more readable). XPath when you need positional logic, text matching, or axis traversal that CSS does not support.

Is XPath only for XML?

It was designed for XML but works on any document parsed into a node tree. A browser DOM is such a tree, so XPath 1.0 works there. HTML pages with malformed markup may need an HTML parser to normalize them first.
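As a sketch of the normalization step, lxml's HTML parser repairs sloppy markup into a proper tree before any XPath runs. The broken snippet below is invented, and other parsers may repair the same input slightly differently.

```python
from lxml import html  # third-party: pip install lxml

# Invented sloppy markup: no closing </li> tags and an unclosed <b>.
BROKEN = "<ul><li>One<li>Two<li><b>Three</ul>"

# The HTML parser closes the open elements and builds a regular tree.
tree = html.fromstring(BROKEN)

# XPath then sees three sibling li elements.
items = tree.xpath("//li")
texts = [li.text_content() for li in items]

print(texts)
```

Feeding the same string to a strict XML parser would raise a syntax error instead, which is why the HTML parser comes first.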

Can browsers run XPath?

Yes, via document.evaluate(). It is awkward compared to querySelectorAll, which is why most front-end work uses CSS, but Selenium and Playwright expose XPath cleanly.

Related terms

CSS Selector

A pattern that identifies elements in an HTML document by tag, class, id, attribute, or position, used by stylesheets and (heavily) by web scrapers.

Web Scraping

Programmatically extracting structured data from websites that were designed to be read by humans.

DOM

The Document Object Model, a tree of objects that represents an HTML document in memory and lets JavaScript manipulate it.

BeautifulSoup

A Python library for parsing HTML and XML and extracting data from it using a friendly, forgiving API.

Scrapy

A Python framework for building large-scale web crawlers, with batteries included for scheduling, retries, deduplication, and data pipelines.
