What is a regular expression?
A pattern language for matching, searching, and extracting substrings from text, used everywhere from code editors to log parsing to data validation.
Also known as: regular expression
A regular expression describes a set of strings using a compact syntax: \d+ matches one or more digits, [A-Za-z]+ matches a run of letters, ^https?:// matches the start of any string beginning with http:// or https://. Most languages ship with a regex engine in the standard library. The core syntax is mostly portable, but the main flavors (Perl-compatible PCRE, ECMAScript, RE2) differ in the details: RE2, for example, omits backreferences and lookaround.
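As a minimal sketch, the three patterns above can be tried with Python's standard re module. The sample sentence and variable names are made up for illustration.

```python
import re

# Hypothetical sample text, chosen only to exercise the three patterns.
text = "Order 66 shipped to https://example.com on day 12"

# \d+ : every run of one or more digits
digits = re.findall(r"\d+", text)

# [A-Za-z]+ : every run of ASCII letters
letters = re.findall(r"[A-Za-z]+", text)

# ^https?:// : anchored at the start; the "s" is optional
def looks_like_http_url(value: str) -> bool:
    return re.match(r"^https?://", value) is not None

print(digits)                                    # ['66', '12']
print(letters[:3])                               # ['Order', 'shipped', 'to']
print(looks_like_http_url("https://example.com"))  # True
print(looks_like_http_url("ftp://example.com"))    # False
```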
Regex shines for text processing where the structure is locally regular: parsing log lines, validating email shapes, extracting URLs from prose, find-and-replace in code. It is famously poor at parsing genuinely nested grammars (HTML, JSON, programming languages), where a real parser is usually shorter and handles the nesting and escaping cases that a regex gets wrong.
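One way to see the difference is to pull a field out of a small JSON document both ways. This is a sketch with a made-up payload; the naive pattern is an illustration of the failure mode, not a recommended approach.

```python
import json
import re

# Hypothetical payload: the title contains escaped double quotes.
raw = r'{"title": "She said \"hi\"", "tags": ["a", "b"]}'

# Naive regex: capture everything up to the next double quote.
# It cannot tell an escaped quote from the closing one.
naive = re.search(r'"title":\s*"([^"]*)"', raw)
regex_title = naive.group(1)

# A real parser understands the escape rules of the format.
parsed_title = json.loads(raw)["title"]

print(repr(regex_title))   # 'She said \\'  (truncated at the escaped quote)
print(repr(parsed_title))  # 'She said "hi"'
```

The regex could be patched to handle backslash escapes, but each patch covers one more case of the grammar by hand, which is the work the parser already does.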
For data extraction work, regex usually plays a supporting role: clean up boilerplate, normalize whitespace, isolate a phone number from a longer string. The mistake teams make is reaching for regex when a proper parser exists; the bigger mistake is reaching for a parser when one well-written regex would do.
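The supporting-role tasks mentioned here are typically a line or two each. The sketch below assumes a loose, hypothetical phone pattern (an optional +, then digits mixed with spaces, dots, dashes, and parentheses); real phone formats vary by country and would need a stricter rule or a dedicated library.

```python
import re
from typing import Optional

def normalize_whitespace(text: str) -> str:
    """Collapse any run of whitespace (spaces, tabs, newlines) into one space."""
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical, deliberately loose phone shape: starts and ends with a digit,
# with at least seven digits or separators in between.
PHONE = re.compile(r"\+?\d[\d ().-]{7,}\d")

def first_phone(text: str) -> Optional[str]:
    """Return the first phone-shaped substring, or None if there is none."""
    match = PHONE.search(normalize_whitespace(text))
    return match.group(0) if match else None

messy = "Call us   at\n+1 (415) 555-0132 before 5pm."
phone = first_phone(messy)
print(phone)                       # +1 (415) 555-0132
print(re.sub(r"\D", "", phone))    # 14155550132
```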
In the wild
- /[\w.-]+@[\w.-]+\.\w+/g to find email-shaped strings in a page
- /^[A-Z]{2}\d{2}[A-Z]{4}\d{14}$/ to check the shape of a UK-format IBAN (it does not verify the check digits, and other countries use different layouts)
- A log-parsing pipeline using regex to pull out timestamp, level, and request-id fields
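The three uses above might look like this in Python. The email and IBAN patterns are the ones listed; the log line format and the field names in LOG_LINE are assumptions made up for the example, since real log layouts differ.

```python
import re

EMAIL = re.compile(r"[\w.-]+@[\w.-]+\.\w+")
IBAN_SHAPE = re.compile(r"^[A-Z]{2}\d{2}[A-Z]{4}\d{14}$")

# Hypothetical log layout: "<timestamp> <LEVEL> <request-id> <message>"
LOG_LINE = re.compile(
    r"^(?P<timestamp>\S+)\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<request_id>req-[0-9a-f]+)\s+"
    r"(?P<message>.*)$"
)

emails = EMAIL.findall("Contact ada@example.com or sales@example.co.uk.")
print(emails)  # ['ada@example.com', 'sales@example.co.uk']

# Shape check only: two letters, two digits, four letters, fourteen digits.
print(bool(IBAN_SHAPE.match("GB82WEST12345698765432")))  # True
print(bool(IBAN_SHAPE.match("DE89370400440532013000")))  # False (different layout)

fields = LOG_LINE.match(
    "2024-05-01T12:00:03Z ERROR req-7f3a2c upstream timed out"
).groupdict()
print(fields["level"], fields["request_id"])  # ERROR req-7f3a2c
```

Named groups keep the extraction readable: downstream code refers to fields["level"] rather than to a positional group number.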
FAQ
When should I NOT use regex?
For nested or recursive structures (HTML, JSON, programming languages), use a real parser. The classic Stack Overflow rant about parsing HTML with regex is right about this.
What is catastrophic backtracking?
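A pathology where a poorly written regex (often with nested quantifiers like (a+)+) takes exponential time on certain inputs, because a backtracking engine tries every way of splitting the input between the quantifiers before giving up. RE2, and Go's regexp package, which follows the same design, avoid this by guaranteeing matching time linear in the input length; backtracking engines such as PCRE do not.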
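A small experiment makes the effect visible. Python's re module is a backtracking engine, so the sketch below times the nested pattern against an equivalent flat one on inputs that almost match. The input sizes are kept small on purpose; exact timings depend on the machine and Python version, so none are shown.

```python
import re
import time

NESTED = re.compile(r"^(a+)+$")  # nested quantifiers: many ways to split a run of a's
FLAT = re.compile(r"^a+$")       # accepts exactly the same strings, with one way to match

def time_match(pattern, text):
    """Return (match result, elapsed seconds) for a single match attempt."""
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start

for n in (10, 14, 18):
    # A run of a's followed by one b: the match must fail, but only
    # after the engine has exhausted its alternatives.
    text = "a" * n + "b"
    _, nested_secs = time_match(NESTED, text)
    _, flat_secs = time_match(FLAT, text)
    print(f"n={n:2d}  nested={nested_secs:.6f}s  flat={flat_secs:.6f}s")
```

On a backtracking engine the nested timing typically grows very quickly as n increases, while the flat pattern stays near-instant. Rewriting the pattern so there is only one way to match a given input is usually the simplest fix.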
PCRE vs ECMAScript regex?
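PCRE has the richer syntax, including recursive patterns that ECMAScript does not offer. ECMAScript regex (in browsers and Node.js) has been catching up: lookbehind and named capture groups arrived in ES2018, so older runtimes may reject patterns that use them. Test on the engine your code actually runs on.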
Related terms
The process of pulling structured data out of unstructured or semi-structured sources like web pages, PDFs, or emails.
Programmatically extracting structured data from websites that were designed to be read by humans.
JavaScript Object Notation, a lightweight text format for representing structured data, supported natively by every modern language.