Regular Expressions
Regular expressions (regex) are a declarative way to describe patterns in strings. They are built from three operations: union (alternation), concatenation, and Kleene star (repetition). Regular expressions and finite automata are equivalent in power โ every regex can be converted to an NFA, and every NFA can be converted to a regex.
Regex is everywhere: text search, input validation, lexing in compilers, data extraction. Understanding the formal theory behind regex helps you write better patterns and understand their limitations.
Formal Definition
Regular expressions over an alphabet ฮฃ are defined recursively:
Base cases:
ฮต is a regular expression (matches the empty string)
a is a regular expression for each a โ ฮฃ (matches single symbol)
Recursive cases:
If r1 and r2 are regular expressions, then:
(r1 | r2) โ union: matches strings in r1 OR r2
(r1 r2) โ concatenation: matches r1 followed by r2
(r1*) โ Kleene star: matches zero or more copies of r1
Precedence (highest to lowest):
1. Kleene star (*)
2. Concatenation
3. Union (|)
Shorthand:
r+ = r r* (one or more)
r? = r | ฮต (zero or one)
[abc] = a | b | c (character class)
[a-z] = a|b|...|z (range)
Examples
Let's build some regular expressions and understand what they match:
ฮฃ = {0, 1}
Regex: 0|1
Matches: "0", "1"
Regex: (0|1)*
Matches: any binary string (ฮต, "0", "1", "00", "01", ...)
Regex: 0(0|1)*1
Matches: strings starting with 0 and ending with 1
"01" โ, "001" โ, "0111" โ, "0" โ, "1" โ
Regex: (0|1)*01
Matches: strings ending in "01"
"01" โ, "001" โ, "1001" โ, "10" โ
Regex: 0*10*
Matches: strings with exactly one 1
"1" โ, "01" โ, "10" โ, "00100" โ, "011" โ
Regex: (01)*
Matches: "ฮต", "01", "0101", "010101", ...
Equivalence with Finite Automata
The key theorem: a language is regular if and only if it can be described by a regular expression. This means regex, DFA, and NFA are three equivalent ways of describing the same class of languages โ the regular languages.
In practice, regex engines use NFAs (often with extensions like backreferences, which go beyond regular languages). The theoretical equivalence guarantees that any pattern you write in regex can be implemented as a finite automaton, and vice versa.