Regular Expressions / Finite Automata¶
This lane is for the moment string matching stops being about:
- one fixed literal pattern
- or many fixed dictionary patterns
and becomes about:
- a pattern language
- Thompson-style epsilon automata
- and matching by simulating a nondeterministic state set instead of backtracking by hand
The repo's exact first route for this family is:
- starter -> regex-thompson-nfa.cpp
- flagship note -> Regex Membership
- compare point -> KMP
- compare point -> Aho-Corasick
- compare point -> Suffix Automaton
This lane intentionally teaches the full-string membership route first for a small regex syntax:
- literal characters
- concatenation
|*- parentheses
.
It does not start from:
- backreferences
- capture groups
- lazy quantifiers
- escaping rules of one industrial regex library
- or DFA minimization
At A Glance¶
- Use when: the pattern itself is a small regular language, not one fixed literal string
- Core worldview: compile the regex into one epsilon-NFA, then scan the text by maintaining the reachable state set
- Main tools: infix-to-postfix conversion with implicit concatenation, Thompson construction, epsilon-closure, and per-character NFA simulation
- Typical complexity: build in
O(m), simulate one text of lengthninO(mn)wheremis regex length - Main trap: confusing this lane with industrial regex libraries or with exact fixed-pattern tools that are much simpler
Strong study signals:
- "the pattern contains
|or*, so the pattern is now a language" - "I want the classical regex-as-NFA story from Algorithms 4/e"
- "the point is to understand automaton simulation, not to call
std::regex"
Strong anti-cues:
- there is only one literal pattern -> KMP or Z-Function
- there are many fixed patterns against one text -> Aho-Corasick
- the real problem is all substrings of one base string -> Suffix Automaton or Suffix Tree
- the task expects PCRE-style features, captures, or backreferences
Prerequisites¶
- KMP, because you should already know what "one fixed exact pattern" feels like before upgrading to a pattern language
- Trie helps as a structural compare point, because automata are still states plus labeled transitions
- Aho-Corasick helps as a compare note, because it is the exact-matching automaton route for many fixed patterns
Helpful compare points:
- KMP: use this when the pattern is one literal string and exact matching is all you need
- Aho-Corasick: use this when many literal patterns are known first
- Suffix Automaton: use this when one base string generates the automaton, not one regex
- String Hashing: not a fit when exact language membership is the point
Problem Model And Syntax Contract¶
The canonical task in this lane is:
- given regex
R - given text
T - decide whether
TmatchesRfully
The repo starter gives special meaning only to:
(and)|*.
Everything else is treated as a literal character.
The starter semantics are:
- implicit concatenation
*applies to the previous atom or parenthesized group.matches any single character- the whole text must match the whole regex
So this lane is not teaching the syntax contract of Java / PCRE / Python regex libraries.
From Literal Matching To Regular-Language Matching¶
One Literal Pattern¶
With KMP or Z, the pattern is one fixed string.
That means:
- every next transition is determined by characters already matched
- there is no branching from
| - there is no repetition operator creating many possible parse lengths
What Changes With Regex¶
The moment the pattern contains:
- alternation like
a|bc - repetition like
(ab)*
the pattern is no longer one string. It is one set of strings.
That is why direct prefix-function style matching is no longer the natural lens.
The Thompson Pivot¶
Instead of backtracking over all choices manually:
- compile the regex into an epsilon-NFA
- keep the current epsilon-closed state set
- consume the text left to right
This converts:
- "which parse branch should I try next?"
into:
- "which NFA states are reachable after this prefix?"
That is the core mental upgrade of this lane.
Core Invariants And Why They Matter¶
The repo starter keeps four contest-facing invariants.
1. Postfix Form Makes Concatenation Explicit¶
Before building the automaton:
- implicit concatenation is inserted explicitly
- the regex is converted to postfix
That is what makes Thompson construction local and stack-based.
2. Every Fragment Represents One Regex Subexpression¶
During Thompson construction, each fragment means:
- one start state
- one list of still-unpatched exits
That is why concat, |, and * can be assembled cleanly without global rewrites.
3. Epsilon-Closure Must Be Complete¶
At every text position, the active state set is not just the states reached by characters. It must also include everything reachable through epsilon transitions.
Missing this closure is the classic bug that makes correct regexes fail spuriously.
4. Match Means Reach The Final Accept State After The Full Text¶
This starter teaches full-string membership.
So success means:
- after consuming all characters of
T - the epsilon-closed active set contains the accept state
This is different from "find a substring match somewhere inside the line."
Complexity And Tradeoffs¶
With regex length m and text length n:
- construction:
O(m) - simulation of one text:
O(mn) - memory:
O(m)
Rule of thumb:
- one literal pattern -> KMP
- many literal patterns -> Aho-Corasick
- one regex language with Thompson syntax -> this lane
Worked Examples¶
Example 1: Repo Canonical Benchmark¶
This repo's flagship note:
The benchmark is intentionally canonical:
- input one regex
- input one text
- answer
1/0for full match
So the whole focus is:
- parse to postfix
- build epsilon-NFA
- simulate the active state set
Example 2: Why This Is Not KMP First¶
If the task is:
- text
T - literal pattern
P - count or locate exact occurrences
then the regex language view is unnecessary overhead.
That is:
Example 3: Why This Is Not Aho-Corasick First¶
If the task is:
- one text
- many literal patterns
then the natural automaton is built from the dictionary, not from one regex expression.
That is:
Repo Starter Contract¶
The starter:
- compiles one regex into one epsilon-NFA
- supports the Thompson-style syntax listed above
- answers one full-string membership query
It does not advertise:
- substring matching by implicit
.* - backreferences
- captures
- escapes
- industrial regex compatibility
Main Traps¶
- forgetting that this lane is about a small explicit regex syntax, not library regex behavior
- trying to use backtracking reasoning instead of state-set simulation
- missing epsilon-closure transitions
- choosing this route when one literal exact matcher already solves the task
Reopen Paths¶
- quick recall -> Regex / Finite Automata hot sheet
- one exact literal pattern -> KMP
- many literal patterns -> Aho-Corasick
- one base string as an automaton -> Suffix Automaton