Aho-Corasick¶
Aho-Corasick is the standard exact-matching automaton for:
- many patterns
- one text stream
- one preprocessing pass over the pattern dictionary
- then one linear scan over the text
Among the standard contest string tools:
- KMP is for one pattern
- Z-Function is for one-string prefix-style structure
- Suffix Automaton is for all substrings of one base string
- Aho-Corasick is for many patterns against one text
This page teaches the contest route that matters most:
- build a trie of the patterns
- add failure links by BFS
- decide how outputs are propagated
- scan the text or run DP over automaton states
At A Glance¶
Reach for Aho-Corasick when:
- the pattern set is known first
- many patterns share prefixes
- you want to search one long text or many generated strings against the same dictionary
- rerunning KMP once per pattern would waste too much work
Strong contest triggers:
- "given
kpatterns and one text, find all matches" - "for each pattern, does it appear in the text?"
- "count how many forbidden patterns occur"
- "build strings that avoid a forbidden dictionary"
- "multi-pattern exact matching"
Strong anti-cues:
- there is only one pattern: KMP or Z is usually simpler
- you need approximate matching rather than exact matching
- the real problem is about one base string and all its substrings: suffix automaton or suffix array may fit better
- the alphabet is huge and the pattern set is tiny enough that direct hashing / map-based checks are simpler
What success looks like after studying this page:
- you can explain failure links as trie-state fallback, not as a memorized BFS trick
- you know why BFS order is exactly what makes failure-link construction valid
- you can distinguish
terminal,output, andvisited-count propagationviewpoints - you can decide when to complete missing transitions eagerly and when to leave them sparse
- you can reuse the automaton for both text scanning and DP over generated strings
Prerequisites¶
Helpful neighboring topics:
Problem Model And Notation¶
Let:
- \(P = \{p_1, p_2, \dots, p_k\}\) be the pattern set
- \(T\) be the text
- \(\Sigma\) be the alphabet
The trie of the patterns has:
- one node per distinct prefix of a pattern
- one edge per character extension
For each node v, the key fields are usually:
next[v][c]: trie or automaton transition by characterclink[v]: failure linkout[v]: pattern ids, terminal count, or some aggregated output information
The root is node 0.
The most important conceptual invariant is:
- node
vrepresents one trie prefix link[v]is the node for the longest proper suffix of that prefix that is also a trie prefix
So when a transition fails, the automaton does not restart blindly. It falls back to the longest suffix state that could still continue matching.
Automaton Playground¶
Replay the classic dictionary `{he, she, hers, his}` against the text `ushers`. The important thing to watch is that a mismatch does not erase progress; the automaton falls to the longest suffix state that is still a trie prefix.
Text stream
What to watch
Visual Reading Guide¶
What to notice:
- the current node always means one concrete suffix of the processed text: the longest suffix that is also a trie prefix
- on the
rinushers, the automaton does not jump back to the root fromshe; it first follows the failure link tohe, and only then continues byr
Why it matters:
- this is the shortest route from memorizing “failure links are like KMP fallback” to seeing the reuse directly on a real dictionary
- it also shows why output propagation matters: landing on
sheshould report bothsheand suffix patternhe
Code bridge:
- the widget is the standard scan loop: while there is no outgoing trie edge, follow
link[v]; then take the goto edge if it exists - the output list shown on each landing state is exactly what a propagated
out[v]field or terminal-list walk would report
Boundary:
- this demo is dictionary matching only; DP over generated strings reuses the same automaton states, but the outer loop is different
- one pattern does not need this machinery at all; reopen KMP when the dictionary collapses to one string
From Brute Force To The Right Idea¶
Brute Force¶
Suppose you have k patterns and one text of length n.
The naive route is:
- run one exact matcher per pattern
- or restart matching from many text positions
Even if each individual pattern match is efficient, the repeated work is wasteful because many patterns share prefixes.
For example, if the pattern set contains:
he, her, hers, hero
then all four patterns begin with the same prefix he. Matching them independently throws away that shared structure.
First Compression: Store Shared Prefixes Once¶
The trie is the right first step:
- identical prefixes are stored once
- every pattern becomes one root-to-node path
This already compresses the pattern set structurally.
But a raw trie is not enough for streaming text search, because after a mismatch you do not want to restart at the root every time.
Why Failure Links Are The Missing Piece¶
Suppose you are at trie node for:
she
and the next text character does not continue from there.
You should not forget everything. The suffix:
he
may still be a useful matched prefix of some dictionary pattern.
So the correct fallback is:
- not "go to root"
- but "go to the longest proper suffix that is also a trie node"
That is exactly what failure links encode.
Why This Produces An Automaton¶
Once failure links exist, every text character can be processed by:
- following trie edges when they exist
- otherwise following failure links until a valid edge appears
- then advancing by that character
This transforms the trie into a multi-pattern exact-matching automaton.
The original 1975 Aho-Corasick paper describes this in terms of three functions:
gotofailureoutput
That is still the cleanest mental model today.
Core Invariant And Why It Works¶
1. Trie Nodes Represent Pattern Prefixes¶
Before adding failure links, every node already has a clean meaning:
- it is the prefix spelled by the path from the root
So a current node during scanning means:
- "the longest suffix of the processed text that is also a trie prefix is this path label"
This interpretation is what failure links preserve.
2. What The Failure Link Means¶
For a non-root node v, link[v] points to the node whose path label is the longest proper suffix of the label of v that also appears in the trie.
This is the trie-state analogue of KMP fallback:
- KMP falls back along borders of one pattern
- Aho-Corasick falls back along suffix-prefix relationships among trie nodes
If a node spells hers, then its failure link points to the node for the longest proper suffix of hers that is also a dictionary prefix.
3. Why BFS Builds Failure Links Correctly¶
Failure links are built in increasing depth order.
Why BFS?
Because when we compute link[child], we need link[parent] already known.
Suppose:
- node
vis already processed - there is a trie edge by character
cto nodeu
Then the correct failure target of u is:
This works because:
link[v]is already the best proper suffix state forv- if that state can continue by
c, then appendingcgives the longest suffix-prefix candidate foru - if not, repeated failure-link fallback keeps searching shorter suffix-prefix candidates
So BFS is not just an implementation convenience. It is the topological order that makes the recurrence valid.
4. Why Text Scan Is Linear¶
During scanning, each text character may cause:
- zero or more failure-link jumps
- then one successful transition
The key amortized fact is:
- every failure jump strictly shortens the current matched prefix
- every successful goto transition lengthens it by one
So the total amount of fallback work across the whole scan is linear in the text length, just like the failure-reuse argument in KMP.
The original paper emphasizes the same operational property: one input symbol is processed per machine cycle once root behavior is completed properly.
5. Output Semantics: Three Common Contest Views¶
There are three equally standard ways to think about outputs.
Terminal-list view¶
Each terminal node stores the pattern ids that end there.
When the scan reaches a node, you may also need to follow failure links to report all suffix patterns that end there.
This is the cleanest conceptual model.
Aggregated-output view¶
Instead of storing only local terminals, propagate output counts through failure links during BFS:
Then every state already knows how many pattern endings are represented there.
This is what the repo template does for simple counting.
The tradeoff is that plain aggregated counts are great for:
- "how many pattern endings are active here?"
but not enough by themselves for:
- "which exact pattern ids ended here?"
Offline visited-count propagation¶
For yes/no or count-per-pattern problems, a common contest trick is:
- scan the text and count visits to states
- process states in reverse BFS order
- add each state's visits into its failure-link parent
After that, the terminal node of each pattern knows how many times that pattern appeared.
Duplicate patterns are handled naturally here: if two inputs end at the same terminal node, they simply read the same propagated total.
This is the route used in Finding Patterns.
6. Why Automaton DP Works¶
Once missing transitions are completed, the automaton becomes a deterministic machine over the alphabet.
That means DP states such as:
are natural whenever you generate strings character by character.
Typical rule:
- if landing in a terminal / forbidden-output state is illegal, ban those transitions
This is the exact bridge from "pattern matching automaton" to "constraint DP over strings."
Complexity And Tradeoffs¶
Let:
- \(M\) be the total length of all patterns
- \(S\) be the number of trie nodes, so \(S = O(M)\)
Typical bounds:
- trie construction: \(O(M)\)
- BFS failure-link construction: \(O(S \cdot |\Sigma|)\) for dense fixed alphabets
- scan one text: \(O(|T|)\) once transitions are completed densely, or \(O(|T| + \text{matches})\) when reporting explicit matches
- memory: \(O(S \cdot |\Sigma|)\) for dense arrays, smaller but heavier-constant for sparse maps
Practical tradeoffs:
| Tool | Best when | Time | Main tradeoff |
|---|---|---|---|
| Aho-Corasick | many exact patterns against one text / one generated-string DP | linear after preprocessing | more memory and setup than one-pattern tools |
| KMP | one pattern against one text | linear and simpler | cannot reuse prefix structure across many patterns |
| Z-function | one-string prefix-matching tasks | linear and compact | not the right model for large pattern dictionaries |
| suffix automaton | one base string, all substrings matter | linear build | solves the opposite direction of many AC tasks |
| hashing | many equality checks with acceptable collision risk | very light | not a full exact multi-pattern automaton |
Rule of thumb:
- one pattern -> KMP / Z
- many patterns, one text -> Aho-Corasick
- one text, all substrings of it matter -> suffix automaton or suffix array
Variant Chooser¶
| Variant | Use it when | Main idea | Main pitfall |
|---|---|---|---|
| dense lowercase automaton | alphabet is tiny and fixed | complete all missing transitions | memory is tied to alphabet size |
| sparse-edge AC | alphabet is large or patterns are sparse | keep only real edges, follow failure in scan loop | scan code becomes more delicate |
| aggregated-output AC | only counts / presence matter | push output info through failure links | losing per-pattern identity |
| reverse-BFS propagation | per-pattern answers after one scan | count visits, then push counts upward | forgetting reverse BFS order |
| automaton DP | forbid / require patterns in generated strings | DP on (pos, state) |
forgetting to block terminal / output states |
Worked Examples¶
Example 1: The Classic he / she / his / hers¶
Take the dictionary:
he
she
his
hers
This is the original paper's famous example.
The trie shares prefixes:
hhehissh
Now look at the node for:
she
Its longest proper suffix that is also a trie prefix is:
he
So:
link("she") = "he"
That one failure link explains the whole algorithmic idea: after matching she, the automaton still remembers that he is also a valid suffix match.
If we scan:
ushers
then after reading the prefix ending at e, the automaton lands at the node for she, and through its output / failure structure we report:
shehe
Then on the next letters r, s, the automaton eventually reports:
hers
So one pass over the text finds all overlapping matches naturally.
Example 2: Why Reverse BFS Propagation Solves Finding Patterns¶
In Finding Patterns, we only need to know whether each pattern appears at least once.
The clean route is:
- remember the terminal node of each pattern
- scan the text and count how many times each state is visited
- process states in reverse BFS order and add visits into
link[v]
Why does this work?
Because if the scan visits some state v, then every suffix-pattern state on the failure chain also matched at that text position.
So reverse propagation along failure links is exactly aggregating suffix matches.
After that:
visits[terminal(pattern)] > 0
means the pattern occurred at least once.
This trick is perfect for:
- yes / no per pattern
- total occurrence count per terminal class
but it does not by itself recover every individual match position. For full reporting, keep explicit output lists during the scan.
This is a great first Aho-Corasick problem because it teaches:
- terminal nodes
- failure links
- and failure-tree aggregation
without drowning in output-list bookkeeping.
Example 3: Forbidden-Pattern DP¶
Suppose we want to count strings of length n over some alphabet that avoid a dictionary of forbidden words.
Build the Aho-Corasick automaton for the forbidden words.
Then define:
as the number of length-pos prefixes that end in automaton state state without having hit a forbidden output yet.
Transition rule:
- try each next character
- move by the automaton transition
- if the destination state is terminal or has inherited forbidden output, skip it
- otherwise add the DP value
This is the cleanest "Aho-Corasick meets DP" pattern, and it is why this automaton belongs near Digit DP and not only near pure matching topics.
Algorithm And Pseudocode¶
The repo starter implementation is:
Build Trie¶
add_word(word):
v = root
for character c in word:
if next[v][c] does not exist:
create it
v = next[v][c]
mark v as terminal
Build Failure Links By BFS¶
build():
queue q
for each character c:
if root has real child by c:
link[child] = root
push child
else:
next[root][c] = root
while q not empty:
v = pop q
propagate local output from link[v] into v if desired
for each character c:
if v has real child u by c:
link[u] = next[link[v]][c]
push u
else:
next[v][c] = next[link[v]][c]
This "complete the missing transitions" style gives a very simple scan loop later.
Scan One Text¶
state = root
for character c in text:
state = next[state][c]
consume output(state)
If you keep sparse edges instead, replace that one-line transition by:
while state != root and next[state][c] missing:
state = link[state]
if next[state][c] exists:
state = next[state][c]
else:
state = root
So the sparse version keeps the same logical fallback, but performs it explicitly during the scan instead of pre-filling all automaton transitions.
Reverse-BFS Visit Propagation¶
scan text and count visits[state]
for v in reverse_bfs_order:
visits[link[v]] += visits[v]
Now a terminal node stores the total number of text positions whose failure chain reaches it.
Implementation Notes¶
1. Decide Dense Versus Sparse Early¶
For lowercase English patterns:
- fixed
26-transition arrays are simple and fast
For large alphabets:
- maps or hash tables may save a lot of memory
The tradeoff is exactly the usual one:
- dense arrays: fastest transitions, more memory
- sparse maps: more flexible, heavier constants
2. Root Behavior Is Part Of Correctness¶
The original paper emphasizes that root transitions should never fail during scanning.
In dense implementations, this is handled by filling:
next[root][c] = root
for missing characters.
That is why the scan loop can process one input character cleanly each iteration.
In sparse implementations, you keep the same semantic rule, but write it in code as:
- "while I can still fail, follow
link" - and if root still has no edge, stay at root
3. terminal And output Are Not The Same Concept¶
A node may be:
- directly terminal for some pattern
- or non-terminal but still have inherited output through failure links
This distinction matters a lot in automaton DP:
- a state can be forbidden even if its own trie path is not itself a full pattern
4. Reverse BFS Is The Right Direction For Failure Aggregation¶
Failure links go from deeper states to shallower suffix states.
So if you want to push information from matches to all suffix patterns, the correct data flow is:
- longer / deeper states first
- then their failure-link parents
That is exactly why reverse BFS order works.
5. Aho-Corasick Versus KMP¶
This is the single most important boundary:
- KMP preprocesses one pattern
- Aho-Corasick preprocesses a whole dictionary
You can think of Aho-Corasick as "KMP fallback, but on trie states."
If there is only one pattern, Aho-Corasick is usually unnecessary overhead.
6. Aho-Corasick Versus Suffix Automaton¶
These two are often confused because both are automata over strings.
But they solve opposite modeling directions:
- Aho-Corasick: many patterns, one text
- suffix automaton: one text, all substrings of it
If the problem statement starts from a pattern dictionary, Aho-Corasick is much more likely to be the natural fit.
Practice Archetypes¶
The most common Aho-Corasick-flavored tasks are:
- does each pattern appear?
- count all matches of many patterns
- report pattern ids at matches
- count / construct strings avoiding forbidden patterns
- DP on
(position, automaton_state)
The strongest contest smell is:
- the pattern set is reused
- shared prefixes should matter
- exact matching is still required
References And Repo Anchors¶
Research sweep refreshed on 2026-04-24.
Primary:
Course:
Reference:
Practice:
Repo anchors:
- practice ladder: Aho-Corasick ladder
- practice note: Finding Patterns
- starter template: aho-corasick.cpp
- notebook refresher: String cheatsheet