Suffix Automaton¶
A suffix automaton is the compact DFA view of all substrings of one string.
Among the standard contest string tools:
- KMP and Z-Function are narrower and easier
- Suffix Array And LCP is often cleaner for static lexicographic order
- suffix automaton is the strongest "one string, all substrings, online build" structure in this cluster
This page teaches the contest route that matters most:
- understand what one state actually means
- understand why clones are necessary
- derive the standard counting / occurrence / LCS tricks from the state invariant
At A Glance¶
Reach for suffix automaton when:
- the string is built left to right and you want to keep a substring structure online
- the task is about all substrings of one base string
- you want distinct-substring counting, occurrence aggregation, or longest common substring with another string
- suffix order is not the main object, but substring behavior is
Strong contest triggers:
- "number of distinct substrings"
- "for each pattern, how many times does it occur in the text?"
- "longest common substring"
- "one fixed base string, many substring-structure questions"
- "online appends, then queries"
Strong anti-cues:
- you only need one exact pattern match: KMP or Z is simpler
- you mainly care about sorted suffixes, suffix ranks, or adjacent LCPs: suffix array is more natural
- the task is about many patterns inserted into one trie-like machine: Aho-Corasick may fit better
- you only need substring equality and collision risk is acceptable: hashing may be much lighter
What success looks like after studying this page:
- you can explain a SAM state as an `endpos` class, not as "some node in a graph"
- you know why every non-root state represents lengths in one interval
- you can explain what the suffix link of a state means
- you can say exactly when a clone is created and what invariant it repairs
- you can derive the formula for distinct substrings and the standard occurrence-propagation trick
Prerequisites¶
Helpful neighboring topics:
Problem Model And Notation¶
Let \(s\) be a string of length \(n\).
For a substring \(x\), define \(endpos(x)\) as the set of positions in \(s\) at which occurrences of \(x\) end.
Two substrings are equivalent if they have the same endpos set.
The suffix automaton of \(s\) is the minimal partial DFA recognizing all substrings of \(s\). In contest language, we usually think of each state as one equivalence class of substrings with the same endpos behavior.
For each state v, the standard fields are:
- `len[v]`: the maximum length of a substring represented by `v`
- `link[v]`: the suffix link of `v`
- `next[v][c]`: the transition from `v` by character `c`
The root is state 0, with:
- `len[0] = 0`
- `link[0] = -1`
The single most important invariant is that every substring \(x\) represented by a non-root state \(v\) satisfies \(len[link[v]] + 1 \le |x| \le len[v]\).
So every non-root state represents a contiguous interval of substring lengths, not one substring and not an arbitrary set of lengths.
Clone Playground¶
Replay the single clone event while building the suffix automaton of `abaabb`. The important thing to watch is not the whole algorithm at once, but why one existing transition becomes too long for a shorter context and must be split.
Visual Reading Guide¶
What to notice:
- before the final `b`, the suffix-link walk reaches the root and finds an existing `b` transition into a state whose `len` is too large for the shorter context
- after cloning, the root goes by `b` into a new state with `len = 1`, while the old state keeps the longer interval and changes its suffix link to the clone
Why it matters:
- this is the shortest route from "clones happen in the weird case" to seeing the precise invariant they repair
- it also makes clear that a clone is not a new occurrence of a substring; it is a new representative for the shorter `endpos` class that used to be merged too coarsely
Code bridge:
- the conflict is exactly the standard branch `if len[p] + 1 < len[q]`
- the repair shown here is the textbook clone procedure: copy the transitions and suffix link of `q`, set `len[clone] = len[p] + 1`, redirect the affected transitions, then set `link[q] = link[cur] = clone`
Boundary:
- this widget isolates one clone event only; the rest of the SAM extension loop still depends on the suffix-link walk that adds missing transitions on the way back
- if you only need one exact pattern search, this whole structure is overkill; reopen KMP or Z-Function
From Brute Force To The Right Idea¶
Brute Force¶
If you want to reason about all substrings of a string of length \(n\), the naive picture is:
- enumerate all \(O(n^2)\) substrings
- store them explicitly
- deduplicate or answer queries over that huge set
This is too large both conceptually and computationally.
First Compression: Many Substrings Behave The Same¶
Suppose two substrings always end at exactly the same positions in the text.
Then whenever one can be extended on the right by some continuation inside the text, the other can too, because they occur in exactly the same ending contexts. So these substrings have the same future behavior.
That suggests a more aggressive summary:
- do not store every substring separately
- merge substrings that have the same `endpos` set
This is the automaton viewpoint.
Why This Leads To A DFA¶
If a state summarizes one endpos class, then reading one more character should move us to the class of the longer substring.
So transitions are not arbitrary graph edges. They are "extend the represented substrings by one character when that extension is still a substring of \(s\)."
That is why suffix automaton becomes a deterministic acyclic graph of substring behavior.
Why Online Construction Is Possible¶
When we append one new character c to the current text:
- only substrings ending at the new last position are new
- all older internal substring behavior remains valid, except for the places where new suffixes must be connected into the automaton
That is exactly what the standard extension algorithm does:
- create a new state for the new longest suffix
- walk suffix links backward to add missing `c` transitions
- split one old state with a clone if an old transition would otherwise violate the length invariant
Core Invariant And Why It Works¶
1. What One State Really Means¶
A SAM state is not "one substring." It is one equivalence class of substrings sharing the same endpos set.
Inside one state v:
- the longest represented substring has length `len[v]`
- the longest represented substring in the suffix-link parent has length `len[link[v]]`
Everything strictly between those two lengths belongs to \(v\) as well: the class of \(v\) contains exactly one substring of each length in the interval \((len[link[v]], len[v]]\).
That interval is the reason many classic formulas become one-liners.
2. What The Suffix Link Means¶
For a non-root state v, link[v] points to the state representing the longest proper suffix that lives in a different endpos class.
So suffix links are not arbitrary fallback pointers. They are the structural chain:
- take the longest string in `v`
- move to the longest proper suffix whose behavior is genuinely different
This immediately implies \(len[link[v]] < len[v]\) for every non-root state \(v\).
That strict drop is what keeps the graph acyclic and what makes length-ordered DP possible later.
3. Why Walking Suffix Links During Extension Is Correct¶
Suppose we append character c.
The new substrings are exactly the suffixes of the new full string ending at that new last position. Starting from the old last state and walking suffix links enumerates those suffix classes from longest to shorter ones.
Whenever some visited state has no c transition yet, we add one to the new state. This is correct because:
- that state represents a suffix class of the old string
- appending `c` creates a new valid longer substring ending at the new last position
- until we meet a state that already had a `c` transition, this extension behavior is genuinely new
4. Why Clones Exist¶
The only delicate case is when we first hit a state p that already has a transition by c to some state q.
If `len[p] + 1 == len[q]`:
then q is already the correct suffix-link target for the new state, and life is easy.
But if `len[p] + 1 < len[q]`:
then q is too "long" to be used directly. It mixes two behaviors that should now be separated:
- the shorter-context behavior needed by the new path
- the older longer-context behavior already stored in `q`
So we create a clone:
- copy all outgoing transitions of `q`
- copy `link[q]`
- set `len[clone] = len[p] + 1`
- redirect the relevant `c` transitions along the suffix-link chain from `q` to `clone`
- make both `q` and the new state point by suffix link to `clone`
The clone is not a new substring occurrence. It is a new representative for the shorter equivalence class that was previously merged too coarsely.
Why are copied transitions still correct? Because cloning does not change which right-extensions are currently valid from that class. It only fixes the boundary between:
- the shorter interval that should branch earlier
- and the longer interval that should remain in `q`
5. Why The Structure Stays Small¶
On each appended character:
- we always create one new "real" state
- we create at most one clone
So after processing a string of length \(n\), the automaton has at most \(2n - 1\) states, for \(n \ge 2\).
The classic bound on transitions is linear as well: at most \(3n - 4\) transitions for \(n \ge 3\), by the standard suffix automaton theorem.
That is the core reason SAM is contest-usable at all: it represents all substrings, but only with linear many states and transitions.
6. Terminal States And Occurrences¶
During construction, every prefix of the processed string corresponds to the current last state at some moment; these are the states that receive the initial occurrence counts.
The terminal states are the states on the suffix-link chain starting from the final last state. They correspond to suffixes of the whole string.
For occurrence counting, the common pattern is:
- every time we append a character, increment the count of the new real state
- sort states by `len`
- propagate counts from longer states to `link[v]`
After that propagation, the state reached by a pattern stores how many times that pattern occurs in the base string.
Complexity And Tradeoffs¶
For a standard online SAM build:
- states: \(O(n)\)
- transitions: \(O(n)\) total
- build: \(O(n)\) with constant alphabet arrays or expected near-linear with sparse maps
- many classic queries become \(O(|pattern|)\) or one extra DAG DP
Practical tradeoffs:
| Tool | Best when | Time | Main tradeoff |
|---|---|---|---|
| suffix automaton | one base string, many substring-structure questions, online build | linear build | hardest mental model in the standard strings toolkit |
| suffix array + LCP | static lexicographic order / repeated-substring structure | O(n log n) preprocess | not online |
| KMP / Z | one exact matching or border-style task | linear | much narrower |
| hashing | equality / compare-style tasks with acceptable collision risk | very light | probabilistic and less structural |
| Aho-Corasick | many patterns against one text stream | linear in text + matches | solves the opposite direction of many SAM tasks |
Rule of thumb:
- if the question is "what can all substrings of this one text do?" -> think suffix automaton
- if the question is "where do these many patterns occur in one text?" -> think Aho-Corasick
- if the question is "what is the sorted suffix structure?" -> think suffix array
Variant Chooser¶
| Variant | Use it when | Main idea | Main pitfall |
|---|---|---|---|
| bare SAM build | you need online substring structure | one state per endpos class | not knowing what a state means |
| SAM + distinct-count formula | count distinct substrings | sum `len[v] - len[link[v]]` | counting root or misunderstanding intervals |
| SAM + occurrence propagation | answer frequency queries | push counts from longer states to suffix links | incrementing clones like real states |
| SAM + scan second string | longest common substring | walk transitions, fallback by suffix links | forgetting to shrink current matched length when following links |
| SAM + DAG DP | count / optimize over all substrings | DP in increasing or decreasing `len` order | using the wrong topological direction |
Worked Examples¶
Example 1: Distinct Substrings From State Intervals¶
Take:
s = aba
Its non-root SAM states contribute:
- one state with interval `(0, 1]`, contribution `1`
- one state with interval `(0, 2]`, contribution `2`
- one state with interval `(1, 3]`, contribution `2`
So: \(1 + 2 + 2 = 5\) distinct substrings.
And indeed the distinct substrings are:
a, b, ab, ba, aba
This is the most important formula on the page: the number of distinct substrings of \(s\) is \(\sum_{v \ne root} \big(len[v] - len[link[v]]\big)\).
Why is this a true partition rather than a lucky counting trick?
- each non-root state contributes exactly one substring for every length in its interval
- no substring can belong to two different states, because a substring has exactly one `endpos` class
So the intervals do not overlap across states at the level of represented substrings.
Example 2: Why A Clone Appears¶
Take:
s = banana
During the build, when the second `a` is appended, you encounter a state `q` already reached by character `a`, but it is too long: `len[p] + 1 < len[q]`.
Intuitively, the old state q was summarizing both:
- the very short substring `a`
- and longer contexts that also end with `a`
After the new append, those two behaviors must be separated. So we create a clone:
- same outgoing transitions as `q`
- same old suffix link as `q`
- shorter `len`
The clone becomes the correct representative for the shorter-context class, while q keeps the longer-context class.
That is the whole point of cloning:
- not "make another copy just in case"
- but "split one too-large equivalence class so the interval invariant stays true"
Example 3: Longest Common Substring With Another String¶
Build the SAM of:
s = ababa
Now scan:
t = cababd
Maintain:
- current state `v`
- current matched length `cur`
As you read t:
- if the next edge exists, follow it and increase `cur`
- otherwise follow suffix links until the edge exists or you return to root
- if you follow a suffix link to state `u`, reset `cur` to `len[u]`
This scan discovers the longest common substring:
abab
of length 4.
The reason it works is exactly the same SAM invariant:
- state = right-context summary
- suffix link = longest valid fallback class
So scanning another string through the automaton is the substring analogue of "reuse the biggest valid suffix and continue."
Algorithm And Pseudocode¶
Repo starter template:
Best practice anchor for a full solved note:
Standard Extension¶
```
extend(c):
    create cur
    len[cur] = len[last] + 1
    occ[cur] = 1                     // only if you plan to propagate occurrences later
    p = last
    while p != -1 and next[p][c] does not exist:
        next[p][c] = cur
        p = link[p]
    if p == -1:
        link[cur] = root
    else:
        q = next[p][c]
        if len[p] + 1 == len[q]:
            link[cur] = q
        else:
            create clone
            copy transitions of q into clone
            link[clone] = link[q]
            len[clone] = len[p] + 1
            occ[clone] = 0           // clone is structural, not a new end position
            while p != -1 and next[p][c] == q:
                next[p][c] = clone
                p = link[p]
            link[q] = clone
            link[cur] = clone
    last = cur
```
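The extension routine can be sketched concretely. Below is a minimal Python version of the pseudocode; the class name `SAM` and the field names `nxt` and `length` are this sketch's own choices (`length` stands in for `len`, which is a Python built-in), and occurrence seeding is omitted here:

```python
class SAM:
    def __init__(self):
        self.nxt = [{}]        # nxt[v][c]: transition from state v by character c
        self.link = [-1]       # link[v]: suffix link
        self.length = [0]      # length[v]: longest substring in the class of v
        self.last = 0          # state of the full string processed so far

    def _new_state(self, length, link):
        self.nxt.append({})
        self.length.append(length)
        self.link.append(link)
        return len(self.nxt) - 1

    def extend(self, c):
        cur = self._new_state(self.length[self.last] + 1, -1)
        p = self.last
        # add missing c-transitions along the suffix-link chain
        while p != -1 and c not in self.nxt[p]:
            self.nxt[p][c] = cur
            p = self.link[p]
        if p == -1:
            self.link[cur] = 0
        else:
            q = self.nxt[p][c]
            if self.length[p] + 1 == self.length[q]:
                self.link[cur] = q
            else:
                # clone: split the too-long class q
                clone = self._new_state(self.length[p] + 1, self.link[q])
                self.nxt[clone] = dict(self.nxt[q])
                while p != -1 and self.nxt[p].get(c) == q:
                    self.nxt[p][c] = clone
                    p = self.link[p]
                self.link[q] = clone
                self.link[cur] = clone
        self.last = cur

sam = SAM()
for ch in "abaabb":          # the playground string: exactly one clone event
    sam.extend(ch)

# strict len drop along every suffix link, the acyclicity invariant
assert all(sam.length[sam.link[v]] < sam.length[v]
           for v in range(1, len(sam.nxt)))
```

Building `abaabb` this way produces 8 states (7 characters' worth of real states plus one clone), and the root's `b` transition ends at the `len = 1` clone, matching the playground walkthrough above.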
Distinct Substrings¶
```
answer = 0
for each non-root state v:
    answer += len[v] - len[link[v]]
```
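The interval-sum formula can be checked end to end in a few lines of Python. The helper names `build_sam`, `distinct_substrings`, and `brute` are invented for this sketch, and the result is cross-checked against a brute-force substring set:

```python
def build_sam(s):
    # parallel lists: transitions, suffix links, max lengths
    nxt, link, length, last = [{}], [-1], [0], 0
    for c in s:
        cur = len(nxt)
        nxt.append({}); link.append(-1); length.append(length[last] + 1)
        p = last
        while p != -1 and c not in nxt[p]:
            nxt[p][c] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][c]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                clone = len(nxt)
                nxt.append(dict(nxt[q])); link.append(link[q])
                length.append(length[p] + 1)
                while p != -1 and nxt[p].get(c) == q:
                    nxt[p][c] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur
    return nxt, link, length

def distinct_substrings(s):
    _, link, length = build_sam(s)
    # each non-root state contributes one substring per length in its interval
    return sum(length[v] - length[link[v]] for v in range(1, len(length)))

def brute(s):
    return len({s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)})

assert distinct_substrings("aba") == brute("aba") == 5   # matches Example 1
```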
Occurrence Propagation¶
```
counting sort states by len
process states from longer to shorter:
    occ[link[v]] += occ[v]
```
After this, if a pattern walks to state v, then occ[v] is its number of occurrences.
The important initialization rule is:
- real appended states start with `occ = 1`
- clones start with `occ = 0`
because clones are structural splits, not new end positions.
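The seeding and propagation rules can be sketched as follows. This is an illustrative Python version with invented helper names (`build_sam_with_occ`, `count_occurrences`), and it uses `sorted` by descending `len` instead of the counting sort mentioned in the pseudocode:

```python
def build_sam_with_occ(s):
    nxt, link, length, occ, last = [{}], [-1], [0], [0], 0
    for c in s:
        cur = len(nxt)
        nxt.append({}); link.append(-1)
        length.append(length[last] + 1)
        occ.append(1)                      # real appended state: one new end position
        p = last
        while p != -1 and c not in nxt[p]:
            nxt[p][c] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][c]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                clone = len(nxt)
                nxt.append(dict(nxt[q])); link.append(link[q])
                length.append(length[p] + 1)
                occ.append(0)              # clone is structural, not a new end position
                while p != -1 and nxt[p].get(c) == q:
                    nxt[p][c] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur
    # propagate counts from longer states to their suffix-link parents
    for v in sorted(range(1, len(length)), key=lambda u: -length[u]):
        occ[link[v]] += occ[v]
    return nxt, occ

def count_occurrences(nxt, occ, pattern):
    v = 0
    for c in pattern:
        if c not in nxt[v]:
            return 0
        v = nxt[v][c]
    return occ[v]

nxt, occ = build_sam_with_occ("banana")
assert count_occurrences(nxt, occ, "ana") == 2   # overlapping occurrences count
```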
Longest Common Substring With Another String¶
```
v = root
cur = 0
best = 0
for character c in other_string:
    while v != root and next[v][c] does not exist:
        v = link[v]
        cur = len[v]          // shrink the match to the fallback class
    if next[v][c] exists:
        v = next[v][c]
        cur++
    else:
        v = root
        cur = 0
    best = max(best, cur)
```
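The scan translates directly to Python on top of any SAM build. Here is a self-contained sketch (helper names `build_sam` and `longest_common_substring` are this sketch's own) that reproduces Example 3:

```python
def build_sam(s):
    nxt, link, length, last = [{}], [-1], [0], 0
    for c in s:
        cur = len(nxt)
        nxt.append({}); link.append(-1); length.append(length[last] + 1)
        p = last
        while p != -1 and c not in nxt[p]:
            nxt[p][c] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][c]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                clone = len(nxt)
                nxt.append(dict(nxt[q])); link.append(link[q])
                length.append(length[p] + 1)
                while p != -1 and nxt[p].get(c) == q:
                    nxt[p][c] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur
    return nxt, link, length

def longest_common_substring(s, t):
    nxt, link, length = build_sam(s)
    v, cur, best = 0, 0, 0
    for c in t:
        while v != 0 and c not in nxt[v]:
            v = link[v]
            cur = length[v]    # shrink the match to the fallback class
        if c in nxt[v]:
            v = nxt[v][c]
            cur += 1
        else:
            v, cur = 0, 0      # no occurrence of c anywhere in s
        best = max(best, cur)
    return best

assert longest_common_substring("ababa", "cababd") == 4   # "abab", as in Example 3
```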
Implementation Notes¶
1. Know Exactly What Each Field Means¶
Do not code SAM while thinking:
- `len` is "depth"
- `link` is "parent"
That is too vague.
Use the real meaning:
- `len[v]` = maximum represented substring length
- `link[v]` = longest proper suffix in a different class
Once those are solid, clone logic becomes much easier to debug.
2. A Clone Is Structural, Not A New Occurrence¶
When you create a clone:
- copy transitions
- copy the suffix link
- set a shorter `len`
But do not treat it as a newly seen end position.
This is the classic bug in frequency problems:
- real appended state gets occurrence `1`
- clone gets occurrence `0`
and only later receives counts through propagation.
3. Propagate In Descending len¶
If you want occurrence counts, process states from longer to shorter.
Why?
Because the suffix link points from a longer class to a shorter one. So the data flow is:
- children / longer states first
- suffix-link parent / shorter state later
Reversing that order silently breaks frequency answers.
4. A Reached State Represents A Whole Class, Not One End Position¶
If a pattern walks successfully to state v, the answer you read from v is about the whole endpos class of that pattern.
This is why one state can answer:
- how many times the pattern occurs
- whether it appears at all
- and, after extra preprocessing, some aggregate over its occurrences
Do not think "the path ended at one occurrence." Think "the path ended at the class summarizing all occurrences of this substring."
5. Arrays Versus Maps¶
For lowercase Latin letters:
- fixed `26`-entry transition arrays are simple and fast
For larger alphabets:
- maps or hash tables are often cleaner
The contest tradeoff is:
- arrays: faster, larger memory, alphabet-specific
- maps: more general, slightly heavier constants
6. State Count Is Linear, But Mental Overhead Is Real¶
Suffix automaton is powerful, but contest mistakes are usually conceptual, not syntactic:
- wrong clone length
- wrong redirect loop
- wrong occurrence propagation
- wrong interpretation of which queries are answered by a reached state
If the task is only about exact pattern matching or borders, the simpler structure is usually the better one.
7. Suffix Automaton Versus Suffix Array¶
This is the most useful judgment call in the static-string setting:
- suffix array is better when lexicographic order and adjacent LCP are the main story
- suffix automaton is better when you want one compact machine for all substrings and state-based aggregation
If the task sounds like:
- "sort suffixes"
- "rank suffixes"
- "RMQ on LCP"
prefer suffix array.
If it sounds like:
- "count all distinct substrings"
- "aggregate occurrences by substring state"
- "scan another string against one growing base text"
prefer suffix automaton.
Practice Archetypes¶
The most common suffix-automaton-flavored tasks are:
- count distinct substrings
- count occurrences of a queried pattern after one preprocessing build
- longest common substring
- DP over all substring states
- online extension of one base string
The strongest contest smell is:
- one string is special
- all substrings of that string matter
- the result depends on state-style aggregation rather than suffix order
References And Repo Anchors¶
Research sweep refreshed on 2026-04-24.
Primary:
Course:
Reference:
Essay / blog:
Practice:
Repo anchors:
- practice ladder: Suffix Automaton ladder
- practice note: Distinct Substrings
- starter template: suffix-automaton.cpp
- starter example: distinctsubstrings.cpp
- notebook refresher: String cheatsheet