Suffix Array And LCP¶
Suffix array is the contest-friendly exact structure for:
- suffix order
- lexicographic substring logic
- repeated-substring reasoning
- and many "one static string, many structural questions" problems
It is heavier than String Hashing, but much lighter conceptually than a suffix tree and often easier to explain and debug than Suffix Automaton for static-string tasks.
This page teaches the classic contest route:
- build suffix array by doubling
- build LCP by Kasai
- reuse
sa,rank, andlcpfor common exact-string tasks
At A Glance¶
Reach for suffix array when:
- suffixes need to be sorted exactly
- repeated-substring structure matters
- lexicographic order of suffixes or substrings is the main object
- the string is static and preprocessing-heavy queries are acceptable
Strong contest triggers:
- "longest repeated substring"
- "number of distinct substrings"
- "sort suffixes or rank suffixes"
- "exact substring comparisons after preprocessing"
- "one static string with many lexicographic or LCP-style queries"
Strong anti-cues:
- you only need one narrow exact-matching task: KMP or Z-Function may be cleaner
- substring equality alone is enough and probabilistic methods are acceptable: hashing may be simpler
- the task is online / dynamic over the string
- suffix order is overkill for the actual question
What success looks like after studying this page:
- you can explain what
sa,rank, andlcpeach mean - you understand why doubling by equivalence classes works
- you know why many explanations append a sentinel, and why this repo template instead uses virtual
-1ranks for the missing second half - you can explain Kasai's
k--reuse intuition - you can derive distinct-substring and longest-repeated-substring formulas from
lcp
Prerequisites¶
Helpful neighboring topics:
Problem Model And Notation¶
Let s be a string of length n.
The suffix starting at position i is:
The suffix array sa is the list of starting indices of all suffixes in lexicographic order.
So:
The inverse array is usually called rank:
The LCP array stores longest common prefixes of adjacent suffixes in suffix-array order:
This array has length n - 1 if no sentinel row is kept in the final structure.
From Brute Force To The Right Idea¶
Brute Force¶
The naive way to build a suffix array is:
- generate all suffix strings
- sort them directly
Sorting needs O(n log n) comparisons, but each comparison may take O(n) time, so the total can reach:
This is too slow at contest sizes.
The Structural Observation¶
We do not need to compare whole suffixes from scratch at every sort step.
Instead, we can sort suffixes by:
- first
1character - then first
2characters - then first
4characters - then first
8characters - and so on
This gives the doubling approach.
Why Equivalence Classes Are The Right Summary¶
Suppose we already know the equivalence class of every suffix by its first 2^k characters.
Then to compare suffixes by their first 2^{k+1} characters, it is enough to compare the pair:
where c_k[i] is the class of the first half and c_k[i + 2^k] is the class of the second half.
So a longer comparison is reduced to a pair of shorter comparisons already summarized by class ids.
That is the entire reason the doubling algorithm works.
Why A Sentinel Helps¶
The cleanest contest implementation appends a sentinel character:
- smaller than every real character
- occurring only once at the end
This does two things:
- every suffix becomes comparable without running off the end
- sorting cyclic shifts of
s + '$'becomes equivalent to sorting suffixes ofs
That is why many textbook and CP-Algorithms explanations start with an explicit sentinel.
The repo starter template takes a slightly different but equivalent route: it does not append a sentinel to the stored string, and instead treats the missing second half rank as -1 during doubling. The teaching picture is still the same:
- a shorter suffix should compare smaller once one side runs out of characters
- you need one consistent rule for "nothing is here anymore"
So the sentinel model is the cleanest way to think, while the virtual--1 model is the concrete policy used in this repo snippet.
Core Invariant And Why It Works¶
Doubling Invariant¶
After iteration k, we maintain:
sa: suffixes sorted by their first2^kcharactersc_k[i]: the equivalence class of suffixiunder comparison of the first2^kcharacters
This means:
if and only if the length-2^k prefixes of suffixes i and j are equal.
So c_k[i] is not "some helper label". It is exactly the rank of suffix i under the truncated key of length 2^k.
Why The Next Iteration Uses Rank Pairs¶
To compare suffixes i and j by 2^{k+1} characters, split each prefix into two halves of length 2^k.
Then:
is determined by:
So sorting suffixes by those rank pairs gives the correct order for the next doubling level. In other words:
c_k[i]summarizes the first halfc_k[i + 2^k]summarizes the second half- the ordered pair is a complete signature for the first
2^{k+1}characters
This is the real induction step:
- assume
c_kcorrectly ranks length-2^kprefixes - sort by pairs of those ranks
- compress equal pairs into new classes
c_{k+1}
That is the induction step of the construction.
When Doubling Stops¶
Once:
the compared prefix length already covers every suffix completely, so the suffix order is final.
At that point all suffixes have unique relative positions.
Kasai's Invariant¶
After building sa, we want lcp.
Kasai uses the inverse array rank, and scans suffixes in original index order.
Suppose the current suffix s[i..] has LCP length k with its next suffix in suffix-array order.
If we delete the first character from both matched suffixes, we immediately get a common prefix of length k - 1 between s[i+1..] and the shifted comparison partner. The actual neighbor of s[i+1..] in suffix-array order might be different, but it cannot force the best possible LCP below that already-known k - 1 floor before we start extending again.
So when we move from suffix i to suffix i+1, the reusable part of the previous answer can only drop by 1.
That is the reason for the standard reuse line:
if k > 0: k--
before extending again.
Why Kasai Is Linear¶
Every successful character comparison increases k.
Across the whole algorithm:
kcan only increase up ton- and it decreases by at most
1per outer iteration
So the total number of comparisons is O(n).
This is the LCP analogue of the usual amortized trick in linear string algorithms:
- reuse most of the previous work
- only pay for frontier expansion
Complexity And Tradeoffs¶
For the classic doubling route:
- suffix array:
O(n \log n) - Kasai LCP:
O(n) - memory: usually
O(n)toO(n \log n)depending on whether intermediate class tables are stored
Practical tradeoffs:
| Tool | Best when | Time | Main tradeoff |
|---|---|---|---|
| suffix array + LCP | exact static-string order / repeated structure | preprocess O(n \log n) |
heavier than hashing for narrow tasks |
| string hashing | equality / quick comparisons with acceptable collision risk | light and practical | probabilistic unless carefully hardened |
| suffix automaton | substring-state counting / endpos-style reasoning | powerful and linear | more abstract and harder to debug for many learners |
| KMP / Z | narrow exact-match or prefix-structure tasks | linear and simpler | too specialized for full suffix ordering |
Rule of thumb:
- if the task is about suffix order or repeated-substring structure, suffix array is a strong exact static-string choice
- if you only need one small exact-matching primitive, it is often overkill
Variant Chooser¶
| Variant | Use it when | Main idea | Main pitfall |
|---|---|---|---|
| doubling suffix array | standard contest baseline | ranks of length 1, 2, 4, 8, ... |
sentinel / pair-comparison mistakes |
| suffix array + Kasai | you also need adjacent LCP information | reuse k - 1 when shifting suffix start |
mixing sa with rank |
suffix array + RMQ on lcp |
many arbitrary suffix-LCP queries | RMQ over adjacent LCPs | forgetting that arbitrary suffix LCP becomes a range minimum |
| cyclic-shift variant | minimal rotation / cyclic order tasks | sort cyclic shifts | mixing true suffix order with cyclic order |
Worked Examples¶
Example 1: A Tiny Suffix Array¶
Take:
s = banana
The suffixes are:
0 banana
1 anana
2 nana
3 ana
4 na
5 a
Sorted:
5 a
3 ana
1 anana
0 banana
4 na
2 nana
So:
This single example is worth memorizing because it makes the meanings of sa and rank concrete immediately.
To make the doubling construction feel less magical, look at one real round on the same string.
Start by classifying suffixes by their first character:
index: 0 1 2 3 4 5
char: b a n a n a
class: 1 0 2 0 2 0
Now build length-2 keys by pairs:
0 -> (1, 0)
1 -> (0, 2)
2 -> (2, 0)
3 -> (0, 2)
4 -> (2, 0)
5 -> (0, -1)
Sorting by those pairs gives:
5, 1, 3, 0, 2, 4
with new length-2 classes:
index: 0 1 2 3 4 5
class2: 2 1 3 1 3 0
That is the whole doubling idea in miniature:
- one round starts from class ids for length
2^k - sorts by rank pairs
- and produces class ids for length
2^{k+1}
Later rounds keep doing the same thing until the order stabilizes into the final suffix array.
Example 2: Distinct Substrings¶
Every suffix contributes all of its prefixes as candidate substrings.
Suffix sa[i] has:
prefixes in total.
But among those, exactly lcp[i-1] are duplicates of prefixes already seen from the previous suffix in sorted order.
Why is the previous suffix enough?
Because if the current suffix shared a longer prefix with some even earlier suffix, then all suffixes sharing that prefix would form one contiguous block in suffix-array order. So the nearest previous suffix inside that block shares at least as much. The "largest duplicate overlap with the past" is therefore already captured by the immediate previous neighbor.
So the number of new substrings contributed by suffix sa[i] is:
with lcp[-1] treated as 0.
Equivalently:
This is one of the cleanest suffix-array applications.
Example 3: Longest Repeated Substring¶
If a substring appears at least twice, then there are at least two suffixes sharing it as a prefix.
In suffix-array order, all suffixes sharing a common prefix form a contiguous block.
So some adjacent pair in that block has at least that common-prefix length.
Therefore:
This is exactly the logic behind Repeating Substring.
Example 4: Arbitrary Suffix LCP From Adjacent LCP¶
Suppose suffixes i and j appear at positions rank[i] < rank[j] in the suffix array.
Then:
So arbitrary suffix-LCP queries reduce to RMQ on the lcp array.
This is one of the most important conceptual bridges:
- adjacent LCP information
- plus RMQ
- gives general suffix-LCP queries
Algorithm And Pseudocode¶
The repo starter implementation is:
Doubling Construction¶
build_suffix_array(s):
append sentinel if needed
initialize classes by first character
for k = 0, 1, 2, ... while 2^k < n:
sort suffixes by pair:
(class[i], class[i + 2^k])
rebuild new classes
remove sentinel row if your final API wants suffixes of the original string only
return sa
Kasai LCP¶
build_lcp(s, sa):
rank[sa[i]] = i for all i
k = 0
for i from 0 to n - 1:
if rank[i] is the last suffix:
k = 0
continue
j = sa[rank[i] + 1]
while i + k < n and j + k < n and s[i + k] == s[j + k]:
k++
lcp[rank[i]] = k
if k > 0:
k--
If you instead run Kasai on a sentinel-extended string, the bounds checks can be hidden by the sentinel convention. The important thing is to keep the pseudocode consistent with the string model you chose earlier.
Distinct Substrings¶
answer = n * (n + 1) / 2
for each value x in lcp:
answer -= x
Longest Repeated Substring¶
best = max(lcp)
take one suffix position where lcp is best
print that many characters from the corresponding suffix
Implementation Notes¶
1. Decide Sentinel Policy Early¶
The cleanest doubling implementations append a sentinel like $ that is smaller than all real characters.
Be consistent about:
- whether the sentinel suffix stays in the final
sa - whether
lcpis computed on the original string or the sentinel-extended one
The repo starter template currently uses the other common policy:
- keep the original string unchanged
- and use second-half rank
-1wheni + 2^kgoes out of range
Both conventions are fine. Many off-by-one bugs come from starting with one convention and mentally switching to the other halfway through.
2. sa And rank Are Inverses, Not Alternatives¶
Keep this mental picture:
sa[pos]-> where is the suffix starting index at sorted positionpos?rank[i]-> what is the sorted position of suffixi?
Confusing these two arrays is one of the most common implementation mistakes.
3. Doubling Stores Class Information, Not Literal Prefix Strings¶
At level k, do not think:
- "I stored the first
2^kcharacters"
Think:
- "I stored a compact id for the equivalence class of the first
2^kcharacters"
That is why the algorithm is efficient.
4. Kasai Uses Original Index Order For A Reason¶
Kasai scans suffixes by starting index, not by suffix-array order, because moving from i to i+1 lets us reuse k - 1.
That reuse is the entire linear-time trick.
5. Suffix Array Is Exact But Heavy¶
Before reaching for it, ask:
- do I really need suffix order?
- or do I only need exact matching, borders, or equality checks?
If the narrower problem can be solved by:
- KMP
- Z-function
- hashing
then suffix array may be unnecessary overhead.
6. Suffix Array Versus Suffix Automaton¶
This is a useful judgment call:
- suffix array is excellent for static lexicographic order, repeated-substring structure, and LCP reasoning
- suffix automaton is excellent for substring-state counting, end-position structure, and state-DP style reasoning
If the task reads like "sort suffixes" or "use adjacent suffix common prefixes", suffix array is often the more natural tool.
Practice Archetypes¶
The most common suffix-array-flavored tasks are:
- distinct substrings
- longest repeated substring
- lexicographic suffix order
- arbitrary suffix-LCP with RMQ
- static string with many exact order queries
The strongest contest smell is:
- one static string
- many questions about sorted suffix structure or repeated substrings
References And Repo Anchors¶
Research sweep refreshed on 2026-04-24.
Course:
Reference:
Primary reading pointers:
- Manber and Myers: Suffix Arrays: A New Method for On-Line String Searches
- Kasai et al.: Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications
Practice:
Repo anchors:
- practice ladder: Suffix Array And LCP ladder
- practice note: Repeating Substring
- starter template: suffix-array.cpp
- notebook refresher: String cheatsheet