Suffix Array And LCP¶

Suffix array is the contest-friendly exact structure for:

suffix order
lexicographic substring logic
repeated-substring reasoning
and many "one static string, many structural questions" problems

It is heavier than String Hashing, but much lighter conceptually than a suffix tree and often easier to explain and debug than Suffix Automaton for static-string tasks.

This page teaches the classic contest route:

build suffix array by doubling
build LCP by Kasai
reuse sa, rank, and lcp for common exact-string tasks

At A Glance¶

Reach for suffix array when:

suffixes need to be sorted exactly
repeated-substring structure matters
lexicographic order of suffixes or substrings is the main object
the string is static and preprocessing-heavy queries are acceptable

Strong contest triggers:

"longest repeated substring"
"number of distinct substrings"
"sort suffixes or rank suffixes"
"exact substring comparisons after preprocessing"
"one static string with many lexicographic or LCP-style queries"

Strong anti-cues:

you only need one narrow exact-matching task: KMP or Z-Function may be cleaner
substring equality alone is enough and probabilistic methods are acceptable: hashing may be simpler
the task is online / dynamic over the string
suffix order is overkill for the actual question

What success looks like after studying this page:

you can explain what sa, rank, and lcp each mean
you understand why doubling by equivalence classes works
you know why many explanations append a sentinel, and why this repo template instead uses virtual -1 ranks for the missing second half
you can explain Kasai's k-- reuse intuition
you can derive distinct-substring and longest-repeated-substring formulas from lcp

Prerequisites¶

Helpful neighboring topics:

Problem Model And Notation¶

Let s be a string of length n.

The suffix starting at position i is:

\[ s[i..n-1]. \]

The suffix array sa is the list of starting indices of all suffixes in lexicographic order.

So:

\[ sa[k] = \text{the starting position of the } k\text{-th smallest suffix.} \]

The inverse array is usually called rank:

\[ rank[i] = \text{the position of suffix } s[i..] \text{ inside } sa. \]

The LCP array stores longest common prefixes of adjacent suffixes in suffix-array order:

\[ lcp[k] = \text{LCP}(s[sa[k]..], s[sa[k+1]..]). \]

This array has length n - 1 if no sentinel row is kept in the final structure.

From Brute Force To The Right Idea¶

Brute Force¶

The naive way to build a suffix array is:

generate all suffix strings
sort them directly

Sorting needs O(n log n) comparisons, but each comparison may take O(n) time, so the total can reach:

\[ O(n^2 \log n). \]

This is too slow at contest sizes.

The Structural Observation¶

We do not need to compare whole suffixes from scratch at every sort step.

Instead, we can sort suffixes by:

first 1 character
then first 2 characters
then first 4 characters
then first 8 characters
and so on

This gives the doubling approach.

Why Equivalence Classes Are The Right Summary¶

Suppose we already know the equivalence class of every suffix by its first 2^k characters.

Then to compare suffixes by their first 2^{k+1} characters, it is enough to compare the pair:

\[ (c_k[i],\ c_k[i + 2^k]), \]

where c_k[i] is the class of the first half and c_k[i + 2^k] is the class of the second half.

So a longer comparison is reduced to a pair of shorter comparisons already summarized by class ids.

That is the entire reason the doubling algorithm works.

Why A Sentinel Helps¶

The cleanest contest implementation appends a sentinel character:

smaller than every real character
occurring only once at the end

This does two things:

every suffix becomes comparable without running off the end
sorting cyclic shifts of s + '$' becomes equivalent to sorting suffixes of s

That is why many textbook and CP-Algorithms explanations start with an explicit sentinel.

The repo starter template takes a slightly different but equivalent route: it does not append a sentinel to the stored string, and instead treats the missing second half rank as -1 during doubling. The teaching picture is still the same:

a shorter suffix should compare smaller once one side runs out of characters
you need one consistent rule for "nothing is here anymore"

So the sentinel model is the cleanest way to think, while the virtual--1 model is the concrete policy used in this repo snippet.

Core Invariant And Why It Works¶

Doubling Invariant¶

After iteration k, we maintain:

sa: suffixes sorted by their first 2^k characters
c_k[i]: the equivalence class of suffix i under comparison of the first 2^k characters

This means:

\[ c_k[i] = c_k[j] \]

if and only if the length-2^k prefixes of suffixes i and j are equal.

So c_k[i] is not "some helper label". It is exactly the rank of suffix i under the truncated key of length 2^k.

Why The Next Iteration Uses Rank Pairs¶

To compare suffixes i and j by 2^{k+1} characters, split each prefix into two halves of length 2^k.

Then:

\[ s[i..i+2^{k+1}-1] \]

is determined by:

\[ (c_k[i],\ c_k[i+2^k]). \]

So sorting suffixes by those rank pairs gives the correct order for the next doubling level. In other words:

c_k[i] summarizes the first half
c_k[i + 2^k] summarizes the second half
the ordered pair is a complete signature for the first 2^{k+1} characters

This is the real induction step:

assume c_k correctly ranks length-2^k prefixes
sort by pairs of those ranks
compress equal pairs into new classes c_{k+1}

That is the induction step of the construction.

When Doubling Stops¶

Once:

\[ 2^k \ge n, \]

the compared prefix length already covers every suffix completely, so the suffix order is final.

At that point all suffixes have unique relative positions.

Kasai's Invariant¶

After building sa, we want lcp.

Kasai uses the inverse array rank, and scans suffixes in original index order.

Suppose the current suffix s[i..] has LCP length k with its next suffix in suffix-array order.

If we delete the first character from both matched suffixes, we immediately get a common prefix of length k - 1 between s[i+1..] and the shifted comparison partner. The actual neighbor of s[i+1..] in suffix-array order might be different, but it cannot force the best possible LCP below that already-known k - 1 floor before we start extending again.

So when we move from suffix i to suffix i+1, the reusable part of the previous answer can only drop by 1.

That is the reason for the standard reuse line:

if k > 0: k--

before extending again.

Why Kasai Is Linear¶

Every successful character comparison increases k.

Across the whole algorithm:

k can only increase up to n
and it decreases by at most 1 per outer iteration

So the total number of comparisons is O(n).

This is the LCP analogue of the usual amortized trick in linear string algorithms:

reuse most of the previous work
only pay for frontier expansion

Complexity And Tradeoffs¶

For the classic doubling route:

suffix array: O(n \log n)
Kasai LCP: O(n)
memory: usually O(n) to O(n \log n) depending on whether intermediate class tables are stored

Practical tradeoffs:

Tool	Best when	Time	Main tradeoff
suffix array + LCP	exact static-string order / repeated structure	preprocess `O(n \log n)`	heavier than hashing for narrow tasks
string hashing	equality / quick comparisons with acceptable collision risk	light and practical	probabilistic unless carefully hardened
suffix automaton	substring-state counting / endpos-style reasoning	powerful and linear	more abstract and harder to debug for many learners
KMP / Z	narrow exact-match or prefix-structure tasks	linear and simpler	too specialized for full suffix ordering

Rule of thumb:

if the task is about suffix order or repeated-substring structure, suffix array is a strong exact static-string choice
if you only need one small exact-matching primitive, it is often overkill

Variant Chooser¶

Variant	Use it when	Main idea	Main pitfall
doubling suffix array	standard contest baseline	ranks of length `1, 2, 4, 8, ...`	sentinel / pair-comparison mistakes
suffix array + Kasai	you also need adjacent LCP information	reuse `k - 1` when shifting suffix start	mixing `sa` with `rank`
suffix array + RMQ on `lcp`	many arbitrary suffix-LCP queries	RMQ over adjacent LCPs	forgetting that arbitrary suffix LCP becomes a range minimum
cyclic-shift variant	minimal rotation / cyclic order tasks	sort cyclic shifts	mixing true suffix order with cyclic order

Worked Examples¶

Example 1: A Tiny Suffix Array¶

Take:

s = banana

The suffixes are:

0 banana
1 anana
2 nana
3 ana
4 na
5 a

Sorted:

5 a
3 ana
1 anana
0 banana
4 na
2 nana

So:

\[ sa = [5, 3, 1, 0, 4, 2]. \]

This single example is worth memorizing because it makes the meanings of sa and rank concrete immediately.

To make the doubling construction feel less magical, look at one real round on the same string.

Start by classifying suffixes by their first character:

index:   0 1 2 3 4 5
char:    b a n a n a
class:   1 0 2 0 2 0

Now build length-2 keys by pairs:

0 -> (1, 0)
1 -> (0, 2)
2 -> (2, 0)
3 -> (0, 2)
4 -> (2, 0)
5 -> (0, -1)

Sorting by those pairs gives:

5, 1, 3, 0, 2, 4

with new length-2 classes:

index:   0 1 2 3 4 5
class2:  2 1 3 1 3 0

That is the whole doubling idea in miniature:

one round starts from class ids for length 2^k
sorts by rank pairs
and produces class ids for length 2^{k+1}

Later rounds keep doing the same thing until the order stabilizes into the final suffix array.

Example 2: Distinct Substrings¶

Every suffix contributes all of its prefixes as candidate substrings.

Suffix sa[i] has:

\[ n - sa[i] \]

prefixes in total.

But among those, exactly lcp[i-1] are duplicates of prefixes already seen from the previous suffix in sorted order.

Why is the previous suffix enough?

Because if the current suffix shared a longer prefix with some even earlier suffix, then all suffixes sharing that prefix would form one contiguous block in suffix-array order. So the nearest previous suffix inside that block shares at least as much. The "largest duplicate overlap with the past" is therefore already captured by the immediate previous neighbor.

So the number of new substrings contributed by suffix sa[i] is:

\[ (n - sa[i]) - lcp[i-1] \]

with lcp[-1] treated as 0.

Equivalently:

\[ \text{distinct substrings} = \frac{n(n+1)}{2} - \sum lcp[i]. \]

This is one of the cleanest suffix-array applications.

Example 3: Longest Repeated Substring¶

If a substring appears at least twice, then there are at least two suffixes sharing it as a prefix.

In suffix-array order, all suffixes sharing a common prefix form a contiguous block.

So some adjacent pair in that block has at least that common-prefix length.

Therefore:

\[ \text{longest repeated substring length} = \max lcp[i]. \]

This is exactly the logic behind Repeating Substring.

Example 4: Arbitrary Suffix LCP From Adjacent LCP¶

Suppose suffixes i and j appear at positions rank[i] < rank[j] in the suffix array.

Then:

\[ \text{LCP}(i, j) = \min(lcp[rank[i]],\ lcp[rank[i]+1],\ \dots,\ lcp[rank[j]-1]). \]

So arbitrary suffix-LCP queries reduce to RMQ on the lcp array.

This is one of the most important conceptual bridges:

adjacent LCP information
plus RMQ
gives general suffix-LCP queries

Algorithm And Pseudocode¶

The repo starter implementation is:

suffix-array.cpp

Doubling Construction¶

build_suffix_array(s):
    append sentinel if needed
    initialize classes by first character

    for k = 0, 1, 2, ... while 2^k < n:
        sort suffixes by pair:
            (class[i], class[i + 2^k])
        rebuild new classes

    remove sentinel row if your final API wants suffixes of the original string only
    return sa

Kasai LCP¶

build_lcp(s, sa):
    rank[sa[i]] = i for all i
    k = 0

    for i from 0 to n - 1:
        if rank[i] is the last suffix:
            k = 0
            continue

        j = sa[rank[i] + 1]
        while i + k < n and j + k < n and s[i + k] == s[j + k]:
            k++
        lcp[rank[i]] = k
        if k > 0:
            k--

If you instead run Kasai on a sentinel-extended string, the bounds checks can be hidden by the sentinel convention. The important thing is to keep the pseudocode consistent with the string model you chose earlier.

Distinct Substrings¶

answer = n * (n + 1) / 2
for each value x in lcp:
    answer -= x

Longest Repeated Substring¶

best = max(lcp)
take one suffix position where lcp is best
print that many characters from the corresponding suffix

Implementation Notes¶

1. Decide Sentinel Policy Early¶

The cleanest doubling implementations append a sentinel like $ that is smaller than all real characters.

Be consistent about:

whether the sentinel suffix stays in the final sa
whether lcp is computed on the original string or the sentinel-extended one

The repo starter template currently uses the other common policy:

keep the original string unchanged
and use second-half rank -1 when i + 2^k goes out of range

Both conventions are fine. Many off-by-one bugs come from starting with one convention and mentally switching to the other halfway through.

2. `sa` And `rank` Are Inverses, Not Alternatives¶

Keep this mental picture:

sa[pos] -> where is the suffix starting index at sorted position pos?
rank[i] -> what is the sorted position of suffix i?

Confusing these two arrays is one of the most common implementation mistakes.

3. Doubling Stores Class Information, Not Literal Prefix Strings¶

At level k, do not think:

"I stored the first 2^k characters"

Think:

"I stored a compact id for the equivalence class of the first 2^k characters"

That is why the algorithm is efficient.

4. Kasai Uses Original Index Order For A Reason¶

Kasai scans suffixes by starting index, not by suffix-array order, because moving from i to i+1 lets us reuse k - 1.

That reuse is the entire linear-time trick.

5. Suffix Array Is Exact But Heavy¶

Before reaching for it, ask:

do I really need suffix order?
or do I only need exact matching, borders, or equality checks?

If the narrower problem can be solved by:

KMP
Z-function
hashing

then suffix array may be unnecessary overhead.

6. Suffix Array Versus Suffix Automaton¶

This is a useful judgment call:

suffix array is excellent for static lexicographic order, repeated-substring structure, and LCP reasoning
suffix automaton is excellent for substring-state counting, end-position structure, and state-DP style reasoning

If the task reads like "sort suffixes" or "use adjacent suffix common prefixes", suffix array is often the more natural tool.

Practice Archetypes¶

The most common suffix-array-flavored tasks are:

distinct substrings
longest repeated substring
lexicographic suffix order
arbitrary suffix-LCP with RMQ
static string with many exact order queries

The strongest contest smell is:

one static string
many questions about sorted suffix structure or repeated substrings

References And Repo Anchors¶

Research sweep refreshed on 2026-04-24.

Course:

Reference:

CP-Algorithms: Suffix Array

Primary reading pointers:

Practice:

CSES Problem Set

Repo anchors:

practice ladder: Suffix Array And LCP ladder
practice note: Repeating Substring
starter template: suffix-array.cpp
notebook refresher: String cheatsheet