FFT And NTT¶
FFT and NTT are the standard way to turn an expensive quadratic pair-combination loop into an O(N log N) pipeline:
- encode the problem as a convolution
- transform the coefficient arrays
- multiply pointwise
- inverse-transform back
This topic is really about one core idea, convolution via evaluation at roots of unity, with two implementation variants:
- `FFT`: complex numbers, flexible, but rounding-sensitive
- `NTT`: modular arithmetic, exact, but only under friendly modulus conditions
At A Glance¶
Suspect this topic when:
- the brute force is "for every `i`, for every `j`"
- the quantity you need depends on `i + j`, `i - j`, or another additive index relation
- values can be bucketed into frequency arrays
- the statement is secretly asking for polynomial multiplication or generating-function coefficients
Strong contest triggers:
- count all pair sums or pair differences
- count matches between two shifted histograms
- multiply two large coefficient arrays
- compute every answer for all sums at once, not one sum at a time
Strong anti-cues:
- the data is sparse and the direct method is already small enough
- the combination rule is not additive after indexing, especially if the real law is xor or exact subset split on the boolean cube -> FWHT / XOR Convolution / Subset Convolution
- the main bottleneck is graph structure, DP state explosion, or geometry rather than pair aggregation
- you only need one or two query values and there is a much lighter direct method
What success looks like after studying this page:
- you can explain why polynomial multiplication and convolution are the same object
- you know the difference between `linear` and `cyclic` convolution
- you can derive radix-2 FFT from the even/odd split
- you know exactly when to trust `FFT(double)` and when to move to `NTT`
Prerequisites¶
Helpful neighboring topics:
Problem Model And Notation¶
Let

\[
A(x) = \sum_i a_i x^i, \qquad B(x) = \sum_j b_j x^j,
\]

and define the linear convolution of the coefficient arrays by

\[
c_s = \sum_{i + j = s} a_i b_j,
\]

where coefficients outside range are treated as 0.

Then the product polynomial is

\[
A(x)\,B(x) = \sum_s c_s x^s.
\]
So convolution is not just analogous to polynomial multiplication. It is polynomial multiplication, written in coefficient language.
When using a length-N transform, we also need an N-th primitive root of unity, written as \(\omega\), meaning:
- \(\omega^N = 1\)
- \(\omega^t \ne 1\) for every `0 < t < N`
For complex FFT we use \(\omega = e^{2\pi i / N}\).
For NTT we work in a finite field, usually \(\mathbb{F}_p\), and need an element of exact order N.
Butterfly Playground¶
Replay a tiny length-`4` NTT modulo `17` for coefficients `[1, 2, 3, 0]`. The goal is not to memorize a library implementation, but to see how local butterfly layers convert coefficient form into point-value form.
What to watch
Visual Reading Guide¶
What to notice:
- the algorithm never mixes all four coefficients at once; each stage only combines local pairs inside one block
- the first butterfly stage uses twiddle `1`, but later stages multiply one branch by a nontrivial root of unity before combining
Why it matters:
- this is the shortest route from the recursive even/odd proof to the iterative in-place code you actually write in contests
- it also makes bit-reversal feel less magical: it is simply the layout that lets those local butterflies run in the correct order
Code bridge:
- each local pair in the visual is the standard rule `(u, v) -> (u + wv, u - wv)` modulo `p`
- the iterative loops in NTT are just these butterflies repeated over `len = 2, 4, 8, ...` blocks after bit-reversal reordering
Boundary:
- this widget stops at the forward transform only; convolution still needs two forward transforms, one pointwise multiplication, and one inverse transform
- for xor-style pair laws on the boolean cube, this is the wrong family entirely; reopen FWHT / XOR Convolution / Subset Convolution
From Brute Force To The Right Idea¶
Running Problem: Count All Pair Sums¶
Suppose we have two arrays of values, and we want to count how many ordered pairs produce each sum.
The direct brute force is:
for every i:
for every j:
answer[a[i] + b[j]]++
That is O(n^2) pair work.
If the values lie in a manageable range, a better model is:
- `f[u]`: how many times value `u` appears in the first array
- `g[v]`: how many times value `v` appears in the second array
Then the number of pairs with sum `s` is

\[
\sum_{u + v = s} f[u] \, g[v].
\]
That formula is exactly a convolution.
Convolution Before Fourier¶
This modeling step is the real gateway:
- values become indices
- frequencies become coefficients
- pair combinations become coefficient multiplication
At this point the problem is no longer "count pairs". It is "multiply two coefficient arrays fast".
Polynomial View¶
Associate the frequency arrays with polynomials:

\[
F(x) = \sum_u f[u] \, x^u, \qquad G(x) = \sum_v g[v] \, x^v.
\]

Then

\[
F(x)\,G(x) = \sum_s \Bigl( \sum_{u + v = s} f[u] \, g[v] \Bigr) x^s,
\]

so the coefficient of \(x^s\) is exactly the number of pairs producing sum `s`.
That is the central bridge:
quadratic pair counting -> polynomial multiplication -> convolution
Why Roots Of Unity Help¶
Naively, multiplying two degree-d polynomials is still quadratic.
The escape hatch is to stop thinking in coefficient form and temporarily switch to value form:
- evaluate both polynomials at many points
- multiply those values pointwise
- interpolate the product back
Roots of unity are the special evaluation points that make both evaluation and interpolation fast.
Core Invariant And Why It Works¶
The One Geometric-Series Lemma¶
If \(\omega\) is a primitive N-th root of unity, then

\[
\sum_{t = 0}^{N - 1} \omega^{t r} =
\begin{cases}
N & \text{if } N \mid r, \\
0 & \text{otherwise.}
\end{cases}
\]
Why this is true:
- if `N | r`, every term is `1`
- otherwise it is a nontrivial geometric series whose ratio is not `1`, so the sum collapses to `0`
This lemma is the algebraic reason the inverse transform exists.
DFT, IDFT, And Orthogonality¶
For a length-N array `a`, define its transform by

\[
\widehat{a}_t = \sum_{j = 0}^{N - 1} a_j \, \omega^{j t}.
\]

Then the inverse formula is

\[
a_j = \frac{1}{N} \sum_{t = 0}^{N - 1} \widehat{a}_t \, \omega^{-j t}.
\]

The proof is one line after expanding the definition:

\[
\frac{1}{N} \sum_{t} \Bigl( \sum_{k} a_k \, \omega^{k t} \Bigr) \omega^{-j t}
= \frac{1}{N} \sum_{k} a_k \sum_{t} \omega^{(k - j) t},
\]

and the inner sum is N when `k = j` and 0 otherwise, by the lemma above.
Why Convolution Becomes Pointwise Multiplication¶
Let deg A + deg B < N, so there is no wrap-around.
If we evaluate both polynomials at the same N roots of unity, then for each `t`:

\[
C(\omega^t) = A(\omega^t) \, B(\omega^t).
\]

So if \(\widehat{a}\) and \(\widehat{b}\) are the transforms of the coefficient arrays, then

\[
\widehat{c}_t = \widehat{a}_t \, \widehat{b}_t
\]

is the transform of the product polynomial. Applying the inverse transform recovers the coefficient array `c`, which is the convolution.
This is the convolution theorem in the exact form contest code uses.
Why Radix-2 FFT Is Fast¶
Now assume N = 2^k.
Split the polynomial into even and odd coefficients:

\[
A(x) = E(x^2) + x \, O(x^2),
\]

where

\[
E(y) = \sum_k a_{2k} \, y^k, \qquad O(y) = \sum_k a_{2k+1} \, y^k.
\]

Evaluate at \(\omega_N^t\):

\[
A(\omega_N^t) = E(\omega_N^{2t}) + \omega_N^t \, O(\omega_N^{2t}).
\]

But \(\omega_N^2\) is a primitive (N/2)-th root of unity, so the even and odd pieces are just two transforms of length N/2.

Also, since \(\omega_N^{t + N/2} = -\omega_N^t\),

\[
A(\omega_N^{t + N/2}) = E(\omega_N^{2t}) - \omega_N^t \, O(\omega_N^{2t}).
\]

So one pair of half-size transforms gives two outputs at once:

\[
A(\omega_N^t) \quad \text{and} \quad A(\omega_N^{t + N/2}).
\]

That yields the recurrence

\[
T(N) = 2 \, T(N/2) + O(N),
\]

so

\[
T(N) = O(N \log N).
\]
From Recursive Splits To Bit-Reversal And Butterflies¶
The proof above is recursive, but most contest implementations are iterative.
Those two views are the same computation laid out differently.
- the recursive view says "first separate by the last bit of the index: even vs odd"
- the next level separates by the second-last bit
- after `k` levels, the coefficients have been grouped by all `k` low bits
If we reorder the array by reversing the binary representation of each index, then the leaves of that recursion appear in the exact order the iterative butterflies expect.
That is why iterative FFT starts with a bit-reversal permutation: it linearizes the recursion tree so that stage len = 2, 4, 8, ... can perform all butterflies in-place.
Each butterfly is just the local rule

\[
(u, v) \;\mapsto\; (u + \omega v,\; u - \omega v),
\]

where the twiddle factor \(\omega\) depends on the stage and the position inside the block.
When you debug FFT code, this bridge matters:
- the proof tells you why the butterfly formulas are correct
- the bit-reversal layout tells you why the iterative loops touch the entries in the order they do
Why NTT Exists¶
The same algebra works over a finite field as long as the field contains an N-th primitive root of unity.
For a prime p, the nonzero elements of \(\mathbb{F}_p\) form a cyclic multiplicative group of size p - 1. Therefore:
- if `N | (p - 1)`, there exists an element of order `N`
- if `g` is a primitive root modulo `p`, then \(\omega = g^{(p - 1)/N}\) has exact order `N`
That is the whole reason NTT is possible.
For radix-2 NTT, we want N = 2^k, so the practical condition becomes:

\[
2^k \mid (p - 1).
\]

This is why contest NTT implementations use primes of the form

\[
p = c \cdot 2^k + 1.
\]
Complexity And Tradeoffs¶
- direct convolution: `O(nm)`
- FFT / NTT transform: `O(N log N)`
- convolution after padding to length `N`: `O(N log N)`
- memory: `O(N)`
Practical tradeoffs:
- `FFT(double)` is flexible and often easiest when there is no modulus in the problem
- `NTT` is exact and avoids rounding bugs
- `NTT` is constrained by modulus and transform length
- if the frequency range is tiny or sparse enough, direct counting can still win
Variant Chooser¶
| Approach | Exact? | Best use case | What it needs | Main risk |
|---|---|---|---|---|
| Direct nested loops | yes | tiny arrays or tiny value range | no transform | wastes time on large quadratic pair work |
| `FFT(double)` | no, but usually accurate enough | unrestricted integer convolution, pair counting, histogram products | zero-padding, rounding discipline | floating-point noise |
| `NTT(mod p)` | yes | exact convolution modulo a friendly prime | prime `p`, root of unity, `N \| (p - 1)` | not every modulus works |
| Reverse-one-array convolution | yes | difference counting or correlation-style tasks | same transform machinery | indexing mistakes |
| Cyclic convolution | yes | problems that really wrap modulo `N` | no zero-padding | wrong for ordinary polynomial multiplication |
Keep the topic boundary tight:
- this page covers ordinary linear convolution by radix-2 FFT/NTT
- CRT across multiple NTT primes and Bluestein belong later
- formal power series now lives in the sibling lane Polynomial / Formal Power Series
Two indexing changes are worth seeing once before they appear in a contest:
- `difference / correlation` tasks often become convolution only after reversing one array. If you want counts of `a_i - b_j = d`, then after shifting indices into nonnegative range, convolving `f` with `reverse(g)` turns differences into coefficient positions.
- `cyclic convolution` is genuinely different from linear convolution. For example, multiplying `[1, 1]` by `[1, 1]` with transform length `2` gives cyclic result `[2, 2]`, while the ordinary polynomial product is linear result `[1, 2, 1]`. Zero-padding is exactly what prevents that wrap-around.
Worked Examples¶
Example 1: Tiny Polynomial Multiply, And Why Padding Matters¶
Let

\[
A(x) = 1 + 2x + 3x^2, \qquad B(x) = 4 + 5x.
\]

Then the coefficient arrays are [1, 2, 3] and [4, 5], and the linear convolution is:

\[
[4,\ 13,\ 22,\ 15].
\]
You can verify the coefficients directly:
- coefficient of \(x^0\): `1 * 4 = 4`
- coefficient of \(x^1\): `1 * 5 + 2 * 4 = 13`
- coefficient of \(x^2\): `2 * 5 + 3 * 4 = 22`
- coefficient of \(x^3\): `3 * 5 = 15`
So

\[
A(x)\,B(x) = 4 + 13x + 22x^2 + 15x^3.
\]
Now the important implementation point:
- the result has length `3 + 2 - 1 = 4`
- if you use a transform shorter than `4`, the coefficients wrap around and you get cyclic convolution instead of linear convolution
That is why zero-padding is not a cosmetic detail. It is part of the correctness proof.
If you want to see one transform pipeline fully spelled out, use an even smaller pair:
- `a = [1, 1, 0, 0]`
- `b = [1, 1, 0, 0]`
- `N = 4`
- complex root \(\omega = i\)
Then the forward transforms are:

\[
\widehat{a} = \widehat{b} = (2,\ 1 + i,\ 0,\ 1 - i).
\]

Pointwise multiplication gives:

\[
(4,\ 2i,\ 0,\ -2i).
\]

Applying the inverse transform recovers:

\[
(1,\ 2,\ 1,\ 0).
\]
So the nonzero coefficients are exactly the linear convolution of [1, 1] with [1, 1].
This is the whole algorithm in miniature:
- zero-pad to a legal transform length
- transform both arrays
- multiply pointwise
- inverse-transform
- discard the padded tail
Example 1B: Difference Counting Needs A Reversal¶
Suppose:
A = {0, 2}
B = {1, 2}
and we want counts of:
Use frequency arrays over values 0..2:
f = [1, 0, 1]
g = [0, 1, 1]
Reverse g:
reverse(g) = [1, 1, 0].
Now convolve:
[1, 0, 1] * [1, 1, 0] = [1, 1, 1, 1, 0].
If M = 2 is the maximum original value, then coefficient M + d counts pairs with difference d.
So:
- coefficient `3` counts `d = 1`, giving the single pair `(2, 1)`
- coefficient `2` counts `d = 0`, giving the single pair `(2, 2)`
- coefficient `0` counts `d = -2`, giving the single pair `(0, 2)`
The transform did not magically understand subtraction. The reversal is exactly what turned subtraction into coefficient alignment.
Example 2: POST2 And The Modeling Pipeline¶
In POST2, we want to count ordered triples (i, j, k) such that

\[
A_i + B_j = C_k.
\]
The direct O(n^2) pair enumeration is too slow.
The right modeling pipeline is:
- build frequency arrays for `A`, `B`, and `C`
- shift by a fixed offset because values can be negative
- convolve the frequency arrays of `A` and `B`
- interpret the coefficient at index `s` as "how many ordered pairs sum to `s`"
- multiply by the frequency of `C = s`
So the hard part is not the transform itself. The hard part is recognizing that a pair-sum count over values is exactly a coefficient product.
That is why POST2 is the right repo anchor for this topic:
- it exercises value shifting
- it separates modeling from transform mechanics
- it shows how the final coefficients still need problem-specific interpretation
Algorithm And Pseudocode¶
The contest skeleton for both FFT and NTT is the same. Only the underlying arithmetic changes.
convolution(a, b):
choose n as the smallest power of two with n >= |a| + |b| - 1
resize a and b to length n with zeros
transform(a, invert = false)
transform(b, invert = false)
for i from 0 to n - 1:
a[i] = a[i] * b[i]
transform(a, invert = true)
truncate to the first |a_original| + |b_original| - 1 entries
return a
transform(v, invert):
bit-reverse permute v
for len = 2, 4, 8, ..., n:
choose the principal len-th root w_len
if invert:
replace w_len by its inverse
for each block of length len:
w = 1
for j from 0 to len/2 - 1:
u = v[block + j]
t = v[block + j + len/2] * w
v[block + j] = u + t
v[block + j + len/2] = u - t
w = w * w_len
if invert:
divide every entry by n
For FFT(double), w_len is a complex root of unity.
For NTT, w_len is a modular root of unity, and "divide by n" means multiply by n^{-1} mod p.
Implementation Notes¶
Linear vs Cyclic Convolution¶
This is the first question to ask before coding.
The transform itself naturally gives cyclic convolution modulo the transform length. To recover ordinary linear convolution, you must pad with zeros until

\[
N \ge |a| + |b| - 1.
\]
If this condition fails, high-degree coefficients wrap around and contaminate low-degree ones.
Array Bounds And Shifts¶
If the values can be negative, shift them before turning them into indices.
If the maximum shifted value is V, then the result can go up to roughly 2V, so size the frequency arrays and final interpretation logic accordingly.
FFT Rounding Policy¶
The repo template uses llround after inverse FFT. This is appropriate only when:
- the true result is integral
- coefficient growth stays within safe floating-point tolerance
If the coefficients become very large, or the problem already lives modulo a prime, prefer NTT.
FFT Or NTT In A Contest¶
Use this decision rule:
- if the statement already asks for answers modulo a friendly prime, prefer `NTT`
- if the final coefficients must be exact integers and coefficient growth is large, prefer `NTT`
- if there is no modulus and coefficient growth is moderate, `FFT(double)` is usually the simplest valid choice
In other words, FFT(double) is the flexible default for real-number evaluation, while NTT is the trustworthy default for exact modular arithmetic.
A crisp contest heuristic is:
- answer required modulo a friendly prime like `998244353` -> use `NTT`
- exact integer coefficients with very large counts or tight correctness requirements -> prefer `NTT` (or multiple NTT primes + CRT later)
- no modulus, real-valued interpretation, or moderate integral coefficients that are safe to round -> `FFT(double)`
The standard anti-example for casual FFT use is a counting problem whose true coefficients are huge and must be exact: a tiny floating error after the inverse transform can round to the wrong integer, while NTT would stay exact by construction.
Choosing An NTT Modulus¶
For a radix-2 NTT of length N = 2^k, you need:

- `p` prime
- `N | (p - 1)`

The official AtCoder Library convolution docs express this as:

\[
2^c \mid (m - 1).
\]
That is exactly the contest-friendly condition for a length-2^c NTT modulo m.
Friendly-Primes Mindset¶
In practice, competitive programmers often use primes like

\[
998244353 = 119 \cdot 2^{23} + 1,
\]

because a large power of two divides p - 1, which supports radix-2 lengths up to that power.
One Concrete NTT Setup¶
The repo template ntt.cpp uses:

- `MOD = 998244353`
- primitive root `g = 3`
For a stage of length len, the principal stage root is

\[
w_{\mathrm{len}} = g^{(\mathrm{MOD} - 1)/\mathrm{len}} \bmod \mathrm{MOD}.
\]

For example, when len = 8, the stage uses

\[
w_8 = 3^{(998244353 - 1)/8} \bmod 998244353.
\]

The inverse transform uses \(w_{\mathrm{len}}^{-1}\) instead, and the very end multiplies every coefficient by

\[
n^{-1} \bmod 998244353.
\]
That is the concrete implementation meaning of the abstract condition N | (p - 1).
Bit-Reversal And Butterfly Bugs¶
Most wrong implementations fail in one of four places:
- incorrect bit-reversal permutation
- wrong sign or inverse root in the inverse transform
- forgetting the final normalization by `n`
- choosing `n` too small
If the output looks almost right but not exactly, suspect the transform layer.
If the output is wildly wrong, suspect the modeling layer first.
Validation Loop¶
Whenever you write or modify a transform implementation, validate it against naive convolution on tiny random arrays.
The standard loop is:
- generate many random arrays of length at most `8` or `10`
- compute convolution with the transform code
- compute convolution with the `O(nm)` double loop
- compare every coefficient
For FFT(double), this is how you catch rounding or normalization bugs early.
For NTT, this is how you catch wrong stage roots, inverse roots, or missing normalization.
Common Mistakes¶
- forgetting that the default transform computes cyclic, not linear, convolution
- padding to `max(|a|, |b|)` instead of `|a| + |b| - 1`
- forgetting to shift negative values before building frequencies
- using FFT when exact modular arithmetic is required
- forgetting self-pair, unordered-pair, or diagonal corrections in the problem interpretation
- assuming every modulus supports NTT
- treating "polynomial multiplication" and "pair counting" as separate ideas instead of the same coefficient identity
Practice Archetypes¶
Warm-Up¶
- multiply two small polynomials: why it fits: this isolates the coefficient identity before any modeling noise
- pair-sum frequency counting: why it fits: the cleanest bridge from nested loops to convolution
Core¶
- histogram matching or correlation by reversing one array: why it fits: teaches the difference between sums and differences
- exact convolution modulo a friendly prime: why it fits: this is where NTT starts to feel natural instead of magical
Stretch¶
- POST2: why it fits: full contest pipeline with negative shifts, duplicate counts, and coefficient interpretation
- larger coefficient-growth tasks: why it fits: forces the judgment call between `FFT(double)` and `NTT`
References And Repo Anchors¶
- Repo anchor: FFT ladder
- Repo anchor: Canonical FFT template
- Repo anchor: Canonical NTT template
- Repo anchor: Strong repo note - POST2
- Primary: Cooley and Tukey, 1965
- Course: Stanford CS168 - Fourier Transform and Convolution
- Course: MIT 6.046J - FFT, IFFT, and Polynomial Multiplication
- Course: MIT 18.310 - The Finite Fourier Transform
- Course: MIT 18.310 - Implementing the FFT and Multiplying Numbers
- Reference: AtCoder Library - convolution
- Reference: CP-Algorithms - Fast Fourier transform
- Essay / Blog: Mark Newman - How the Cooley-Tukey FFT Algorithm Works, Part 1
- Essay / Blog: Mark Newman - How the Cooley-Tukey FFT Algorithm Works, Part 2
- Background: MIT 18.781 - Primitive Roots