# Parallel Algorithms
This lane exists as a textbook-breadth advanced lane, not as a claim that this repo now teaches full multicore engineering, GPU kernels, or concurrent programming.
The repo's exact first route is intentionally narrow:
- one associative operator
- one array
- one Blelloch scan tree
- one `work / span` lens
- one classical simulator that preserves the parallel structure
That means this page is not:
- an OpenMP tutorial
- a CUDA page
- a lock-free data-structures page
- a distributed-systems page
- a full lane for parallel graph algorithms or parallel sorting frameworks
It is the first clean algorithmic route for:
- PRAM-style thinking
- work and span
- Brent's theorem as a sanity lens
- Blelloch up-sweep / down-sweep scan
- one canonical parallel primitive that still fits honestly inside a sequential C++ repo
with Parallel Prefix Sum Benchmark as the flagship benchmark.
This page sits next to:
- Prefix Sums because scan is the parallel sibling of the ordinary cumulative-sum idea
- Bit-Parallelism / Bitset Optimization because both lanes broaden the usual scalar execution model, but in very different ways
- Algorithm Engineering when the real question is wall-clock performance and hardware behavior rather than one clean theoretical primitive
## At A Glance
- Use when:
- you want the first honest breadth benchmark for parallel algorithm design
- the key lesson is `work O(n), span O(log n)` on an associative scan
- you want to preserve the shape of a parallel tree computation in a sequential simulator
- you are comfortable with this being theory-first, not a contest-core weapon
- Prerequisites:
- Reasoning
- Prefix Sums
- Bit-Parallelism / Bitset Optimization only as a compare-point sibling
- Boundary with nearby pages:
- use this page when the exact lesson is one PRAM-style scan benchmark with `work / span`
- use Prefix Sums when the job is just ordinary cumulative sums in a sequential contest solution
- use Algorithm Engineering when the real problem is cache behavior, profiling, or multicore performance work
- Strongest cues:
- `parallel prefix sum`
- `scan`
- `work-efficient`
- `span`, `depth`, or `Brent's theorem`
- `up-sweep / down-sweep`
- Strongest anti-cues:
- the problem is simply asking for one ordinary prefix-sum array
- the real topic is SIMD bit tricks rather than processor-level parallelism
- the task needs practical threading, synchronization, or hardware APIs
- the lane would have to pretend a sequential C++ snippet already proves real speedup
- Success after studying this page:
- you can state the difference between `work` and `span`
- you can explain why scan is the first canonical parallel primitive
- you can implement or recognize a Blelloch exclusive scan
- you know why this repo route is simulator-first and breadth-only
## Problem Model And Notation
The exact starter route in this repo uses:
- one array `a[0..n-1]`
- one associative operator, first fixed to `+`
- identity element `0`
- one exclusive scan: `out[0] = 0` and `out[i] = a[0] + a[1] + ... + a[i-1]` for `i >= 1`

So for `a = [2, 5, 3]`, the exclusive scan is `out = [0, 2, 7]`.
The exact route pads the array to the next power of two, then runs the classic Blelloch tree:
- up-sweep to build partial sums
- place identity at the root
- down-sweep to push prefix information back to leaves
On the intended parallel model, this gives:
- `O(n)` work
- `O(log n)` span
The repo's implementation is a sequential emulator of that structure.
This is deliberate.
Why:
- it preserves the right algorithmic invariant
- it keeps the lane honest inside a standard C++ repo
- it teaches the first real parallel primitive before any hardware-specific tooling
## Why This Exists Next To Prefix Sums
Prefix Sums teaches the sequential route:
- one left-to-right pass
- `O(n)` time
- no special execution model
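For contrast, that left-to-right pass can be sketched in a few lines. This is a minimal sketch, not the repo's own code; the function name is illustrative, and the operator is the fixed starter contract of `+` with identity `0`:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One O(n) pass with a running sum: out[i] holds the sum of a[0..i-1].
std::vector<long long> sequentialExclusiveScan(const std::vector<long long>& a) {
    std::vector<long long> out(a.size());
    long long running = 0;             // identity for +
    for (std::size_t i = 0; i < a.size(); ++i) {
        out[i] = running;              // prefix strictly before position i
        running += a[i];
    }
    return out;
}
```

Nothing here needs associativity or a dependency-tree view; that is exactly what the parallel lens below adds.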
This page teaches the breadth-first parallel lens:
- the operator must be associative
- the computation is organized as a tree
- the key measures are not just time and memory, but `work` and `span`
So the division of labor is:
- prefix-sums page: contest-core sequential cumulative sums
- parallel-algorithms page: first PRAM-style scan benchmark with work/span analysis
## Core Route In This Repo
The exact route is:
- pad to one power-of-two length `m`
- copy the input into the first `n` leaves and fill the rest with identities
- run the up-sweep:
- each internal node stores the sum of its two children
- set the root to identity
- run the down-sweep:
- left child gets the parent's prefix
- right child gets `parent_prefix + left_subtree_sum`
- read the first `n` leaves as the exclusive scan
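The steps above can be sketched as a sequential emulator. This is a minimal sketch, not the repo's `blelloch-exclusive-scan.cpp`; the function name is illustrative, and `long long` with `+` and identity `0` matches the fixed starter contract:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential emulation of the Blelloch tree: pad, up-sweep, reset the
// root to the identity, down-sweep, read the first n leaves.
std::vector<long long> blellochExclusiveScan(std::vector<long long> a) {
    const std::size_t n = a.size();
    std::size_t m = 1;
    while (m < n) m *= 2;              // pad to a power-of-two length m
    a.resize(m, 0);                    // fill the tail with identities

    // Up-sweep: each internal node ends up holding its subtree's sum.
    for (std::size_t d = 1; d < m; d *= 2)
        for (std::size_t i = 0; i < m; i += 2 * d)
            a[i + 2 * d - 1] += a[i + d - 1];

    a[m - 1] = 0;                      // place the identity at the root

    // Down-sweep: left child gets the parent's prefix, right child gets
    // parent_prefix + left_subtree_sum.
    for (std::size_t d = m / 2; d >= 1; d /= 2)
        for (std::size_t i = 0; i < m; i += 2 * d) {
            long long left = a[i + d - 1];
            a[i + d - 1] = a[i + 2 * d - 1];
            a[i + 2 * d - 1] += left;
        }

    a.resize(n);                       // first n leaves are the answer
    return a;
}
```

Both sweeps run `O(log m)` levels; on the intended parallel model each level's inner loop is one parallel step, which is exactly where the `O(log m)` span comes from.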
The exact starter contract is intentionally narrow:
- one operator only: addition
- one array only
- one Blelloch scan tree only
- one exclusive-scan benchmark only
The exact non-promises matter:
- no true multithreading
- no claim of measured parallel speedup
- no full coverage of list ranking, tree contraction, or parallel graph algorithms
- no hardware/runtime API layer
## Core Invariants

### 1. Associativity Is The Real Contract
The tree re-associates the computation.
So the exact route only makes sense because the operator satisfies `(a + b) + c = a + (b + c)` for every choice of values.
Without associativity, the tree shape changes the result.
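A one-line counterexample makes this concrete (values are illustrative): subtraction is not associative, so the grouping a balanced reduction tree would use disagrees with strict left-to-right evaluation.

```cpp
#include <cassert>

// Sketch: regroup 8 - 4 - 2 - 1 the way a balanced reduction tree would.
inline bool treeShapeChangesSubtraction() {
    int sequential = ((8 - 4) - 2) - 1;   // strict left-to-right: 1
    int tree = (8 - 4) - (2 - 1);         // balanced tree grouping: 3
    return sequential != tree;            // true: the results differ
}
```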
### 2. The First Route Is Exclusive Scan
This lane does not start with "whatever prefix variant is convenient."
It starts with Blelloch's exclusive route because:
- the root is reset to identity cleanly
- the down-sweep invariant is clearer
- inclusive scan is just one tiny post-step away if needed
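That "tiny post-step" can be sketched directly (a minimal sketch with illustrative names): given the input and its exclusive scan, add the element back at each position.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// inclusive[i] = exclusive[i] + a[i], since exclusive[i] is the sum of
// everything strictly before position i.
std::vector<long long> inclusiveFromExclusive(const std::vector<long long>& a,
                                              const std::vector<long long>& ex) {
    std::vector<long long> in(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        in[i] = ex[i] + a[i];
    return in;
}
```

The post-step is itself one embarrassingly parallel map, so it adds `O(n)` work and `O(1)` span.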
### 3. The Repo Route Preserves Structure, Not Hardware Speedup
This repo implementation is sequential.
That must be said directly.
The point is:
- preserve the algorithmic shape
- expose the correct `work / span` benchmark
- keep one parallel-breadth topic concrete
not:
- pretend the snippet alone proves real multicore performance
## Work, Span, And Brent's Theorem
On the intended parallel model:
- `work` = total operations across all processors
- `span` = longest dependency chain
For Blelloch scan on n values:
- work = `O(n)`
- span = `O(log n)`
Brent's theorem gives the basic simulator lens:

`T_p <= W / p + S`

where:

- `W` is total work
- `S` is span
- `p` is the processor count
The theorem is not the whole subject, but it is the right first sanity check:
- low span matters
- work-efficiency still matters
- too much parallelism with too much wasted work is not the ideal target
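As a numeric sanity check, the bound can be evaluated directly. The constants below are illustrative, not measured timings:

```cpp
#include <cassert>

// T_p <= ceil(W / p) + S: even with p processors, the longest
// dependency chain S is still paid in full.
long long brentBound(long long work, long long span, long long p) {
    return (work + p - 1) / p + span;  // ceil(W / p) + S
}
```

For example, `W = 2048`, `S = 20`, and `p = 16` give a bound of `128 + 20 = 148` steps; raising `p` further stops helping once `W / p` shrinks below `S`, which is why low span and work-efficiency both matter.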
## Worked Example: Parallel Prefix Sum Benchmark
Use Parallel Prefix Sum Benchmark.
Take, for example, the classic eight-value input `a = [3, 1, 7, 0, 4, 1, 6, 3]`, whose total is `25`.
The exact repo route computes the exclusive scan `out = [0, 3, 4, 11, 11, 15, 16, 22]`.
The key checkpoints are:
- after the up-sweep, the root holds the total sum `25`
- after replacing the root with `0`, the down-sweep propagates the correct prefix before each leaf
That is the first benchmark the lane wants you to reopen quickly.
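The first checkpoint can be verified with the up-sweep alone. This sketch assumes the illustrative eight-value input whose total the page states as `25`:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Run only the up-sweep and return the root, which should hold the
// total sum just before it is replaced with the identity.
long long upSweepRootSum(std::vector<long long> a) {
    std::size_t m = 1;
    while (m < a.size()) m *= 2;       // pad to a power of two
    a.resize(m, 0);
    for (std::size_t d = 1; d < m; d *= 2)
        for (std::size_t i = 0; i < m; i += 2 * d)
            a[i + 2 * d - 1] += a[i + d - 1];
    return a[m - 1];                   // root of the up-sweep tree
}
```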
## Variant Chooser
| Situation | Best first move | Why it fits | Where it fails |
|---|---|---|---|
| one first parallel primitive with clean `work / span` analysis | use this lane | Blelloch scan is exactly the intended benchmark | weak if the real goal is full multicore engineering or GPU kernels |
| one ordinary cumulative-sum array for a contest problem | use Prefix Sums | the sequential route is simpler and is the real contest tool | weak if the lesson is specifically parallel structure |
| one huge boolean-state DP row packed into machine words | use Bit-Parallelism / Bitset Optimization | bitset packing is the real performance idea there | weak if the lesson is processor-level tree parallelism |
| the real question is measured throughput, caches, profiling, or multicore implementation | use Algorithm Engineering | that page owns practical performance work | weak if the missing piece is still the first PRAM-style benchmark |
## Complexity And Tradeoffs

For input length `n`, after padding to `m = 2^k >= n`:

- intended parallel work = `O(m)`
- intended parallel span = `O(log m)`
- sequential emulator runtime = `O(m)`
The main tradeoff is:
- very honest breadth coverage versus
- zero pretense that this is already a production parallel runtime
That is why the lane belongs in advanced and breadth.
## Main Traps
- forgetting that associativity is the real contract
- mixing up exclusive and inclusive scan semantics
- overclaiming the sequential simulator as a real speed benchmark
- drifting into GPU / threading APIs when the route is still just scan + work/span
- ignoring padding and then reasoning as if the tree structure stayed clean
## Reopen Path
- Topic page: Parallel Algorithms
- Practice ladder: Parallel Algorithms ladder
- Starter template: blelloch-exclusive-scan.cpp
- Notebook refresher: Parallel Algorithms hot sheet
- Compare points:
- Prefix Sums
- Bit-Parallelism / Bitset Optimization
- Parallel Prefix Sum Benchmark