Huffman / Data Compression¶
This lane is for the point where a greedy problem stops being:
- "keep every prefix feasible"
- or "repair the worst current choice"
and becomes:
- merge the two lightest pieces
- build one optimal binary prefix-code tree
- and reason about weighted depth, not about one scan invariant
The repo's exact first route for this family is:
- starter -> huffman-coding.cpp
- flagship note -> Huffman Coding Benchmark
- compare point -> Heaps And Ordered Sets
- compare point -> Prefix Constraints
This lane intentionally teaches the binary Huffman route first:
- positive symbol frequencies
- min-heap over current subtree weights
- repeated merge of the two smallest trees
- weighted external path length and one deterministic code assignment
It does not start from:
- full file compressors
- bitstream serialization
- arithmetic coding
- LZW or run-length coding
- entropy proofs beyond the contest-facing greedy story
At A Glance¶
- Use when: the objective is one optimal binary prefix code or one optimal merge tree from symbol frequencies
- Core worldview: every merge pays its combined weight once more, so the total optimum is exactly the sum of merge costs
- Main tools: min-heap, merge tree, leaf depths, deterministic tie-break policy
- Typical complexity: O(n log n) for n symbols
- Main trap: confusing "build one optimal prefix-code tree" with "implement a production compressor"
Strong study signals:
- "repeatedly take the two smallest weights"
- "optimal prefix code"
- "weighted path length / expected code length"
- "classic Huffman coding from the data-compression chapter"
Strong anti-cues:
- the task is really one range or prefix feasibility scan -> Prefix Constraints
- the point is actual byte-level compression format details
- the alphabet changes online and merge history is not the right object
Prerequisites¶
- Greedy overview, because you should already be comfortable with exchange-style greedy thinking
- Heaps And Ordered Sets, because the implementation route is one disciplined min-heap process
- Sorting helps as a lighter compare point, because many merge problems begin by asking what order matters
Helpful compare points:
- Heaps And Ordered Sets: use this when the main question is only priority-queue mechanics, not greedy optimality
- Prefix Constraints: use that family when feasibility lives in one scan over ordered input, not one merge forest
- Minimum Spanning Tree: another greedy tree-building family, but the proof lens is cut safety rather than weighted depth
Problem Model¶
The canonical task in this lane is:
- you have n symbols
- symbol i has frequency f[i] > 0
- build one binary prefix code minimizing:
sum(f[i] * code_length[i])
Equivalently:
- build one full binary tree whose leaves are the symbols
- minimize weighted leaf depth
That is the exact first model this repo teaches.
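For concreteness, a tiny hand-checkable instance (illustrative, not from the repo): with frequencies f = {2, 1, 1}, the prefix code {0, 10, 11} costs 2*1 + 1*2 + 1*2 = 6, and no binary prefix code on three symbols can beat one leaf at depth 1 plus two leaves at depth 2.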
The Greedy Pivot¶
The central greedy fact is:
- in some optimal tree, the two lightest symbols can be made siblings at maximum depth
So if you merge those two symbols into one combined weight:
w = x + y
the remaining problem is the same problem on one smaller instance.
That is why the algorithm becomes:
- keep all current trees in a min-heap by weight
- remove the two lightest
- merge them
- push their parent back
- repeat until one tree remains
This is the exact greedy backbone of Huffman coding.
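A minimal cost-only sketch of that loop, in the spirit of the repo's huffman-coding.cpp but not a copy of it (the function name and hard-coded input here are illustrative):

```cpp
#include <iostream>
#include <queue>
#include <vector>

// Cost-only Huffman merge: pop the two lightest weights, pay their sum,
// push the merged weight back, until one tree remains.
long long huffmanCost(const std::vector<long long>& freq) {
    if (freq.size() < 2) return 0;  // one symbol: empty codeword, cost 0
    std::priority_queue<long long, std::vector<long long>,
                        std::greater<long long>>
        pq(freq.begin(), freq.end());
    long long total = 0;
    while (pq.size() > 1) {
        long long x = pq.top(); pq.pop();
        long long y = pq.top(); pq.pop();
        total += x + y;  // invariant 2: this merge pays x + y exactly once
        pq.push(x + y);
    }
    return total;        // invariant 3: the optimum is the sum of merge costs
}

int main() {
    std::cout << huffmanCost({45, 13, 12, 16, 9, 5}) << "\n";  // prints 224
}
```

Each iteration removes two trees and adds one, so the loop runs exactly n - 1 times on n symbols.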
Core Invariants¶
The repo starter keeps four contest-facing invariants.
1. The Heap Stores The Current Forest¶
Every heap item means:
- one subtree
- one total frequency
- one deterministic tie-break key for reproducible output
So the heap is not just "numbers"; it is the current reduced instance.
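One way to realize that, as a sketch (the field names and the insertion-order tie-break are assumptions of this sketch, not necessarily the starter's exact policy):

```cpp
// One heap item per live subtree: total weight plus a deterministic
// tie-break key, so equal weights always pop in the same order.
struct Node {
    long long weight;  // total frequency of all leaves in this subtree
    int id;            // creation order; used only to break weight ties
    int left, right;   // children as node-pool indices, -1 for a leaf
};

// Min-heap comparator: lighter subtree first, earlier id on ties.
struct ByWeightThenId {
    bool operator()(const Node& a, const Node& b) const {
        if (a.weight != b.weight) return a.weight > b.weight;
        return a.id > b.id;
    }
};
```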
2. Every Merge Adds Its Weight Once To The Final Objective¶
If two subtrees of weights a and b are merged, then all leaves inside them go one level deeper.
So that merge increases the final weighted path length by exactly:
a + b
This is the fastest sanity check in the whole lane.
3. The Final Cost Equals The Sum Of All Merge Costs¶
You can compute the optimum either by:
- DFS on final leaf depths
- or summing every internal merge weight
Those two numbers must agree.
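A sketch of that cross-check, building the merge tree in an explicit node pool (the pool layout and variable names are this sketch's assumptions, not the repo starter's):

```cpp
#include <cassert>
#include <iostream>
#include <queue>
#include <utility>
#include <vector>

int main() {
    std::vector<long long> f = {45, 13, 12, 16, 9, 5};
    int n = (int)f.size();

    // Node pool: leaves are 0..n-1, internal nodes appended per merge.
    std::vector<long long> weight = f;
    std::vector<int> left(n, -1), right(n, -1);

    using Item = std::pair<long long, int>;  // (weight, node id)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> pq;
    for (int i = 0; i < n; i++) pq.push({f[i], i});

    long long mergeSum = 0;
    while (pq.size() > 1) {
        auto [wx, x] = pq.top(); pq.pop();
        auto [wy, y] = pq.top(); pq.pop();
        weight.push_back(wx + wy);
        left.push_back(x);
        right.push_back(y);
        mergeSum += wx + wy;  // cost paid by this merge
        pq.push({wx + wy, (int)weight.size() - 1});
    }
    int root = pq.top().second;

    // DFS: weighted external path length = sum of f[leaf] * depth(leaf).
    long long depthSum = 0;
    std::vector<std::pair<int, int>> stack = {{root, 0}};  // (node, depth)
    while (!stack.empty()) {
        auto [u, d] = stack.back(); stack.pop_back();
        if (left[u] == -1) { depthSum += weight[u] * d; continue; }
        stack.push_back({left[u], d + 1});
        stack.push_back({right[u], d + 1});
    }

    assert(mergeSum == depthSum);  // the two views of the optimum agree
    std::cout << mergeSum << "\n"; // prints 224 on this input
}
```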
4. One-Symbol Input Needs A Convention¶
If n = 1, the mathematically clean codeword is the empty string.
This repo prints:
- weighted cost 0
- code "-"
so the output stays visible and unambiguous.
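As a sketch, that convention is a two-line special case before the heap loop (printSingleSymbol is a hypothetical name, not the starter's actual function):

```cpp
#include <iostream>

// Hypothetical helper for the n == 1 convention described above: the lone
// codeword has length 0, so the weighted cost is 0, and a visible "-"
// placeholder stands in for the empty string.
void printSingleSymbol() {
    std::cout << 0 << "\n";    // weighted cost
    std::cout << "-" << "\n";  // placeholder codeword
}
```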
Complexity And Tradeoffs¶
With n symbols:
- heap-based build: O(n log n)
- DFS to assign codes: O(n)
- memory: O(n)
Rule of thumb:
- one optimal merge / prefix-code tree -> this lane
- one scan with prefix feasibility -> Prefix Constraints
- one PQ process without a proof obligation -> reopen Heaps And Ordered Sets
Worked Example¶
Classic frequencies:
45 13 12 16 9 5
The deterministic repo starter builds one valid optimal code such as:
0
101
100
111
1101
1100
with weighted cost:
224
The exact codes depend on the tie policy, but the minimum cost is the real invariant.
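You can confirm that cost by hand through invariant 3, summing the merges in heap order:
- 5 + 9 = 14
- 12 + 13 = 25
- 14 + 16 = 30
- 25 + 30 = 55
- 45 + 55 = 100
and 14 + 25 + 30 + 55 + 100 = 224, matching the DFS-depth computation.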
Common Pitfalls¶
- proving only that the smallest pair "looks good" instead of using the sibling-at-maximum-depth reduction
- forgetting that multiple optimal code assignments can exist under ties
- trying to teach file headers, byte packing, or decompression formats in the first lane
- mishandling the one-symbol case
Exit Criteria¶
After this section, you should be able to:
- explain why the two smallest weights are merged first
- compute weighted path length both from merge sums and from final depths
- implement one deterministic binary Huffman builder with a min-heap
- know why this lane is about greedy optimal merge, not about industrial compression tooling
Go Deeper¶
- Reference: Algorithms 4/e - Huffman
- Reference: Algorithms 4/e - Data Compression slides
- Reference: Cornell CS312 - Priority Queues, Heaps, Huffman Coding
- Practice: Greedy ladders