Huffman / Data Compression¶
This lane is for the point where a greedy problem stops being:
- "keep every prefix feasible"
- or "repair the worst current choice"
and becomes:
- merge the two lightest pieces
- build one optimal binary prefix-code tree
- and reason about weighted depth, not about one scan invariant
The repo's exact first route for this family is:
- starter -> huffman-coding.cpp
- flagship note -> Huffman Coding Benchmark
- compare point -> Heaps And Ordered Sets
- compare point -> Prefix Constraints
This lane intentionally teaches the binary Huffman route first:
- positive symbol frequencies
- min-heap over current subtree weights
- repeated merge of the two smallest trees
- weighted external path length and one deterministic code assignment
It does not start from:
- full file compressors
- bitstream serialization
- arithmetic coding
- LZW or run-length coding
- entropy proofs beyond the contest-facing greedy story
At A Glance¶
- Use when: the objective is one optimal binary prefix code or one optimal merge tree from symbol frequencies
- Core worldview: every merge pays its combined weight once more, so the total optimum is exactly the sum of merge costs
- Main tools: min-heap, merge tree, leaf depths, deterministic tie-break policy
- Typical complexity: O(n log n) for n symbols
- Main trap: confusing "build one optimal prefix-code tree" with "implement a production compressor"
Strong study signals:
- "repeatedly take the two smallest weights"
- "optimal prefix code"
- "weighted path length / expected code length"
- "classic Huffman coding from the data-compression chapter"
Strong anti-cues:
- the task is really one range or prefix feasibility scan -> Prefix Constraints
- the point is actual byte-level compression format details
- the alphabet changes online and merge history is not the right object
Prerequisites¶
- Greedy overview, because you should already be comfortable with exchange-style greedy thinking
- Heaps And Ordered Sets, because the implementation route is one disciplined min-heap process
- Sorting helps as a lighter compare point, because many merge problems begin by asking what order matters
Helpful compare points:
- Heaps And Ordered Sets: use this when the main question is only priority-queue mechanics, not greedy optimality
- Prefix Constraints: use that family when feasibility lives in one scan over ordered input, not one merge forest
- Minimum Spanning Tree: another greedy tree-building family, but the proof lens is cut safety rather than weighted depth
Problem Model¶
The canonical task in this lane is:
- you have n symbols
- symbol i has frequency f[i] > 0
- build one binary prefix code minimizing:
sum(f[i] * code_length[i])
Equivalently:
- build one full binary tree whose leaves are the symbols
- minimize weighted leaf depth
That is the exact first model this repo teaches.
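For concreteness, a tiny hand-checkable instance (illustrative, not from the repo): with frequencies f = {2, 1, 1}, the prefix code {0, 10, 11} costs 2*1 + 1*2 + 1*2 = 6, and no binary prefix code on three symbols can beat one leaf at depth 1 plus two leaves at depth 2.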
The Greedy Pivot¶
The central greedy fact is:
- in some optimal tree, the two lightest symbols can be made siblings at maximum depth
So if you merge those two symbols into one combined weight:
w = x + y
the remaining problem is the same problem on one smaller instance.
That is why the algorithm becomes:
- keep all current trees in a min-heap by weight
- remove the two lightest
- merge them
- push their parent back
- repeat until one tree remains
This is the exact greedy backbone of Huffman coding.
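A minimal cost-only sketch of that loop, in the spirit of the repo's huffman-coding.cpp but not a copy of it (the function name and hard-coded input here are illustrative):

```cpp
#include <iostream>
#include <queue>
#include <vector>

// Cost-only Huffman merge: pop the two lightest weights, pay their sum,
// push the merged weight back, until one tree remains.
long long huffmanCost(const std::vector<long long>& freq) {
    if (freq.size() < 2) return 0;  // one symbol: empty codeword, cost 0
    std::priority_queue<long long, std::vector<long long>,
                        std::greater<long long>>
        pq(freq.begin(), freq.end());
    long long total = 0;
    while (pq.size() > 1) {
        long long x = pq.top(); pq.pop();
        long long y = pq.top(); pq.pop();
        total += x + y;  // invariant 2: this merge pays x + y exactly once
        pq.push(x + y);
    }
    return total;        // invariant 3: the optimum is the sum of merge costs
}

int main() {
    std::cout << huffmanCost({45, 13, 12, 16, 9, 5}) << "\n";  // prints 224
}
```

Each iteration removes two trees and adds one, so the loop runs exactly n - 1 times on n symbols.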
Core Invariants¶
The repo starter keeps four contest-facing invariants.
1. The Heap Stores The Current Forest¶
Every heap item means:
- one subtree
- one total frequency
- one deterministic tie-break key for reproducible output
So the heap is not just "numbers"; it is the current reduced instance.
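One way to realize that, as a sketch (the field names and the insertion-order tie-break are assumptions of this sketch, not necessarily the starter's exact policy):

```cpp
// One heap item per live subtree: total weight plus a deterministic
// tie-break key, so equal weights always pop in the same order.
struct Node {
    long long weight;  // total frequency of all leaves in this subtree
    int id;            // creation order; used only to break weight ties
    int left, right;   // children as node-pool indices, -1 for a leaf
};

// Min-heap comparator: lighter subtree first, earlier id on ties.
struct ByWeightThenId {
    bool operator()(const Node& a, const Node& b) const {
        if (a.weight != b.weight) return a.weight > b.weight;
        return a.id > b.id;
    }
};
```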
2. Every Merge Adds Its Weight Once To The Final Objective¶
If two subtrees of weights a and b are merged, then all leaves inside them go one level deeper.
So that merge increases the final weighted path length by exactly:
a + b
This is the fastest sanity check in the whole lane.
3. The Final Cost Equals The Sum Of All Merge Costs¶
You can compute the optimum either by:
- DFS on final leaf depths
- or summing every internal merge weight
Those two numbers must agree.
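A sketch of that cross-check, building the merge tree in an explicit node pool (the pool layout and variable names are this sketch's assumptions, not the repo starter's):

```cpp
#include <cassert>
#include <iostream>
#include <queue>
#include <utility>
#include <vector>

int main() {
    std::vector<long long> f = {45, 13, 12, 16, 9, 5};
    int n = (int)f.size();

    // Node pool: leaves are 0..n-1, internal nodes appended per merge.
    std::vector<long long> weight = f;
    std::vector<int> left(n, -1), right(n, -1);

    using Item = std::pair<long long, int>;  // (weight, node id)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> pq;
    for (int i = 0; i < n; i++) pq.push({f[i], i});

    long long mergeSum = 0;
    while (pq.size() > 1) {
        auto [wx, x] = pq.top(); pq.pop();
        auto [wy, y] = pq.top(); pq.pop();
        weight.push_back(wx + wy);
        left.push_back(x);
        right.push_back(y);
        mergeSum += wx + wy;  // cost paid by this merge
        pq.push({wx + wy, (int)weight.size() - 1});
    }
    int root = pq.top().second;

    // DFS: weighted external path length = sum of f[leaf] * depth(leaf).
    long long depthSum = 0;
    std::vector<std::pair<int, int>> stack = {{root, 0}};  // (node, depth)
    while (!stack.empty()) {
        auto [u, d] = stack.back(); stack.pop_back();
        if (left[u] == -1) { depthSum += weight[u] * d; continue; }
        stack.push_back({left[u], d + 1});
        stack.push_back({right[u], d + 1});
    }

    assert(mergeSum == depthSum);  // the two views of the optimum agree
    std::cout << mergeSum << "\n"; // prints 224 on this input
}
```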
4. One-Symbol Input Needs A Convention¶
If n = 1, the mathematically clean codeword is the empty string.
This repo prints:
- weighted cost 0
- code "-"
so the output stays visible and unambiguous.
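As a sketch, that convention is a two-line special case before the heap loop (printSingleSymbol is a hypothetical name, not the starter's actual function):

```cpp
#include <iostream>

// Hypothetical helper for the n == 1 convention described above: the lone
// codeword has length 0, so the weighted cost is 0, and a visible "-"
// placeholder stands in for the empty string.
void printSingleSymbol() {
    std::cout << 0 << "\n";    // weighted cost
    std::cout << "-" << "\n";  // placeholder codeword
}
```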
Complexity And Tradeoffs¶
With n symbols:
- heap-based build: O(n log n)
- DFS to assign codes: O(n)
- memory: O(n)
Rule of thumb:
- one optimal merge / prefix-code tree -> this lane
- one scan with prefix feasibility -> Prefix Constraints
- one PQ process without a proof obligation -> reopen Heaps And Ordered Sets
Worked Example¶
Classic frequencies:
45 13 12 16 9 5
The deterministic repo starter builds one valid optimal code such as:
0
101
100
111
1101
1100
with weighted cost:
224
The exact codes depend on the tie policy, but the minimum cost is the real invariant.
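You can confirm that cost by hand through invariant 3, summing the merges in heap order:
- 5 + 9 = 14
- 12 + 13 = 25
- 14 + 16 = 30
- 25 + 30 = 55
- 45 + 55 = 100
and 14 + 25 + 30 + 55 + 100 = 224, matching the DFS-depth computation.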
Common Pitfalls¶
- proving only that the smallest pair "looks good" instead of using the sibling-at-maximum-depth reduction
- forgetting that multiple optimal code assignments can exist under ties
- trying to teach file headers, byte packing, or decompression formats in the first lane
- mishandling the one-symbol case
Exit Criteria¶
After this section, you should be able to:
- explain why the two smallest weights are merged first
- compute weighted path length both from merge sums and from final depths
- implement one deterministic binary Huffman builder with a min-heap
- know why this lane is about greedy optimal merge, not about industrial compression tooling
Go Deeper¶
- Reference: Algorithms 4/e - Huffman
- Reference: Algorithms 4/e - Data Compression slides
- Reference: Cornell CS312 - Priority Queues, Heaps, Huffman Coding
- Practice: Greedy ladders