Rate-Distortion and Representation Tradeoffs

How lossy compression trades rate against fidelity, why mutual information controls the best achievable rate at a target distortion, and how this becomes a language for representation tradeoffs.
Modified April 26, 2026

Keywords

rate distortion, lossy compression, representation learning, tradeoff, mutual information

1 Role

This is the fifth page of the Information Theory module.

Its job is to explain what happens when perfect reconstruction is too expensive or unnecessary:

  • how much can we compress
  • how much distortion must we tolerate
  • what is the best rate for a target fidelity level

This is where information theory turns compression into an explicit rate-versus-quality tradeoff.

2 First-Pass Promise

Read this page after Channel Coding, Capacity, and Converse Proofs.

If you stop here, you should still understand:

  • what a distortion measure is doing
  • what the rate-distortion function R(D) means
  • why more tolerated distortion can reduce the required rate
  • why rate-distortion is a natural language for representation tradeoffs

3 Why It Matters

Lossless compression asked:

how many bits are needed if we insist on exact reconstruction?

But many real problems do not need exact recovery:

  • images can tolerate small visual error
  • audio can tolerate perceptual loss
  • latent representations need only preserve the task-relevant structure
  • scientific summaries often trade fidelity for compactness

So the question changes:

if we allow some distortion, how much can we reduce the rate?

At a first pass:

  • rate measures how many bits we spend
  • distortion measures reconstruction quality loss
  • the rate-distortion function gives the best possible rate at a target distortion level
  • mutual information reappears as the quantity that characterizes this optimum

This is the clean bridge from classical lossy compression to modern language about bottlenecks and compressed representations.

4 Prerequisite Recall

  • entropy set the compression threshold for lossless coding
  • mutual information measured dependence and uncertainty reduction
  • channel capacity was an optimization of mutual information with an operational meaning
  • now we will optimize mutual information again, but under a fidelity constraint instead of a channel constraint

5 Intuition

5.1 Rate And Fidelity Pull In Opposite Directions

If you want a nearly perfect reconstruction, you need a richer description and therefore a higher rate.

If you are willing to tolerate more error, you can describe the source more coarsely and therefore use fewer bits.

So the central object should be a tradeoff curve, not a single number.

5.2 Distortion Measures What Kind Of Error Matters

The right notion of error depends on the task.

Examples:

  • Hamming distortion for symbol mismatches
  • squared error for numeric approximation
  • perceptual or task-specific losses in more modern settings

Rate-distortion theory does not choose the distortion for you. It tells you the best achievable rate once a distortion notion is fixed.
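As a concrete sketch, the first two distortion measures above take only a few lines of NumPy. The function names here are our own, purely illustrative choices, not notation from this page:

```python
import numpy as np

# Illustrative sketch of two classical distortion measures.

def hamming_distortion(u, v):
    """Fraction of positions where the reconstruction differs from the source."""
    u, v = np.asarray(u), np.asarray(v)
    return float(np.mean(u != v))

def squared_error_distortion(u, v):
    """Mean squared error, the usual choice for numeric approximation."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.mean((u - v) ** 2))

u = [0, 1, 1, 0]   # source symbols
v = [0, 1, 0, 0]   # reconstruction with one mismatch
print(hamming_distortion(u, v))        # 0.25: one error in four symbols
print(squared_error_distortion(u, v))  # 0.25 here too, since each error is 0/1
```

On binary data the two measures coincide; on real-valued data they diverge, which is exactly why the choice of distortion is part of the problem statement.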

5.3 Mutual Information Reappears As The Compression Cost

In channel coding, mutual information measured how much information the channel can preserve.

In rate-distortion theory, mutual information measures how much information the reconstruction retains about the source.

So the clean first-pass picture is:

to keep distortion small, the reconstruction must retain enough information about the source

5.4 Zero Distortion Recovers The Lossless Story

If the allowed distortion is zero, then rate-distortion collapses back toward lossless coding.

So rate-distortion generalizes the lossless entropy story rather than replacing it.

6 Formal Core

Think of a source U and a reconstruction V, together with a distortion measure d(U,V).

Definition 1 (Definition Idea: Achievable Rate-Distortion Pair) A pair (R,D) is achievable if there exist long block codes whose rate is at most R and whose average distortion is at most D, up to arbitrarily small slack, for sufficiently large block length.

So achievability now depends on both compression rate and tolerated distortion.

Definition 2 (Definition: Rate-Distortion Function) The rate-distortion function R(D) is the infimum of all rates R such that (R,D) is achievable.

This means:

R(D) = the smallest rate needed to achieve distortion level D

Definition 3 (Definition Idea: Information Rate-Distortion Function) For a source U, the information-theoretic expression is

\[ R(D)=\min_{P_{V\mid U}: \, \mathbb{E}[d(U,V)]\le D} I(U;V). \]

At first pass, read this as:

among all reconstructions that keep distortion below D, choose the one that retains as little information as possible while still meeting the fidelity target

Theorem 1 (Theorem Idea: Rate-Distortion Theorem) For a memoryless source and a fixed distortion measure, the operational rate-distortion function equals the information expression above.

So the optimization over mutual information is not just a heuristic. It is the exact fundamental limit.

Theorem 2 (Theorem Idea: Monotonicity) The function R(D) is nonincreasing in D.

That matches intuition:

if you allow more distortion, you should not need a higher rate
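Although this page stays at first pass, the constrained minimization in Definition 3 is concretely computable for finite alphabets: the Blahut-Arimoto algorithm alternates between the optimal test channel for a fixed output marginal and the marginal induced by that channel. The sketch below is our own illustrative code, not material from this page; `beta` is the Lagrange multiplier that selects one point on the tradeoff curve:

```python
import numpy as np

def blahut_arimoto(p_u, d, beta, n_iter=200):
    """One point on the R(D) curve, for slope parameter beta > 0.

    p_u  : source distribution over the |U| source symbols
    d    : |U| x |V| distortion matrix, d[u, v] = d(u, v)
    beta : Lagrange multiplier trading rate against distortion
    Returns (rate in bits, expected distortion).
    """
    p_u = np.asarray(p_u, dtype=float)
    q_v = np.full(d.shape[1], 1.0 / d.shape[1])  # start from a uniform output marginal
    for _ in range(n_iter):
        w = q_v * np.exp(-beta * d)              # unnormalized test channel P(v|u)
        w /= w.sum(axis=1, keepdims=True)
        q_v = p_u @ w                            # output marginal induced by the channel
    joint = p_u[:, None] * w
    rate = float(np.sum(joint * np.log2(w / q_v)))  # I(U;V) in bits
    dist = float(np.sum(joint * d))                 # E[d(U,V)]
    return rate, dist

# Bernoulli(1/2) source with Hamming distortion, where R(D) = 1 - h(D):
p = [0.5, 0.5]
d = np.array([[0.0, 1.0], [1.0, 0.0]])
R, D = blahut_arimoto(p, d, beta=2.0)
print(R, D)   # R ≈ 0.47 bits at D ≈ 0.12
```

Sweeping `beta` traces out the whole curve: a large `beta` penalizes distortion heavily (high rate, low D), a small `beta` does the reverse.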

7 Worked Example

Consider a Bernoulli source U ~ Ber(p) with Hamming distortion:

  • distortion is 0 when V = U
  • distortion is 1 when V ≠ U

At one extreme:

  • if D = 0, you are asking for exact reconstruction, so the required rate is the source entropy h(p), matching the lossless story

At the other extreme:

  • if D ≥ min(p, 1-p), the required rate drops all the way to 0: always outputting the more likely symbol already meets the distortion target, so no bits need to be sent

For intermediate D, the rate-distortion function is the explicit curve R(D) = h(p) - h(D), where h is the binary entropy function; it interpolates between these two regimes.

What matters at first pass is the geometry:

  • tighter fidelity target -> higher rate
  • looser fidelity target -> lower rate
  • the tradeoff is fundamental, not an artifact of a particular codec
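For the Bernoulli source this geometry has a standard closed form, R(D) = h(p) - h(D) for D below min(p, 1-p) and 0 afterwards, where h is the binary entropy function. A small sketch (our own illustrative code) that checks the shape of the curve:

```python
import numpy as np

def h2(x):
    """Binary entropy in bits, with h2(0) = h2(1) = 0."""
    x = np.clip(x, 1e-12, 1 - 1e-12)
    return float(-x * np.log2(x) - (1 - x) * np.log2(1 - x))

def bernoulli_rd(p, D):
    """R(D) for a Bernoulli(p) source under Hamming distortion.

    Standard closed form: h(p) - h(D) for D < min(p, 1-p), else 0.
    """
    if D >= min(p, 1 - p):
        return 0.0   # outputting the majority symbol already meets the target
    return h2(p) - h2(D)

print(bernoulli_rd(0.3, 0.0))   # ≈ 0.881: exact recovery costs the full entropy h(0.3)
print(bernoulli_rd(0.3, 0.1))   # ≈ 0.412: looser fidelity target, lower rate
print(bernoulli_rd(0.3, 0.3))   # 0.0: distortion this large needs no bits at all
```

The curve is nonincreasing in D, matching the monotonicity theorem of the formal core.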

8 Computation Lens

When a paper mentions rate-distortion or a bottleneck tradeoff, ask:

  1. what is the source variable?
  2. what is the reconstruction or representation?
  3. what distortion measure is actually being optimized or constrained?
  4. is the result operational, variational, or only heuristic?
  5. is mutual information being minimized, bounded, or approximated?

These questions usually reveal whether the paper is truly using rate-distortion theory or only borrowing its language.

9 Application Lens

9.1 Lossy Compression

This is the classical home of the theory: image, audio, and source compression under quality constraints.

9.2 Representation Learning

A representation can be viewed as a compressed proxy for the original signal. Rate-distortion language helps articulate what information is being preserved and what is being sacrificed.

9.3 Statistics And Task-Specific Summaries

Many statistical procedures compress raw data into summaries. Rate-distortion gives a principled language for asking how much fidelity is lost when compression is forced by storage, privacy, bandwidth, or computation.

10 Stop Here For First Pass

If you stop here, retain these five ideas:

  • rate-distortion theory studies compression when reconstruction may be imperfect
  • distortion specifies what kind of error matters
  • R(D) is the smallest rate needed to achieve distortion D
  • the information characterization is a constrained mutual-information minimization
  • rate-distortion is the classical precursor to many modern representation tradeoff stories

That is enough to read most first-pass lossy-compression statements without getting lost.

13 Sources and Further Reading

  • MIT 6.441: Information Theory - First pass - official course page for the overall compression and communication structure of the field. Checked 2026-04-25.
  • MIT 6.441 Chapter 23: Rate-Distortion Theory - First pass - official MIT chapter specifically for the lossy-compression limit. Checked 2026-04-25.
  • Stanford EE376A course outline - First pass - official outline showing how rate-distortion fits after source and channel coding. Checked 2026-04-25.
  • Stanford EE376A lecture notes - Second pass - official full notes containing rate-distortion examples and theory. Checked 2026-04-25.
  • Stanford EE376A lecture 12 - Second pass - official notes focused on the rate-distortion function and examples. Checked 2026-04-25.
  • Stanford EE377 bulletin - Second pass - official current description of information theory meeting statistics, where lower-bound and representation themes remain relevant. Checked 2026-04-25.