Mutual Information, Conditional Entropy, and Data Processing

How conditional entropy measures remaining uncertainty, how mutual information measures uncertainty reduction, and why post-processing cannot create information.
Modified: April 26, 2026

Keywords

mutual information, conditional entropy, data processing inequality, information flow, dependence

1 Role

This is the second page of the Information Theory module.

Its job is to move from one-distribution quantities to two-variable structure:

  • how much uncertainty remains after observing something else
  • how much information one variable carries about another
  • why post-processing cannot create information that was not already there

These are the core objects behind dependence, representation quality, feature usefulness, and communication through noise.

2 First-Pass Promise

Read this page after Entropy, Cross-Entropy, and KL Divergence.

If you stop here, you should still understand:

  • what conditional entropy measures
  • what mutual information measures
  • why mutual information is always nonnegative
  • why the data processing inequality is one of the most reusable ideas in the whole field

3 Why It Matters

A large fraction of information-theory language is really about one question:

how much does observing Y tell us about X?

That question shows up in several disguises:

  • a sensor observing a hidden state
  • a noisy communication channel
  • a feature trying to predict a label
  • a representation trying to preserve useful structure
  • a statistic trying to compress data without throwing away too much

At a first pass:

  • conditional entropy measures uncertainty left over after seeing another variable
  • mutual information measures uncertainty removed by that observation
  • data processing says that once information is lost through a channel or transformation, later post-processing cannot get it back

Those three ideas are the load-bearing bridge from entropy and KL divergence to coding, capacity, lower bounds, and many ML applications.

4 Prerequisite Recall

  • entropy H(X) measures intrinsic uncertainty in a discrete random variable X
  • KL divergence compares two distributions and is always nonnegative
  • the joint distribution P_{XY} describes a pair (X,Y)
  • the marginals P_X and P_Y describe each variable separately
  • independence means P_{XY}=P_XP_Y

5 Intuition

5.1 Conditional Entropy Measures Remaining Uncertainty

Suppose Y is a noisy observation of X.

Before observing Y, your uncertainty about X is H(X).

After observing Y, some of that uncertainty may disappear.

Conditional entropy H(X|Y) measures the uncertainty that remains on average after Y is revealed.

5.2 Mutual Information Measures Uncertainty Reduction

The difference

H(X) - H(X|Y)

is the average reduction in uncertainty about X caused by seeing Y.

That reduction is the mutual information I(X;Y).

So the clean first-pass interpretation is:

mutual information = how much knowing one variable helps with the other

5.3 Mutual Information Also Measures Departure From Independence

If X and Y are independent, then learning Y tells you nothing about X.

So mutual information should be zero.

In fact, mutual information is exactly the KL divergence between:

  • the true joint distribution P_{XY}
  • the product P_XP_Y that would hold under independence

So mutual information is also a dependence measure.

5.4 Data Processing Says Information Cannot Increase Under Post-Processing

If information flows through a chain

X -> Y -> Z

where Z is computed only from Y, then Z cannot contain more information about X than Y already did.

So

I(X;Z) <= I(X;Y)

This is the data processing inequality.

At a first pass, the right mental picture is:

post-processing may discard or rearrange information, but it cannot create new information about the original source

6 Formal Core

For this first pass, we stay with discrete random variables.

Definition 1 (Definition: Conditional Entropy) For discrete random variables X and Y, the conditional entropy of X given Y is

\[ H(X\mid Y)= -\sum_{x,y} p(x,y)\log p(x\mid y). \]
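As a quick numerical companion to this definition, here is a minimal sketch in Python (assuming numpy; the toy joint table is the fair-bit-through-10%-noise setup worked out in Section 7):

    import numpy as np

    def conditional_entropy(joint):
        # H(X|Y) in bits from a joint table p(x,y): rows indexed by x, columns by y.
        joint = np.asarray(joint, dtype=float)
        p_y = joint.sum(axis=0)                        # marginal P_Y (column sums)
        with np.errstate(divide="ignore", invalid="ignore"):
            p_x_given_y = joint / p_y                  # p(x|y), broadcast over columns
        mask = joint > 0                               # skip zero-probability cells
        return -np.sum(joint[mask] * np.log2(p_x_given_y[mask]))

    # Toy joint: a fair bit X observed through a channel that flips it 10% of the time.
    joint = [[0.45, 0.05],
             [0.05, 0.45]]
    print(conditional_entropy(joint))                  # ~0.469 bits of X remain after seeing Y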

Definition 2 (Definition: Mutual Information) The mutual information between X and Y is

\[ I(X;Y)=H(X)-H(X\mid Y). \]

By symmetry, it is also

\[ I(X;Y)=H(Y)-H(Y\mid X). \]

Theorem 1 (Theorem Idea: Mutual Information Is KL Divergence From Independence) For discrete random variables X and Y,

\[ I(X;Y)=D(P_{XY}\|P_XP_Y). \]

This makes the nonnegativity of mutual information immediate from the nonnegativity of KL divergence.
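Both routes to I(X;Y) can be checked numerically. A hedged sketch, again assuming numpy and the same toy joint as above (which has no zero cells, so the logs are safe):

    import numpy as np

    joint = np.array([[0.45, 0.05],
                      [0.05, 0.45]])                   # toy P_XY (rows: x, columns: y)
    p_x, p_y = joint.sum(axis=1), joint.sum(axis=0)

    def entropy(p):
        p = np.asarray(p, dtype=float).ravel()
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    # Route 1: uncertainty reduction, I(X;Y) = H(X) - H(X|Y).
    h_x_given_y = -np.sum(joint * np.log2(joint / p_y))
    i_via_entropy = entropy(p_x) - h_x_given_y

    # Route 2: KL divergence from the independent product, I(X;Y) = D(P_XY || P_X P_Y).
    i_via_kl = np.sum(joint * np.log2(joint / np.outer(p_x, p_y)))

    print(i_via_entropy, i_via_kl)                     # both ~0.531 bits

Agreement between the two routes is exactly the content of the theorem: one route never mentions KL divergence, the other never mentions entropy.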

Theorem 2 (Theorem Idea: Mutual Information Is Nonnegative) For discrete random variables X and Y,

\[ I(X;Y)\ge 0, \]

with equality if and only if X and Y are independent.

Theorem 3 (Theorem Idea: Chain Rule For Entropy) For discrete random variables X and Y,

\[ H(X,Y)=H(Y)+H(X\mid Y)=H(X)+H(Y\mid X). \]

This is the cleanest way to remember why conditional entropy and mutual information fit together.
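A short numerical check of the chain rule on the same toy joint (assuming numpy):

    import numpy as np

    def entropy(p):
        p = np.asarray(p, dtype=float).ravel()
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    joint = np.array([[0.45, 0.05],
                      [0.05, 0.45]])
    p_y = joint.sum(axis=0)
    h_x_given_y = -np.sum(joint * np.log2(joint / p_y))

    # Chain rule: H(X,Y) = H(Y) + H(X|Y).
    print(entropy(joint))                              # H(X,Y) ~1.469 bits
    print(entropy(p_y) + h_x_given_y)                  # H(Y) + H(X|Y), also ~1.469 bits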

Theorem 4 (Theorem Idea: Data Processing Inequality) If X -> Y -> Z forms a Markov chain, then

\[ I(X;Z)\le I(X;Y). \]

At a first pass, treat this as the formal version of:

post-processing cannot increase information about the original source
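A small sketch of the inequality in action, assuming numpy: X is a fair bit, Y is X sent through a binary symmetric channel with flip probability 0.1, and Z is Y sent through a second channel with flip probability 0.2 (the second channel is an illustrative choice, not from the text). Since Z depends on X only through Y, the chain X -> Y -> Z is Markov:

    import numpy as np

    def mutual_information(joint):
        # I(X;Y) = D(P_XY || P_X P_Y) in bits for a joint table (rows: x, columns: y).
        joint = np.asarray(joint, dtype=float)
        prod = np.outer(joint.sum(axis=1), joint.sum(axis=0))
        mask = joint > 0
        return np.sum(joint[mask] * np.log2(joint[mask] / prod[mask]))

    def bsc(flip):
        # Transition matrix of a binary symmetric channel with the given flip probability.
        return np.array([[1 - flip, flip],
                         [flip, 1 - flip]])

    p_x = np.array([0.5, 0.5])                         # fair-bit source
    joint_xy = np.diag(p_x) @ bsc(0.1)                 # P_XY: Y is the BSC(0.1) output
    joint_xz = np.diag(p_x) @ (bsc(0.1) @ bsc(0.2))    # P_XZ: Z sees X only through Y

    print(mutual_information(joint_xy))                # ~0.531 bits
    print(mutual_information(joint_xz))                # ~0.173 bits: strictly less, as the DPI demands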

7 Worked Example

Let X be a fair bit:

\[ P(X=0)=P(X=1)=1/2. \]

Now let Y be a noisy copy of X that flips with probability 0.1.

So:

  • with probability 0.9, Y=X
  • with probability 0.1, Y ≠ X

Then:

  • H(X)=1 bit because X is a fair bit
  • after observing Y, there is still some uncertainty because the channel is noisy
  • that remaining uncertainty is H(X|Y)
  • the mutual information is I(X;Y)=H(X)-H(X|Y)

What matters at first pass is the qualitative picture:

  • if the flip probability were 0, then Y would reveal X perfectly, so H(X|Y)=0 and I(X;Y)=1
  • if the flip probability were 1/2, then Y would be pure noise independent of X, so H(X|Y)=H(X)=1 and I(X;Y)=0
  • the actual case 0.1 lies in between: Y is useful, but not perfect

This is the basic communication and noisy-observation picture behind mutual information.
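To put rough numbers on that picture, a minimal sketch (assuming numpy) using the fact that for a fair input bit through a flip-probability-f channel, H(X|Y) equals the binary entropy H_b(f):

    import numpy as np

    def h_binary(p):
        # Binary entropy H_b(p) in bits, with the convention H_b(0) = H_b(1) = 0.
        if p in (0.0, 1.0):
            return 0.0
        return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

    # For each flip probability f: H(X|Y) = H_b(f), so I(X;Y) = 1 - H_b(f).
    for f in (0.0, 0.1, 0.5):
        print(f, h_binary(f), 1.0 - h_binary(f))

    # f = 0.0: I = 1 bit (perfect copy)
    # f = 0.1: I ~0.531 bits (useful but imperfect)
    # f = 0.5: I = 0 bits (pure noise, independent of X)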

8 Computation Lens

When you see mutual information or a data processing argument in a paper, ask:

  1. what is the source variable and what is the observation or representation?
  2. is the claim about uncertainty reduction, dependence, or both?
  3. is the proof using I(X;Y)=H(X)-H(X|Y) or using I(X;Y)=D(P_{XY}‖P_XP_Y)?
  4. where is the Markov chain or post-processing step?
  5. is the author bounding mutual information because exact computation is hard?

Those questions usually reveal whether the argument is really about prediction, compression, communication, or lower bounds.

9 Application Lens

9.1 Communication Through Noise

In a channel, X is the transmitted message and Y is the received observation.

Mutual information measures how much of the original message survives the channel.

9.2 Feature Usefulness And Representation Learning

If Y is a feature or representation derived from raw input X, then I(X;Y) measures how much information the representation keeps about the source.

This does not solve representation learning by itself, but it explains why information quantities keep appearing in the literature.

9.3 Statistics And Lower Bounds

In statistics and learning theory, the data processing inequality helps show that if the observed data carry only limited information about a hidden parameter, then no estimator computed from those data can recover the parameter more accurately than that limited information allows.

This is one of the main bridges from basic information measures to minimax lower bounds.

10 Stop Here For First Pass

If you stop here, retain these five ideas:

  • H(X|Y) is the uncertainty left after observing Y
  • I(X;Y) is the uncertainty reduction from observing Y
  • I(X;Y)=D(P_{XY}‖P_XP_Y)
  • mutual information is zero exactly at independence
  • data processing says post-processing cannot create information about the original source

That is enough to read a surprising amount of later notation without getting lost.

11 Go Deeper

The strongest next steps from this page are the readings collected in the next two sections.

12 Optional Deeper Reading After First Pass

If you want a stronger second pass on the same ideas, use:

  • MIT 6.441 lecture notes for mutual information, conditional information, and the data processing viewpoint. Checked 2026-04-25.
  • Stanford EE376A lecture notes for a clean first-course treatment of entropy, mutual information, and coding ideas. Checked 2026-04-25.
  • Stanford EE376A lecture 3 for a compact official treatment of entropy, relative entropy, and mutual information. Checked 2026-04-25.

13 Sources and Further Reading

  • MIT 6.441 lecture notes - First pass - official lecture-note index with mutual information, conditional information, and strong data-processing topics. Checked 2026-04-25.
  • Stanford EE376A: Information Theory - First pass - official course page introducing entropy, mutual information, compression, and communication. Checked 2026-04-25.
  • Stanford EE376A lecture notes - Second pass - official notes for a full first course in information theory. Checked 2026-04-25.
  • Stanford EE376A lecture 3 - Second pass - official notes focused on entropy, relative entropy, and mutual information. Checked 2026-04-25.
  • Stanford EE377 bulletin - Second pass - official current description of information-theoretic methods meeting probability and statistics. Checked 2026-04-25.