Entropy, Cross-Entropy, and KL Divergence

How entropy measures uncertainty, how cross-entropy measures expected log-loss under a proposed model, and why KL divergence is the nonnegative gap between cross-entropy and entropy.
Modified

April 26, 2026

Keywords

entropy, cross-entropy, KL divergence, relative entropy, log-loss

1 Role

This is the first page of the Information Theory module.

Its job is to introduce the three information measures that will keep reappearing everywhere else:

  • entropy
  • cross-entropy
  • KL divergence

These are the smallest reusable pieces of vocabulary for talking about uncertainty, mismatch, coding cost, and log-loss.

2 First-Pass Promise

Read this page first in the module.

If you stop here, you should still understand:

  • what entropy measures
  • what cross-entropy measures
  • why KL divergence is a gap between two distributions
  • why cross-entropy equals entropy plus KL divergence

3 Why It Matters

These quantities appear under many different names across the site:

  • uncertainty
  • information content
  • coding cost
  • negative log-likelihood
  • log-loss
  • regularized or variational objectives

Without one clean picture, it is easy to memorize formulas but miss the structure.

This page gives that structure.

At a first pass:

  • entropy measures how uncertain a distribution is
  • cross-entropy measures how expensive it is to code or predict data from P using a model Q
  • KL divergence measures the extra cost of using Q when the truth is P

That last sentence is one of the most reusable interpretations in modern ML and statistics.

4 Prerequisite Recall

  • a discrete distribution assigns probabilities p(x) to outcomes x
  • expectation under P means averaging with respect to the true distribution P
  • logarithm base 2 gives units in bits, while natural logarithm gives units in nats
  • negative log-probability acts like a surprise or coding length (see the short sketch after this list)
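
The last two points can be made concrete with a minimal Python sketch; the variable names are only illustrative:

    import math

    p = 0.25  # probability of some outcome

    # Negative log-probability is the "surprise" (self-information) of the outcome.
    surprise_bits = -math.log2(p)  # 2.0 bits: a 1-in-4 event costs two yes/no questions
    surprise_nats = -math.log(p)   # about 1.386 nats

    # Bits and nats measure the same thing; they differ only by a factor of ln 2.
    assert abs(surprise_bits * math.log(2) - surprise_nats) < 1e-12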

5 Intuition

5.1 Entropy Measures Intrinsic Uncertainty

If a distribution is concentrated on one outcome, there is little uncertainty.

If it spreads mass across many outcomes, there is more uncertainty.

Entropy summarizes that uncertainty into one number.

5.2 Cross-Entropy Measures Prediction Or Coding Under Mismatch

Suppose the world really generates data from P, but we use a model Q.

Then cross-entropy asks:

how costly is it, on average, to encode or predict data from P as if Q were correct?

So cross-entropy is not a property of one distribution alone. It is a mismatch quantity.

5.3 KL Divergence Is The Extra Cost Of Mismatch

KL divergence compares P and Q directly.

It is zero exactly when the two distributions agree, and positive otherwise.

At a first pass, the right interpretation is:

KL divergence is the extra expected log-loss caused by using Q instead of the truth P

5.4 Cross-Entropy Splits Cleanly Into Entropy Plus KL

This is the key identity:

cross-entropy = entropy + KL divergence

So cross-entropy contains:

  • the irreducible uncertainty already present in P
  • plus the extra penalty from model mismatch

That decomposition is why these three quantities belong on one opening page.

6 Formal Core

For this first pass, we stay with discrete distributions.

Definition 1 (Definition: Entropy) For a discrete random variable X with distribution P, the entropy is

\[ H(P)= - \sum_x p(x)\log p(x). \]

Definition 2 (Definition: Cross-Entropy) For discrete distributions P and Q on the same outcome space, with q(x)>0 wherever p(x)>0, the cross-entropy of P relative to Q is

\[ H(P,Q)= -\sum_x p(x)\log q(x). \]

Definition 3 (Definition: KL Divergence) The KL divergence, or relative entropy, from P to Q is

\[ D(P\|Q)= \sum_x p(x)\log\frac{p(x)}{q(x)}. \]

Note that the expectation is taken under P, so D(P\|Q) and D(Q\|P) are in general different: KL divergence is not symmetric.

Theorem 1 (Theorem Idea: Cross-Entropy Decomposes Into Entropy Plus KL) For discrete distributions P and Q,

\[ H(P,Q)=H(P)+D(P\|Q). \]

This identity is one of the central algebraic facts of the whole module.
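
To see why it holds, split the logarithm inside the cross-entropy sum:

\[ H(P,Q) = -\sum_x p(x)\log q(x) = -\sum_x p(x)\log p(x) + \sum_x p(x)\log\frac{p(x)}{q(x)} = H(P)+D(P\|Q). \]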

Theorem 2 (Theorem Idea: Gibbs Inequality) For discrete distributions P and Q,

\[ D(P\|Q)\ge 0, \]

with equality if and only if P=Q on the support of P.

This follows from Jensen's inequality applied to the concave logarithm. Combined with the decomposition above, it means the cross-entropy H(P,Q) is never smaller than the entropy H(P): the true distribution minimizes cross-entropy against itself.
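
To make the definitions concrete, here is a small Python sketch written directly from the formulas above (the function names and the example distributions are only illustrative); it checks the decomposition and the Gibbs inequality numerically:

    import math

    def entropy(p, base=2.0):
        """H(P) = -sum_x p(x) log p(x); terms with p(x) = 0 contribute nothing."""
        return -sum(px * math.log(px, base) for px in p if px > 0)

    def cross_entropy(p, q, base=2.0):
        """H(P, Q) = -sum_x p(x) log q(x); needs q(x) > 0 wherever p(x) > 0."""
        return -sum(px * math.log(qx, base) for px, qx in zip(p, q) if px > 0)

    def kl_divergence(p, q, base=2.0):
        """D(P || Q) = sum_x p(x) log(p(x) / q(x))."""
        return sum(px * math.log(px / qx, base) for px, qx in zip(p, q) if px > 0)

    p = [0.5, 0.25, 0.125, 0.125]
    q = [0.25, 0.25, 0.25, 0.25]

    # Cross-entropy splits into entropy plus KL divergence.
    assert abs(cross_entropy(p, q) - (entropy(p) + kl_divergence(p, q))) < 1e-12

    # Gibbs inequality: KL is nonnegative, and zero only when the distributions agree.
    assert kl_divergence(p, q) >= 0
    assert abs(kl_divergence(p, p)) < 1e-12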

7 Worked Example

Let

\[ P=(0.9,0.1), \qquad Q=(0.6,0.4). \]

Then the entropy of P is

\[ H(P)= -0.9\log 0.9 - 0.1\log 0.1. \]

The cross-entropy of P relative to Q is

\[ H(P,Q)= -0.9\log 0.6 - 0.1\log 0.4. \]

And the KL divergence is

\[ D(P\|Q)=0.9\log \frac{0.9}{0.6} + 0.1\log \frac{0.1}{0.4}. \]

You do not need the exact decimal values to see the structure:

  • P is quite concentrated, so its entropy is not very large
  • Q is less concentrated and mismatched to P
  • that mismatch makes cross-entropy larger than entropy
  • the difference is exactly the KL divergence

This is the cleanest first example of:

irreducible uncertainty + mismatch penalty
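
For readers who do want the numbers, a short Python check (base-2 logarithms, so the units are bits) gives roughly H(P) ≈ 0.47, H(P,Q) ≈ 0.80, and D(P\|Q) ≈ 0.33, and confirms that the gap is exactly the KL divergence:

    import math

    H_P = -0.9 * math.log2(0.9) - 0.1 * math.log2(0.1)               # ~0.469 bits
    H_PQ = -0.9 * math.log2(0.6) - 0.1 * math.log2(0.4)              # ~0.795 bits
    D_PQ = 0.9 * math.log2(0.9 / 0.6) + 0.1 * math.log2(0.1 / 0.4)   # ~0.326 bits

    assert abs(H_PQ - (H_P + D_PQ)) < 1e-12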

8 Computation Lens

When you see one of these quantities in a paper or objective, ask:

  1. which distribution is the truth or data-generating distribution?
  2. which distribution is the model or approximation?
  3. are we measuring intrinsic uncertainty, or mismatch, or both?
  4. are the logarithms in bits or nats?
  5. is the objective really a cross-entropy or negative log-likelihood in disguise?

Those questions usually decode the notation faster than expanding formulas mechanically.

9 Application Lens

9.1 Classification And Log-Loss

Cross-entropy is the standard loss for probabilistic classification because it rewards calibrated probability assignments, not just the final label decision.
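
A minimal sketch of that point, with purely illustrative numbers: for a one-hot label, the cross-entropy loss reduces to the negative log-probability assigned to the true class, so a confidently wrong prediction is penalized far more than an honestly uncertain one.

    import math

    def log_loss(true_class, predicted_probs):
        """Cross-entropy with a one-hot label: -log q(true class), in nats."""
        return -math.log(predicted_probs[true_class])

    # The true class is 0: an uncertain model versus a confidently wrong one.
    uncertain = [0.6, 0.4]
    confident_wrong = [0.05, 0.95]

    print(log_loss(0, uncertain))        # ~0.51 nats
    print(log_loss(0, confident_wrong))  # ~3.00 nats: misplaced confidence is costly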

9.2 Variational Inference And Approximation

KL divergence appears whenever one distribution is being approximated by another, especially in variational methods.

9.3 Coding And Compression

Entropy is the baseline limit for ideal coding, while cross-entropy and KL explain what happens when the code is optimized for the wrong distribution.

10 Stop Here For First Pass

If you can now explain:

  • what entropy measures
  • what cross-entropy measures
  • why KL divergence is not symmetric
  • why KL is always nonnegative
  • why H(P,Q)=H(P)+D(P\|Q) matters

then this page has done its job.

11 Optional Deeper Reading After First Pass

The strongest current references connected to this page are:

  • MIT 6.441 lecture notes - official lecture-note index covering entropy, divergence, mutual information, and coding. Checked 2026-04-25.
  • Stanford EE376A: Information Theory - official course page introducing entropy, mutual information, compression, and communication. Checked 2026-04-25.
  • Stanford EE376A lecture notes - official notes for the full first-course information-theory core. Checked 2026-04-25.
  • Stanford EE376A lecture 3 - official notes focused on entropy, relative entropy, and mutual information. Checked 2026-04-25.
  • Stanford EE377 bulletin - official current course description connecting information theory with probability and statistics. Checked 2026-04-25.
