Entropy, Cross-Entropy, and KL Divergence
entropy, cross-entropy, KL divergence, relative entropy, log-loss
1 Role
This is the first page of the Information Theory module.
Its job is to introduce the three information measures that will keep reappearing everywhere else:
- entropy
- cross-entropy
- KL divergence
These are the smallest reusable language pieces for uncertainty, mismatch, coding cost, and log-loss.
2 First-Pass Promise
Read this page first in the module.
If you stop here, you should still understand:
- what entropy measures
- what cross-entropy measures
- why KL divergence is a gap between two distributions
- why cross-entropy equals entropy plus KL divergence
3 Why It Matters
These quantities appear under many different names across the site:
- uncertainty
- information content
- coding cost
- negative log-likelihood
- log-loss
- regularized or variational objectives
Without one clean picture, it is easy to memorize formulas but miss the structure.
This page gives that structure.
At a first pass:
- entropy measures how uncertain a distribution is
- cross-entropy measures how expensive it is to code or predict data from P using a model Q
- KL divergence measures the extra cost of using Q when the truth is P
That last sentence is one of the most reusable interpretations in modern ML and statistics.
4 Prerequisite Recall
- a discrete distribution assigns probabilities p(x) to outcomes x
- expectation under P means averaging with respect to the true distribution P
- logarithm base 2 gives units in bits, while the natural logarithm gives units in nats
- negative log-probability acts like a surprise or coding length
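The last two recall items can be checked numerically. A minimal sketch of surprise in bits versus nats, using an illustrative probability:

```python
import math

p = 0.25  # an illustrative outcome probability

surprise_bits = -math.log2(p)  # coding length in bits
surprise_nats = -math.log(p)   # the same quantity in nats

print(surprise_bits)                 # 2.0: a 1-in-4 event costs two bits
print(surprise_nats / math.log(2))   # converting nats to bits recovers 2.0
```

The only difference between the two units is the constant factor log 2.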
5 Intuition
5.1 Entropy Measures Intrinsic Uncertainty
If a distribution is concentrated on one outcome, there is little uncertainty.
If it spreads mass across many outcomes, there is more uncertainty.
Entropy summarizes that uncertainty into one number.
5.2 Cross-Entropy Measures Prediction Or Coding Under Mismatch
Suppose the world really generates data from P, but we use a model Q.
Then cross-entropy asks:
how costly is it, on average, to encode or predict data from P as if Q were correct?
So cross-entropy is not a property of one distribution alone. It is a mismatch quantity.
5.3 KL Divergence Is The Extra Cost Of Mismatch
KL divergence compares P and Q directly.
It is zero exactly when the two distributions agree, and positive otherwise.
At a first pass, the right interpretation is:
KL divergence is the extra expected log-loss caused by using Q instead of the truth P
5.4 Cross-Entropy Splits Cleanly Into Entropy Plus KL
This is the key identity:
cross-entropy = entropy + KL divergence
So cross-entropy contains:
- the irreducible uncertainty already present in P
- the extra penalty from model mismatch
That decomposition is why these three quantities belong on one opening page.
6 Formal Core
For this first pass, we stay with discrete distributions.
Definition 1 (Definition: Entropy) For a discrete random variable X with distribution P, the entropy is
\[ H(P)= - \sum_x p(x)\log p(x). \]
Definition 2 (Definition: Cross-Entropy) For discrete distributions P and Q on the same support, the cross-entropy of P relative to Q is
\[ H(P,Q)= -\sum_x p(x)\log q(x). \]
Definition 3 (Definition: KL Divergence) The KL divergence, or relative entropy, from P to Q is
\[ D(P\|Q)= \sum_x p(x)\log\frac{p(x)}{q(x)}. \]
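The three definitions translate directly into code. A minimal sketch for discrete distributions given as lists of probabilities, in nats (the function names are ours, not from any particular library):

```python
import math

def entropy(p):
    """H(P) = -sum_x p(x) log p(x); terms with p(x) = 0 contribute nothing."""
    return -sum(px * math.log(px) for px in p if px > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum_x p(x) log q(x)."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q):
    """D(P || Q) = sum_x p(x) log(p(x) / q(x))."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)
```

Swapping `math.log` for `math.log2` throughout would report all three quantities in bits instead.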
Theorem 1 (Theorem Idea: Cross-Entropy Decomposes Into Entropy Plus KL) For discrete distributions P and Q,
\[ H(P,Q)=H(P)+D(P\|Q). \]
This identity is one of the central algebraic facts of the whole module.
Theorem 2 (Theorem Idea: Gibbs Inequality) For discrete distributions P and Q,
\[ D(P\|Q)\ge 0, \]
with equality if and only if P=Q on the support of P.
So the true distribution minimizes cross-entropy against itself.
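That minimization claim can be checked numerically. A small sketch for two-outcome distributions, sweeping candidate models Q against a fixed truth P (the helper name is ours):

```python
import math

def cross_entropy_2(p, q):
    # H(P, Q) for P = (p, 1-p) and Q = (q, 1-q), in nats
    return -(p * math.log(q) + (1 - p) * math.log(1 - q))

p = 0.9  # the fixed "truth"

# sweep candidate models Q over a grid and record the cross-entropy of each
candidates = [i / 100 for i in range(1, 100)]
costs = [cross_entropy_2(p, q) for q in candidates]

best_q = candidates[costs.index(min(costs))]
print(best_q)  # the sweep bottoms out at q = 0.9, i.e. at Q = P
```

By Gibbs' inequality the minimum of H(P, Q) over Q is attained exactly at Q = P, where the cross-entropy equals the entropy H(P).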
7 Worked Example
Let
\[ P=(0.9,0.1), \qquad Q=(0.6,0.4). \]
Then the entropy of P is
\[ H(P)= -0.9\log 0.9 - 0.1\log 0.1. \]
The cross-entropy of P relative to Q is
\[ H(P,Q)= -0.9\log 0.6 - 0.1\log 0.4. \]
And the KL divergence is
\[ D(P\|Q)=0.9\log \frac{0.9}{0.6} + 0.1\log \frac{0.1}{0.4}. \]
You do not need the exact decimal values to see the structure:
- P is quite concentrated, so its entropy is not very large
- Q is less concentrated and mismatched to P
- that mismatch makes cross-entropy larger than entropy
- the difference is exactly the KL divergence
This is the cleanest first example of:
irreducible uncertainty + mismatch penalty
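Filling in the decimals (in nats) confirms the decomposition. A minimal computation of the worked example:

```python
import math

P = [0.9, 0.1]
Q = [0.6, 0.4]

H_P  = -sum(p * math.log(p) for p in P)                # entropy of P
H_PQ = -sum(p * math.log(q) for p, q in zip(P, Q))     # cross-entropy of P relative to Q
D_PQ = sum(p * math.log(p / q) for p, q in zip(P, Q))  # KL divergence from P to Q

print(round(H_P, 4), round(H_PQ, 4), round(D_PQ, 4))
# roughly 0.3251 0.5514 0.2263: cross-entropy = entropy + KL
```

The gap between 0.5514 and 0.3251 is exactly the mismatch penalty D(P‖Q).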
8 Computation Lens
When you see one of these quantities in a paper or objective, ask:
- which distribution is the truth or data-generating distribution?
- which distribution is the model or approximation?
- are we measuring intrinsic uncertainty, or mismatch, or both?
- are the logarithms in bits or nats?
- is the objective really a cross-entropy or negative log-likelihood in disguise?
Those questions usually decode the notation faster than expanding formulas mechanically.
9 Application Lens
9.1 Classification And Log-Loss
Cross-entropy is the standard loss for probabilistic classification because it rewards calibrated probability assignments, not just the final label decision.
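To see the reward for calibration, a small sketch comparing two predictions that both pick the correct label (the probabilities are illustrative):

```python
import math

# true label is class 0, so the empirical "P" is one-hot: (1, 0)
confident = [0.95, 0.05]  # model 1: correct and confident
hedging   = [0.60, 0.40]  # model 2: correct but hedging

# with a one-hot P, cross-entropy reduces to -log q(true class)
loss_confident = -math.log(confident[0])
loss_hedging   = -math.log(hedging[0])

print(loss_confident < loss_hedging)  # True: same label decision, lower log-loss
```

Both models make the same final decision, yet the one that assigned higher probability to the truth pays a strictly smaller log-loss; a 0-1 label accuracy could not distinguish them.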
9.2 Variational Inference And Approximation
KL divergence appears whenever one distribution is being approximated by another, especially in variational methods.
9.3 Coding And Compression
Entropy is the baseline limit for ideal coding, while cross-entropy and KL explain what happens when the code is optimized for the wrong distribution.
10 Stop Here For First Pass
If you can now explain:
- what entropy measures
- what cross-entropy measures
- why KL divergence is not symmetric
- why KL is always nonnegative
- why H(P,Q) = H(P) + D(P\|Q) matters
then this page has done its job.
11 Go Deeper
The next natural step in this module is:
The strongest adjacent live pages right now are:
12 Optional Deeper Reading After First Pass
The strongest current references connected to this page are:
- MIT 6.441 lecture notes - official lecture-note index covering entropy, divergence, mutual information, and coding. Checked 2026-04-25.
- Stanford EE376A: Information Theory - official course page introducing entropy, mutual information, compression, and communication. Checked 2026-04-25.
- Stanford EE376A lecture notes - official notes for the full first-course information-theory core. Checked 2026-04-25.
- Stanford EE376A lecture 3 - official notes focused on entropy, relative entropy, and mutual information. Checked 2026-04-25.
- Stanford EE377 bulletin - official current course description connecting information theory with probability and statistics. Checked 2026-04-25.
13 Sources and Further Reading
- MIT 6.441 lecture notes - First pass - official notes index for entropy, divergence, mutual information, and coding. Checked 2026-04-25.
- Stanford EE376A: Information Theory - First pass - official course page for the first-course picture of uncertainty, compression, and communication. Checked 2026-04-25.
- Stanford EE376A lecture notes - Second pass - official notes for a complete first pass through the field. Checked 2026-04-25.
- Stanford EE376A lecture 3 - Second pass - official notes focused on entropy, relative entropy, and mutual information. Checked 2026-04-25.
- Stanford EE377 bulletin - Second pass - official current description of information-theoretic tools in probability and statistics. Checked 2026-04-25.