Variational Objectives, ELBO, and Information Bounds
ELBO, variational inference, VAE, information bottleneck, variational bound
1 Role
This is the sixth page of the Information Theory module.
Its job is to show how classical information quantities reappear inside modern ML objectives:
- KL divergence as a regularizer
- mutual-information-like tradeoffs as compression language
- tractable lower or upper bounds replacing intractable exact quantities
This is the bridge from information theory into variational inference, VAEs, and bottleneck-style objectives.
2 First-Pass Promise
Read this page after Rate-Distortion and Representation Tradeoffs.
If you stop here, you should still understand:
- why latent-variable learning often leads to an intractable posterior
- what the ELBO is
- why ELBO turns inference into optimization
- how KL-based penalties can act as information bounds in representation learning
3 Why It Matters
Many modern ML objectives look mysterious until you notice the same few ingredients repeating:
- expected log-likelihood or reconstruction
- KL divergence between an encoder and a prior
- a lower bound on an intractable log-marginal likelihood
- a compression-style penalty on a latent representation
This is the variational viewpoint.
At a first pass:
- exact posterior inference is often too hard
- variational inference replaces the true posterior with a tractable approximation
- ELBO gives an objective we can actually optimize
- information bounds explain why KL penalties often behave like representation bottlenecks
So this page is where information theory stops being only about coding theorems and becomes part of practical generative-model training.
4 Prerequisite Recall
- KL divergence measures mismatch between distributions
- mutual information measures retained dependence
- rate-distortion balanced fidelity against compression
- in a latent-variable model, the hard object is usually the posterior p(z|x), not just the prior or likelihood
5 Intuition
5.1 Latent Variables Give Expressive Models But Hard Posteriors
Suppose we model data with a latent variable z:
\[ p_\theta(x,z)=p(z)p_\theta(x|z). \]
This can make the model expressive, but the posterior
\[ p_\theta(z|x) \]
is often intractable.
So the real problem becomes:
how do we learn and infer when exact posterior computation is unavailable?
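The source of the difficulty is worth making explicit. By Bayes' rule,
\[ p_\theta(z|x)=\frac{p(z)\,p_\theta(x|z)}{\int p(z')\,p_\theta(x|z')\,dz'}, \]
and the numerator is easy to evaluate pointwise, but the denominator is the marginal likelihood p_\theta(x), an integral over all latent configurations that has no closed form for most interesting decoders.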
5.2 Variational Inference Replaces The Hard Posterior With A Tractable Family
We introduce an approximate posterior q_\phi(z|x) chosen from a manageable family.
Then we optimize over \phi and model parameters \theta instead of trying to compute the exact posterior directly.
This converts inference into optimization.
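To make "manageable family" concrete, here is a minimal sketch of one common choice: diagonal Gaussians whose mean and log-variance are functions of x. The linear maps standing in for an encoder network, and all names below, are hypothetical placeholders, not a prescribed implementation.

import numpy as np

# Hypothetical variational family: q_phi(z|x) = N(mu_phi(x), diag(exp(logvar_phi(x)))),
# where phi is a pair of linear maps standing in for a real encoder network.
class DiagonalGaussianPosterior:
    def __init__(self, x_dim, z_dim, rng):
        self.W_mu = 0.1 * rng.standard_normal((z_dim, x_dim))
        self.W_logvar = 0.1 * rng.standard_normal((z_dim, x_dim))

    def params(self, x):
        # Mean and log-variance of q_phi(z|x) for a single input x.
        return self.W_mu @ x, self.W_logvar @ x

    def sample(self, x, rng):
        # Draw z ~ q_phi(z|x) by reparameterization: z = mu + sigma * eps.
        mu, logvar = self.params(x)
        return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)

rng = np.random.default_rng(0)
q = DiagonalGaussianPosterior(x_dim=4, z_dim=2, rng=rng)
print(q.sample(np.array([1.0, 0.0, 1.0, 1.0]), rng))

Optimizing over \phi then means adjusting these maps so that q_\phi(z|x) moves toward the true posterior, rather than computing the posterior directly.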
5.3 ELBO Is A Lower Bound On Log Evidence
The log marginal likelihood \log p_\theta(x) is the quantity we would like to optimize, but it is often hard to evaluate.
The ELBO gives a tractable surrogate that always sits below it.
So the first-pass picture is:
maximize a lower bound that becomes tight when the variational posterior matches the true posterior
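One quick way to see why the bound holds, using only Jensen's inequality and the quantities above:
\[ \log p_\theta(x) = \log \mathbb{E}_{q_\phi(z|x)}\!\left[\frac{p_\theta(x,z)}{q_\phi(z|x)}\right] \ge \mathbb{E}_{q_\phi(z|x)}\!\left[\log\frac{p_\theta(x,z)}{q_\phi(z|x)}\right], \]
with equality essentially when the ratio inside the expectation is constant in z, which happens exactly when q_\phi(z|x) equals the true posterior p_\theta(z|x).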
5.4 KL Penalties Often Behave Like Information Penalties
In VAE-like objectives, the KL term
\[ KL(q_\phi(z|x)\|p(z)) \]
discourages the latent representation from carrying arbitrarily much specific information about x.
That is why these objectives often feel like:
reconstruction pressure versus compression pressure
This is exactly the same qualitative shape we saw in rate-distortion tradeoffs.
6 Formal Core
Definition 1 (Definition: Evidence Lower Bound) For a latent-variable model with variational posterior q_\phi(z|x),
\[ \log p_\theta(x) = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x,z)-\log q_\phi(z|x)\right] + KL\!\left(q_\phi(z|x)\,\|\,p_\theta(z|x)\right). \]
The ELBO is the first term:
\[ \mathcal{L}(x;\theta,\phi) = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x,z)-\log q_\phi(z|x)\right]. \]
Since the KL term is nonnegative, the ELBO is a lower bound on \log p_\theta(x).
Theorem 1 (Theorem Idea: ELBO Is Tight When The Variational Posterior Is Exact) Because
\[ \log p_\theta(x)=\mathcal{L}(x;\theta,\phi)+KL(q_\phi(z|x)\|p_\theta(z|x)), \]
the bound is tight exactly when
\[ q_\phi(z|x)=p_\theta(z|x). \]
So maximizing ELBO is a way to both fit the model and pull the approximate posterior toward the true one.
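A minimal numeric sanity check of this identity, on a made-up discrete model where every quantity can be computed by exact summation; the probability tables below are arbitrary illustrative numbers, not from any real model.

import numpy as np

# Toy discrete model: z in {0, 1, 2}, x in {0, 1}. All tables are arbitrary numbers.
p_z = np.array([0.5, 0.3, 0.2])
p_x_given_z = np.array([[0.9, 0.1],    # p(x | z=0)
                        [0.4, 0.6],    # p(x | z=1)
                        [0.2, 0.8]])   # p(x | z=2)
q_z_given_x = np.array([[0.6, 0.3, 0.1],   # q(z | x=0)
                        [0.2, 0.3, 0.5]])  # q(z | x=1)

x = 1
p_xz = p_z * p_x_given_z[:, x]      # joint p(x, z) over z, for this fixed x
log_px = np.log(p_xz.sum())         # exact log evidence
post = p_xz / p_xz.sum()            # exact posterior p(z | x)

q = q_z_given_x[x]
elbo = np.sum(q * (np.log(p_xz) - np.log(q)))     # E_q[log p(x,z) - log q(z|x)]
gap = np.sum(q * (np.log(q) - np.log(post)))      # KL(q(z|x) || p(z|x)) >= 0

print(log_px, elbo + gap)   # identical up to floating point
print(elbo <= log_px)       # True: the ELBO sits below the log evidence

Setting q_z_given_x[x] equal to the exact posterior drives the gap to zero, which is the tightness statement in the theorem above.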
Theorem 2 (Theorem Idea: Standard VAE Decomposition) If
\[ p_\theta(x,z)=p(z)p_\theta(x|z), \]
then the ELBO becomes
\[ \mathcal{L}(x;\theta,\phi) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - KL(q_\phi(z|x)\|p(z)). \]
At first pass, read this as:
- reconstruction or data fit term
- minus a complexity or compression term
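A minimal sketch of how this form is typically estimated in practice, assuming a diagonal-Gaussian q_\phi(z|x), a standard normal prior, and a Bernoulli decoder; the encoder outputs and the toy linear decoder below are hypothetical placeholders rather than a trained model.

import numpy as np

rng = np.random.default_rng(0)

def gaussian_kl_to_standard_normal(mu, logvar):
    # Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over dimensions.
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

def elbo_estimate(x, mu, logvar, decoder_logits_fn, n_samples=16):
    # Reconstruction term E_q[log p(x|z)] by Monte Carlo with the reparameterization
    # z = mu + sigma * eps, minus the analytic KL-to-prior term.
    recon = 0.0
    for _ in range(n_samples):
        eps = rng.standard_normal(mu.shape)
        z = mu + np.exp(0.5 * logvar) * eps
        logits = decoder_logits_fn(z)
        # Bernoulli log-likelihood of binary x under logits: x*l - log(1 + e^l).
        recon += np.sum(x * logits - np.log1p(np.exp(logits)))
    recon /= n_samples
    return recon - gaussian_kl_to_standard_normal(mu, logvar)

# Hypothetical encoder outputs and a toy linear decoder, only to exercise the formula.
x = np.array([1.0, 0.0, 1.0, 1.0])
mu, logvar = np.array([0.3, -0.2]), np.array([-1.0, -0.5])
W = rng.standard_normal((4, 2))
print(elbo_estimate(x, mu, logvar, lambda z: W @ z))

The KL term has a closed form for this family, and the reconstruction term is estimated by sampling z through the reparameterization, which is what makes the objective trainable with stochastic gradients.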
Theorem 3 (Theorem Idea: KL To A Reference Prior Gives An Information Upper Bound) Let q(z)=\int q(z|x)p(x)\,dx be the aggregate latent marginal. For any reference distribution r(z),
\[ \mathbb{E}_{p(x)}KL(q(z|x)\|r(z)) = I(X;Z)+KL(q(z)\|r(z)). \]
Therefore,
\[ I(X;Z)\le \mathbb{E}_{p(x)}KL(q(z|x)\|r(z)). \]
This is one of the cleanest first-pass explanations for why KL penalties can act as tractable information bounds.
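A quick numeric check of the identity on a made-up discrete encoder; the tables are arbitrary and only serve to confirm that the averaged KL-to-prior splits into I(X;Z) plus the marginal mismatch term.

import numpy as np

def kl(p, q):
    # KL divergence between two discrete distributions given as arrays.
    return np.sum(p * (np.log(p) - np.log(q)))

# Toy discrete encoder: x in {0, 1} with marginal p(x); z in {0, 1, 2}. Arbitrary tables.
p_x = np.array([0.4, 0.6])
q_z_given_x = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.3, 0.6]])
r_z = np.array([1/3, 1/3, 1/3])            # reference prior r(z)

q_z = p_x @ q_z_given_x                    # aggregate latent marginal q(z)
mi = sum(p_x[i] * kl(q_z_given_x[i], q_z) for i in range(2))          # I(X; Z)
avg_kl = sum(p_x[i] * kl(q_z_given_x[i], r_z) for i in range(2))      # E_x KL(q(z|x) || r(z))

print(avg_kl, mi + kl(q_z, r_z))   # equal: the identity above
print(mi <= avg_kl)                # True: KL-to-prior upper-bounds I(X; Z)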
7 Worked Example
Take a VAE-style latent model for images.
The objective for one input x looks like:
\[ \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - KL(q_\phi(z|x)\|p(z)). \]
Now read the two terms qualitatively:
- if the reconstruction term dominates, the model is pushed to store more detail about x in z
- if the KL term dominates, the model is pushed to keep z closer to a simple prior and therefore more compressed
So the model is balancing:
fidelity to x versus simplicity or compressibility of z
That is why ELBO-style training feels so close to rate-distortion language.
8 Computation Lens
When a paper shows an ELBO or variational bound, ask:
- what exact quantity is intractable?
- what variational family is being introduced?
- what is the bound direction: lower bound on likelihood, upper bound on information, or both?
- what do the likelihood or reconstruction term and the KL term each encourage?
- is the KL penalty acting only as regularization, or also as a proxy for a compression constraint?
Those questions usually decode the objective faster than reading the notation line by line.
9 Application Lens
9.1 Variational Inference
ELBO is the standard route for turning approximate Bayesian inference in latent-variable models into an optimization problem.
9.2 Variational Autoencoders
VAE objectives are the most visible modern example: reconstruction plus KL-to-prior, trained end to end with stochastic gradients.
9.3 Information Bottleneck And Representation Learning
When a model is asked to preserve task-relevant information while staying compressed, variational upper or lower bounds make the objective tractable even when the exact mutual information is hard to compute.
10 Stop Here For First Pass
If you stop here, retain these five ideas:
- latent-variable models often make exact posterior inference intractable
- ELBO is a tractable lower bound on log evidence
- ELBO becomes tight when the variational posterior matches the true posterior
- in VAE form, ELBO looks like reconstruction minus KL regularization
- KL-to-prior penalties can upper-bound retained information and therefore act like compression terms
That is enough to read many variational objectives in modern ML papers without getting lost.
11 Go Deeper
The natural next step is a second pass through the primary sources listed below; the closest companion among the earlier pages is Rate-Distortion and Representation Tradeoffs.
12 Optional Deeper Reading After First Pass
If you want a stronger second pass on the same ideas, use:
- Stanford CS236 lecture 5 for a concise official treatment of latent-variable models and ELBO. Checked 2026-04-25.
- Stanford CS236 lecture 6 for variational inference and learning in deep generative models. Checked 2026-04-25.
- Auto-Encoding Variational Bayes for the primary modern VAE reference. Checked 2026-04-25.
- Deep Variational Information Bottleneck for a primary reference on variational information-bottleneck style objectives. Checked 2026-04-25.
13 Sources and Further Reading
- Stanford CS236 lecture 5 - First pass - official Stanford slide deck for latent-variable models and the ELBO viewpoint. Checked 2026-04-25.
- Stanford CS236 lecture 6 - First pass - official Stanford slide deck for variational inference in deep generative models. Checked 2026-04-25.
- Auto-Encoding Variational Bayes - Second pass - primary paper that made ELBO-style latent-variable training scalable and practical. Checked 2026-04-25.
- Deep Variational Information Bottleneck - Second pass - primary paper showing how variational approximations can make bottleneck objectives tractable. Checked 2026-04-25.