Variational Objectives, ELBO, and Information Bounds
ELBO, variational inference, VAE, information bottleneck, variational bound
1 Role
This is the sixth page of the Information Theory module.
Its job is to show how classical information quantities reappear inside modern ML objectives:
- KL divergence as a regularizer
- mutual-information-like tradeoffs as compression language
- tractable lower or upper bounds replacing intractable exact quantities
This is the bridge from information theory into variational inference, VAEs, and bottleneck-style objectives.
2 First-Pass Promise
Read this page after Rate-Distortion and Representation Tradeoffs.
If you stop here, you should still understand:
- why latent-variable learning often leads to an intractable posterior
- what the ELBO is
- why ELBO turns inference into optimization
- how KL-based penalties can act as information bounds in representation learning
3 Why It Matters
Many modern ML objectives look mysterious until you notice the same few ingredients repeating:
- expected log-likelihood or reconstruction
- KL divergence between an encoder and a prior
- a lower bound on an intractable log-marginal likelihood
- a compression-style penalty on a latent representation
This is the variational viewpoint.
At a first pass:
- exact posterior inference is often too hard
- variational inference replaces the true posterior with a tractable approximation
- ELBO gives an objective we can actually optimize
- information bounds explain why KL penalties often behave like representation bottlenecks
So this page is where information theory stops being only about coding theorems and becomes part of practical generative-model training.
4 Prerequisite Recall
- KL divergence measures mismatch between distributions
- mutual information measures retained dependence
- rate-distortion balanced fidelity against compression
- in a latent-variable model, the hard object is usually the posterior p(z|x), not just the prior or likelihood
5 Intuition
5.1 Latent Variables Give Expressive Models But Hard Posteriors
Suppose we model data with a latent variable z:
\[ p_\theta(x,z)=p(z)p_\theta(x|z). \]
This can make the model expressive, but the posterior
\[ p_\theta(z|x) \]
is often intractable.
So the real problem becomes:
how do we learn and infer when exact posterior computation is unavailable?
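The source of the difficulty is worth making explicit. By Bayes' rule,
\[ p_\theta(z|x)=\frac{p(z)\,p_\theta(x|z)}{\int p(z')\,p_\theta(x|z')\,dz'}, \]
and the numerator is easy to evaluate pointwise, but the denominator is the marginal likelihood p_\theta(x), an integral over all latent configurations that has no closed form for most interesting decoders.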
5.2 Variational Inference Replaces The Hard Posterior With A Tractable Family
We introduce an approximate posterior q_\phi(z|x) chosen from a manageable family.
Then we optimize over \phi and model parameters \theta instead of trying to compute the exact posterior directly.
This converts inference into optimization.
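To make "manageable family" concrete, here is a minimal sketch of one common choice: diagonal Gaussians whose mean and log-variance are functions of x. The linear maps standing in for an encoder network, and all names below, are hypothetical placeholders, not a prescribed implementation.

import numpy as np

# Hypothetical variational family: q_phi(z|x) = N(mu_phi(x), diag(exp(logvar_phi(x)))),
# where phi is a pair of linear maps standing in for a real encoder network.
class DiagonalGaussianPosterior:
    def __init__(self, x_dim, z_dim, rng):
        self.W_mu = 0.1 * rng.standard_normal((z_dim, x_dim))
        self.W_logvar = 0.1 * rng.standard_normal((z_dim, x_dim))

    def params(self, x):
        # Mean and log-variance of q_phi(z|x) for a single input x.
        return self.W_mu @ x, self.W_logvar @ x

    def sample(self, x, rng):
        # Draw z ~ q_phi(z|x) by reparameterization: z = mu + sigma * eps.
        mu, logvar = self.params(x)
        return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)

rng = np.random.default_rng(0)
q = DiagonalGaussianPosterior(x_dim=4, z_dim=2, rng=rng)
print(q.sample(np.array([1.0, 0.0, 1.0, 1.0]), rng))

Optimizing over \phi then means adjusting these maps so that q_\phi(z|x) moves toward the true posterior, rather than computing the posterior directly.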
5.3 ELBO Is A Lower Bound On Log Evidence
The log marginal likelihood \log p_\theta(x) is the quantity we would like to optimize, but it is often hard to evaluate.
The ELBO gives a tractable surrogate that always sits below it.
So the first-pass picture is:
maximize a lower bound that becomes tight when the variational posterior matches the true posterior
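One quick way to see why the bound holds, using only Jensen's inequality and the quantities above:
\[ \log p_\theta(x) = \log \mathbb{E}_{q_\phi(z|x)}\!\left[\frac{p_\theta(x,z)}{q_\phi(z|x)}\right] \ge \mathbb{E}_{q_\phi(z|x)}\!\left[\log\frac{p_\theta(x,z)}{q_\phi(z|x)}\right], \]
with equality essentially when the ratio inside the expectation is constant in z, which happens exactly when q_\phi(z|x) equals the true posterior p_\theta(z|x).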
5.4 KL Penalties Often Behave Like Information Penalties
In VAE-like objectives, the KL term
\[ KL(q_\phi(z|x)\|p(z)) \]
discourages the latent representation from carrying arbitrarily much specific information about x.
That is why these objectives often feel like:
reconstruction pressure versus compression pressure
This is exactly the same qualitative shape we saw in rate-distortion tradeoffs.
6 Formal Core
Definition 1 (Definition: Evidence Lower Bound) For a latent-variable model with variational posterior q_\phi(z|x),
\[ \log p_\theta(x) = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x,z)-\log q_\phi(z|x)\right] + KL\!\left(q_\phi(z|x)\,\|\,p_\theta(z|x)\right). \]
The ELBO is the first term:
\[ \mathcal{L}(x;\theta,\phi) = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x,z)-\log q_\phi(z|x)\right]. \]
Since the KL term is nonnegative, the ELBO is a lower bound on \log p_\theta(x).
Theorem 1 (Theorem Idea: ELBO Is Tight When The Variational Posterior Is Exact) Because
\[ \log p_\theta(x)=\mathcal{L}(x;\theta,\phi)+KL(q_\phi(z|x)\|p_\theta(z|x)), \]
the bound is tight exactly when
\[ q_\phi(z|x)=p_\theta(z|x). \]
So maximizing ELBO is a way to both fit the model and pull the approximate posterior toward the true one.
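A minimal numeric sanity check of this identity, on a made-up discrete model where every quantity can be computed by exact summation; the probability tables below are arbitrary illustrative numbers, not from any real model.

import numpy as np

# Toy discrete model: z in {0, 1, 2}, x in {0, 1}. All tables are arbitrary numbers.
p_z = np.array([0.5, 0.3, 0.2])
p_x_given_z = np.array([[0.9, 0.1],    # p(x | z=0)
                        [0.4, 0.6],    # p(x | z=1)
                        [0.2, 0.8]])   # p(x | z=2)
q_z_given_x = np.array([[0.6, 0.3, 0.1],   # q(z | x=0)
                        [0.2, 0.3, 0.5]])  # q(z | x=1)

x = 1
p_xz = p_z * p_x_given_z[:, x]      # joint p(x, z) over z, for this fixed x
log_px = np.log(p_xz.sum())         # exact log evidence
post = p_xz / p_xz.sum()            # exact posterior p(z | x)

q = q_z_given_x[x]
elbo = np.sum(q * (np.log(p_xz) - np.log(q)))     # E_q[log p(x,z) - log q(z|x)]
gap = np.sum(q * (np.log(q) - np.log(post)))      # KL(q(z|x) || p(z|x)) >= 0

print(log_px, elbo + gap)   # identical up to floating point
print(elbo <= log_px)       # True: the ELBO sits below the log evidence

Setting q_z_given_x[x] equal to the exact posterior drives the gap to zero, which is the tightness statement in the theorem above.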
Theorem 2 (Theorem Idea: Standard VAE Decomposition) If
\[ p_\theta(x,z)=p(z)p_\theta(x|z), \]
then the ELBO becomes
\[ \mathcal{L}(x;\theta,\phi) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - KL(q_\phi(z|x)\|p(z)). \]
At first pass, read this as:
- reconstruction or data fit term
- minus a complexity or compression term
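A minimal sketch of how this form is typically estimated in practice, assuming a diagonal-Gaussian q_\phi(z|x), a standard normal prior, and a Bernoulli decoder; the encoder outputs and the toy linear decoder below are hypothetical placeholders rather than a trained model.

import numpy as np

rng = np.random.default_rng(0)

def gaussian_kl_to_standard_normal(mu, logvar):
    # Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over dimensions.
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

def elbo_estimate(x, mu, logvar, decoder_logits_fn, n_samples=16):
    # Reconstruction term E_q[log p(x|z)] by Monte Carlo with the reparameterization
    # z = mu + sigma * eps, minus the analytic KL-to-prior term.
    recon = 0.0
    for _ in range(n_samples):
        eps = rng.standard_normal(mu.shape)
        z = mu + np.exp(0.5 * logvar) * eps
        logits = decoder_logits_fn(z)
        # Bernoulli log-likelihood of binary x under logits: x*l - log(1 + e^l).
        recon += np.sum(x * logits - np.log1p(np.exp(logits)))
    recon /= n_samples
    return recon - gaussian_kl_to_standard_normal(mu, logvar)

# Hypothetical encoder outputs and a toy linear decoder, only to exercise the formula.
x = np.array([1.0, 0.0, 1.0, 1.0])
mu, logvar = np.array([0.3, -0.2]), np.array([-1.0, -0.5])
W = rng.standard_normal((4, 2))
print(elbo_estimate(x, mu, logvar, lambda z: W @ z))

The KL term has a closed form for this family, and the reconstruction term is estimated by sampling z through the reparameterization, which is what makes the objective trainable with stochastic gradients.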
Theorem 3 (Theorem Idea: KL To A Reference Prior Gives An Information Upper Bound) Let q(z)=\int q(z|x)p(x)\,dx be the aggregate latent marginal. For any reference distribution r(z),
\[ \mathbb{E}_{p(x)}KL(q(z|x)\|r(z)) = I(X;Z)+KL(q(z)\|r(z)). \]
Therefore,
\[ I(X;Z)\le \mathbb{E}_{p(x)}KL(q(z|x)\|r(z)). \]
This is one of the cleanest first-pass explanations for why KL penalties can act as tractable information bounds.
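A quick numeric check of the identity on a made-up discrete encoder; the tables are arbitrary and only serve to confirm that the averaged KL-to-prior splits into I(X;Z) plus the marginal mismatch term.

import numpy as np

def kl(p, q):
    # KL divergence between two discrete distributions given as arrays.
    return np.sum(p * (np.log(p) - np.log(q)))

# Toy discrete encoder: x in {0, 1} with marginal p(x); z in {0, 1, 2}. Arbitrary tables.
p_x = np.array([0.4, 0.6])
q_z_given_x = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.3, 0.6]])
r_z = np.array([1/3, 1/3, 1/3])            # reference prior r(z)

q_z = p_x @ q_z_given_x                    # aggregate latent marginal q(z)
mi = sum(p_x[i] * kl(q_z_given_x[i], q_z) for i in range(2))          # I(X; Z)
avg_kl = sum(p_x[i] * kl(q_z_given_x[i], r_z) for i in range(2))      # E_x KL(q(z|x) || r(z))

print(avg_kl, mi + kl(q_z, r_z))   # equal: the identity above
print(mi <= avg_kl)                # True: KL-to-prior upper-bounds I(X; Z)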
7 Worked Example
Take a VAE-style latent model for images.
The objective for one input x looks like:
\[ \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - KL(q_\phi(z|x)\|p(z)). \]
Now read the two terms qualitatively:
- if the reconstruction term dominates, the model is pushed to store more detail about x in z
- if the KL term dominates, the model is pushed to keep z closer to a simple prior and therefore more compressed
So the model is balancing:
fidelity to x versus simplicity or compressibility of z
That is why ELBO-style training feels so close to rate-distortion language.
8 Computation Lens
When a paper shows an ELBO or variational bound, ask:
- what exact quantity is intractable?
- what variational family is being introduced?
- what is the bound direction: lower bound on likelihood, upper bound on information, or both?
- what do the likelihood or reconstruction term and the KL term each encourage?
- is the KL penalty acting only as regularization, or also as a proxy for a compression constraint?
Those questions usually decode the objective faster than reading the notation line by line.
9 Application Lens
9.1 Variational Inference
ELBO is the standard route for turning approximate Bayesian inference in latent-variable models into an optimization problem.
9.2 Variational Autoencoders
VAE objectives are the most visible modern example: reconstruction plus KL-to-prior, trained end to end with stochastic gradients.
9.3 Information Bottleneck And Representation Learning
When a model is asked to preserve task-relevant information while staying compressed, variational upper or lower bounds make the objective tractable even when the exact mutual information is hard to compute.
10 Stop Here For First Pass
If you stop here, retain these five ideas:
- latent-variable models often make exact posterior inference intractable
- ELBO is a tractable lower bound on log evidence
- ELBO becomes tight when the variational posterior matches the true posterior
- in VAE form, ELBO looks like reconstruction minus KL regularization
- KL-to-prior penalties can upper-bound retained information and therefore act like compression terms
That is enough to read many variational objectives in modern ML papers without getting lost.
11 Go Deeper
The natural next step is a second pass through the primary sources listed below; the closest companion among the earlier pages is Rate-Distortion and Representation Tradeoffs.
12 Optional Deeper Reading After First Pass
If you want a stronger second pass on the same ideas, use:
- Stanford CS236 lecture 5 for a concise official treatment of latent-variable models and ELBO. Checked 2026-04-25.
- Stanford CS236 lecture 6 for variational inference and learning in deep generative models. Checked 2026-04-25.
- Auto-Encoding Variational Bayes for the primary modern VAE reference. Checked 2026-04-25.
- Deep Variational Information Bottleneck for a primary reference on variational information-bottleneck style objectives. Checked 2026-04-25.
13 Sources and Further Reading
- Stanford CS236 lecture 5 - First pass - official Stanford slide deck for latent-variable models and the ELBO viewpoint. Checked 2026-04-25.
- Stanford CS236 lecture 6 - First pass - official Stanford slide deck for variational inference in deep generative models. Checked 2026-04-25.
- Auto-Encoding Variational Bayes - Second pass - primary paper that made ELBO-style latent-variable training scalable and practical. Checked 2026-04-25.
- Deep Variational Information Bottleneck - Second pass - primary paper showing how variational approximations can make bottleneck objectives tractable. Checked 2026-04-25.