Variational Inference, ELBO, and Tractable Approximation

A bridge page showing how intractable posterior inference becomes an optimization problem through variational approximations and ELBO-style objectives.
Modified: April 26, 2026

Keywords

variational-inference, ELBO, posterior-approximation, latent-variables, inference

1 Application Snapshot

Some inference problems are hard for a simple reason:

the posterior distribution you want is too expensive to compute exactly.

That happens quickly in latent-variable models, hierarchical Bayesian models, and modern generative models.

Variational inference responds by changing the task:

  • do not compute the exact posterior directly
  • choose a tractable family of approximate posteriors
  • optimize the best approximation inside that family

This is why approximate inference so often turns back into optimization.

2 Problem Setting

Suppose a model has latent variables \(z\) and observations \(x\):

\[ p_\theta(x,z)=p(z)p_\theta(x\mid z). \]

The exact posterior

\[ p_\theta(z\mid x) \]

is often the object we want, but it may be intractable because evaluating or normalizing it requires the marginal likelihood

\[ p_\theta(x)=\int p_\theta(x,z)\,dz. \]
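To make the cost concrete, here is a minimal sketch, assuming a toy nonlinear Gaussian decoder, of estimating that marginal likelihood by naive Monte Carlo over the prior. Everything in it (the decoder, the dimensions, the sample count) is an illustrative assumption; the point is only that \(p_\theta(x)\) generally has no closed form, and naive estimators of it become expensive and high-variance as the latent dimension grows.

```python
# Naive Monte Carlo estimate of p_theta(x) = E_{p(z)}[ p_theta(x | z) ].
# Toy nonlinear Gaussian model; all choices here are illustrative assumptions.
import math
import torch
from torch.distributions import Normal

z_dim, x_dim, n_samples = 2, 8, 10_000
W = torch.randn(x_dim, z_dim)                        # stand-in decoder weights (theta)
prior = Normal(torch.zeros(z_dim), torch.ones(z_dim))

def log_marginal_estimate(x):
    z = prior.sample((n_samples,))                   # z ~ p(z), shape (n_samples, z_dim)
    mean = torch.tanh(z @ W.T)                       # p_theta(x | z) = N(tanh(W z), I)
    log_px_given_z = Normal(mean, 1.0).log_prob(x).sum(-1)
    # log of the sample average of p_theta(x | z_i), computed stably in log space
    return torch.logsumexp(log_px_given_z, dim=0) - math.log(n_samples)

x = torch.randn(x_dim)                               # stand-in observation
print(log_marginal_estimate(x))
```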

Variational inference introduces a tractable approximation

\[ q_\phi(z\mid x), \]

then asks:

which member of the tractable family is closest to the true posterior?
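
A standard identity makes "closest" precise. For any choice of \(q_\phi\),

\[ \log p_\theta(x) = \mathbb{E}_{q_\phi(z\mid x)}\!\left[\log \frac{p_\theta(x,z)}{q_\phi(z\mid x)}\right] + \mathrm{KL}\!\left(q_\phi(z\mid x)\,\|\,p_\theta(z\mid x)\right). \]

The left-hand side does not depend on \(\phi\), so maximizing the first term, the ELBO, is the same as minimizing the KL divergence from the approximate posterior to the true one; and because that KL is nonnegative, the ELBO is always a lower bound on \(\log p_\theta(x)\).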

3 Why This Math Appears

This page sits on top of several site modules:

  • Statistics: posterior inference and latent-variable modeling
  • Optimization: turning approximation quality into an objective
  • Information Theory: KL divergence, mutual information, and tractable bounds
  • Signal Processing and Estimation: hidden variables and belief summaries
  • Machine Learning: VAEs, representation learning, and amortized inference

So variational inference is one of the clearest places where probability language and optimization language become the same workflow.

4 Math Objects In Use

  • latent variable \(z\)
  • observed data \(x\)
  • model parameters \(\theta\)
  • variational parameters \(\phi\)
  • approximate posterior \(q_\phi(z\mid x)\)
  • KL divergence between approximate and true distributions
  • ELBO as a computable surrogate objective

The structural point is simple:

  • exact posterior inference is the ideal target
  • ELBO is the tractable surrogate
  • optimization over \(\theta\) and \(\phi\) is the computational engine
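
Part of why the ELBO works as a tractable surrogate is that, for common modeling choices, its pieces have cheap closed forms. A minimal sketch, assuming a diagonal Gaussian approximate posterior and a standard normal prior (a common pairing, not a requirement), of the resulting closed-form KL term:

```python
# Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ) for a diagonal Gaussian q_phi(z | x)
# against a standard normal prior. The Gaussian pairing is an illustrative assumption.
import torch

def gaussian_kl_to_standard_normal(mu, log_sigma):
    # per dimension: 0.5 * (sigma^2 + mu^2 - 1 - log sigma^2), then summed over z
    return 0.5 * (torch.exp(2.0 * log_sigma) + mu.pow(2) - 1.0 - 2.0 * log_sigma).sum(-1)

mu, log_sigma = torch.zeros(2), torch.zeros(2)
print(gaussian_kl_to_standard_normal(mu, log_sigma))  # tensor(0.) when q equals the prior
```

Having this term available in closed form is one concrete reason simple variational families are attractive, and it foreshadows the worked walkthrough below.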

5 A Small Worked Walkthrough

Suppose you want to model images using a latent representation \(z\).

The ideal object is the posterior

\[ p_\theta(z\mid x), \]

because it tells you which latent explanations make a given image plausible.

But directly computing it may be too expensive.

So you introduce an encoder-like approximation

\[ q_\phi(z\mid x), \]

and optimize the ELBO:

\[ \mathcal{L}(x;\theta,\phi) = \mathbb{E}_{q_\phi(z\mid x)}[\log p_\theta(x\mid z)] - \mathrm{KL}\big(q_\phi(z\mid x)\,\|\,p(z)\big). \]

Now each term gets an application-facing interpretation:

  • the reconstruction term rewards approximate posteriors that keep enough information to explain the data well
  • the KL term penalizes approximate posteriors that drift too far from the reference prior

So the optimization is balancing:

  • fidelity to observed data
  • simplicity or compression of the latent representation

That is why ELBO-based training feels like both approximate Bayesian inference and structured representation learning.
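
A minimal end-to-end sketch of this objective, assuming a toy linear Gaussian encoder and decoder (the architecture, dimensions, and stand-in data are illustrative assumptions, not the setup of any particular model or paper):

```python
# One-sample Monte Carlo ELBO for a toy Gaussian VAE.
# Every architectural choice below is an illustrative assumption.
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence

class Encoder(nn.Module):
    """Amortized approximate posterior q_phi(z | x)."""
    def __init__(self, x_dim, z_dim):
        super().__init__()
        self.net = nn.Linear(x_dim, 2 * z_dim)

    def forward(self, x):
        mu, log_sigma = self.net(x).chunk(2, dim=-1)
        return Normal(mu, log_sigma.exp())

class Decoder(nn.Module):
    """Likelihood p_theta(x | z) with fixed unit observation noise."""
    def __init__(self, z_dim, x_dim):
        super().__init__()
        self.net = nn.Linear(z_dim, x_dim)

    def forward(self, z):
        mean = self.net(z)
        return Normal(mean, torch.ones_like(mean))

def elbo(x, encoder, decoder, prior):
    q = encoder(x)                          # q_phi(z | x)
    z = q.rsample()                         # reparameterized sample, keeps gradients w.r.t. phi
    recon = decoder(z).log_prob(x).sum(-1)  # one-sample estimate of E_q[log p_theta(x | z)]
    kl = kl_divergence(q, prior).sum(-1)    # KL(q_phi(z | x) || p(z)), closed form here
    return (recon - kl).mean()              # average over the batch

x_dim, z_dim = 8, 2
enc, dec = Encoder(x_dim, z_dim), Decoder(z_dim, x_dim)
prior = Normal(torch.zeros(z_dim), torch.ones(z_dim))
x = torch.randn(16, x_dim)                  # stand-in data batch
loss = -elbo(x, enc, dec, prior)            # maximize the ELBO by minimizing its negative
loss.backward()
```

Taking a gradient step on `loss` updates \(\theta\) and \(\phi\) together, which is exactly the sense in which optimization is the computational engine here.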

6 Implementation or Computation Note

The main computational choices here are:

  1. Variational family choice: how simple should the approximate posterior family be?

  2. Objective choice: are you optimizing the standard ELBO, a weighted variant (see the sketch after this list), or another tractable bound?

  3. Approximation bias: what posterior structure are you giving up in exchange for tractability?
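
For the objective-choice point above, one common illustration is a weighted variant of the ELBO in the spirit of \(\beta\)-VAE-style objectives. This sketch plugs into the toy encoder, decoder, and prior from the walkthrough above; the weight is an illustrative knob, not something the formulas on this page prescribe.

```python
# Weighted ELBO: beta = 1 recovers the standard ELBO; other values rebalance
# reconstruction against the KL term. Reuses the toy pieces from the sketch above.
from torch.distributions import kl_divergence

def weighted_elbo(x, encoder, decoder, prior, beta=1.0):
    q = encoder(x)
    z = q.rsample()
    recon = decoder(z).log_prob(x).sum(-1)
    kl = kl_divergence(q, prior).sum(-1)
    return (recon - beta * kl).mean()
```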

Strong next bridges already live on the site; see the Paper Bridge and Sources sections below.

7 Failure Modes

  • treating the ELBO as if it were the true posterior objective rather than a surrogate
  • forgetting that the variational family itself can impose strong approximation bias
  • reading the KL term only as “regularization” and losing its approximation meaning
  • confusing a tractable bound with an exact Bayesian answer
  • using amortized inference machinery without first identifying what posterior is being approximated

8 Paper Bridge

  • CS236 Lecture 5 - First pass - useful once latent-variable models and variational inference become concrete modeling tools. Checked 2026-04-26.
  • CS236 Lecture 6 - Bridge to modern generative models - useful once ELBO terms start to be read as design choices rather than just formulas. Checked 2026-04-26.

9 Sources and Further Reading

  • CS236 / Deep Generative Models - First pass - official Stanford course anchor for latent-variable modeling and variational inference. Checked 2026-04-26.
  • CS236 Lecture 5 - First pass - official Stanford slides on variational inference and the ELBO. Checked 2026-04-26.
  • CS236 Lecture 6 - Second pass - official Stanford slides on deeper variational-modeling choices. Checked 2026-04-26.
  • Auto-Encoding Variational Bayes - Primary source - foundational paper for the VAE and ELBO training story. Checked 2026-04-26.
  • Deep Variational Information Bottleneck - Bridge to representation learning - useful for the compression-style interpretation of KL penalties. Checked 2026-04-26.