Variational Inference, ELBO, and Tractable Approximation
variational-inference, ELBO, posterior-approximation, latent-variables, inference
1 Application Snapshot
Some inference problems are hard for a simple reason:
the posterior distribution you want is too expensive to compute exactly.
That happens quickly in latent-variable models, hierarchical Bayesian models, and modern generative models.
Variational inference responds by changing the task:
- do not compute the exact posterior directly
- choose a tractable family of approximate posteriors
- optimize the best approximation inside that family
This is why approximate inference so often turns back into optimization.
2 Problem Setting
Suppose a model has latent variables \(z\) and observations \(x\):
\[ p_\theta(x,z)=p(z)p_\theta(x\mid z). \]
The exact posterior
\[ p_\theta(z\mid x) \]
is often the object we want, but it may be intractable because evaluating or normalizing it requires the marginal likelihood
\[ p_\theta(x)=\int p_\theta(x,z)\,dz. \]
Variational inference introduces a tractable approximation
\[ q_\phi(z\mid x), \]
then asks:
which tractable approximate posterior is closest to the true one while still being computationally manageable?
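A standard decomposition makes “closest” precise: for any choice of \(q_\phi\),
\[ \log p_\theta(x) = \mathbb{E}_{q_\phi(z\mid x)}\!\left[\log \frac{p_\theta(x,z)}{q_\phi(z\mid x)}\right] + KL(q_\phi(z\mid x)\,\|\,p_\theta(z\mid x)). \]
The first term is the ELBO. Because the KL term is nonnegative, the ELBO lower-bounds \(\log p_\theta(x)\), and maximizing it over \(\phi\) is equivalent to minimizing the KL divergence between the approximation and the true posterior.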
3 Why This Math Appears
This page sits on top of several site modules:
- Statistics: posterior inference and latent-variable modeling
- Optimization: turning approximation quality into an objective
- Information Theory: KL divergence, mutual information, and tractable bounds
- Signal Processing and Estimation: hidden variables and belief summaries
- Machine Learning: VAEs, representation learning, and amortized inference
So variational inference is one of the clearest places where probability language and optimization language become the same workflow.
4 Math Objects In Use
- latent variable \(z\)
- observed data \(x\)
- model parameters \(\theta\)
- variational parameters \(\phi\)
- approximate posterior \(q_\phi(z\mid x)\)
- KL divergence between approximate and true distributions
- ELBO as a computable surrogate objective
The structural point is simple:
- exact posterior inference is the ideal target
- ELBO is the tractable surrogate
- optimization over \(\theta\) and \(\phi\) is the computational engine
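To make these objects concrete, here is a minimal sketch of one deliberately simple variational family: a diagonal-Gaussian \(q_\phi(z\mid x)\) whose mean and log-variance are linear in \(x\). It is an illustration under that assumption, not the encoder of any particular model; the names `q_params` and `sample_q` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical variational parameters phi: linear maps from x to the mean
# and log-variance of a diagonal-Gaussian q_phi(z | x).
x_dim, z_dim = 4, 2
phi = {
    "W_mu": 0.1 * rng.normal(size=(z_dim, x_dim)),
    "W_logvar": 0.1 * rng.normal(size=(z_dim, x_dim)),
}

def q_params(x, phi):
    """Map an observation x to the parameters of q_phi(z | x)."""
    return phi["W_mu"] @ x, phi["W_logvar"] @ x

def sample_q(x, phi):
    """Draw z ~ q_phi(z | x) via the reparameterization trick."""
    mu, log_var = q_params(x, phi)
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

x = rng.normal(size=x_dim)
z = sample_q(x, phi)  # one latent explanation of x under the approximation
```

Everything downstream (evaluating the ELBO, training by gradient ascent) happens through \(\phi\), which is what turns posterior approximation into an optimization problem.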
5 A Small Worked Walkthrough
Suppose you want to model images using a latent representation \(z\).
The ideal object is the posterior
\[ p_\theta(z\mid x), \]
because it tells you which latent explanations make a given image plausible.
But directly computing it may be too expensive.
So you introduce an encoder-like approximation
\[ q_\phi(z\mid x), \]
and optimize the ELBO:
\[ \mathcal{L}(x;\theta,\phi) = \mathbb{E}_{q_\phi(z\mid x)}[\log p_\theta(x\mid z)] - KL(q_\phi(z\mid x)\|p(z)). \]
Now each term gets an application-facing interpretation:
- the reconstruction term rewards approximate posteriors that keep enough information to explain the data well
- the KL term penalizes approximate posteriors that drift too far from the reference prior
So the optimization is balancing:
- fidelity to observed data
- simplicity or compression of the latent representation
That is why ELBO-based training feels like both approximate Bayesian inference and structured representation learning.
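A minimal numeric sketch of this objective, assuming a standard-normal prior \(p(z)\), a diagonal-Gaussian \(q_\phi(z\mid x)\), and a Bernoulli decoder for binary pixels: the one-sample Monte Carlo estimate of the reconstruction term and the closed-form Gaussian KL below are illustrative only, and `elbo_estimate` and `decode` are hypothetical names rather than any library's API.

```python
import numpy as np

rng = np.random.default_rng(0)

def elbo_estimate(x, mu_q, log_var_q, decode):
    """One-sample Monte Carlo estimate of
    L(x) = E_q[log p_theta(x | z)] - KL(q_phi(z | x) || p(z)),
    assuming p(z) = N(0, I) and a diagonal-Gaussian q_phi(z | x)."""
    # Reparameterized sample z ~ q_phi(z | x)
    eps = rng.normal(size=mu_q.shape)
    z = mu_q + np.exp(0.5 * log_var_q) * eps

    # Reconstruction term: Bernoulli log-likelihood of x under the decoder
    p = decode(z)  # decoder outputs pixel probabilities
    recon = np.sum(x * np.log(p + 1e-8) + (1 - x) * np.log(1 - p + 1e-8))

    # KL(N(mu_q, diag(exp(log_var_q))) || N(0, I)) in closed form
    kl = 0.5 * np.sum(np.exp(log_var_q) + mu_q**2 - 1.0 - log_var_q)

    return recon - kl

# Tiny example: a fixed "decoder" that squashes a linear map of z
W = 0.1 * rng.normal(size=(6, 2))
decode = lambda z: 1.0 / (1.0 + np.exp(-(W @ z)))

x = (rng.random(6) > 0.5).astype(float)    # a fake 6-pixel binary "image"
mu_q, log_var_q = np.zeros(2), np.zeros(2)  # a fixed approximate posterior
print(elbo_estimate(x, mu_q, log_var_q, decode))
```

With learned encoder and decoder parameters, averaging this estimate over a dataset and ascending its gradient in \(\theta\) and \(\phi\) is the training loop behind ELBO-based models such as the VAE.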
6 Implementation or Computation Note
The main computational choices here are:
- Variational family choice: how simple should the approximate posterior family be?
- Objective choice: are you optimizing the standard ELBO, a weighted variant, or another tractable bound? (see the sketch after this list)
- Approximation bias: what posterior structure are you giving up in exchange for tractability?
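To make the first two choices concrete, a small hedged sketch: two tractable Gaussian families differ mainly in which KL computation they admit, and a weighted objective (in the spirit of beta-VAE-style variants) is one line on top of the standard bound. The function names below are hypothetical.

```python
import numpy as np

def kl_diag_gauss_to_std_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)): the mean-field family."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def kl_full_gauss_to_std_normal(mu, Sigma):
    """KL(N(mu, Sigma) || N(0, I)): a richer full-covariance family."""
    k = mu.shape[0]
    _, logdet = np.linalg.slogdet(Sigma)
    return 0.5 * (np.trace(Sigma) + mu @ mu - k - logdet)

def weighted_elbo(recon_term, kl_term, beta=1.0):
    """Objective choice: beta = 1 is the standard ELBO; beta != 1 rebalances
    data fidelity against the KL penalty (a tractable weighted variant)."""
    return recon_term - beta * kl_term
```

The family choice fixes which KL you can compute cheaply, the objective choice fixes how that KL trades off against reconstruction, and together they determine the approximation bias you accept.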
Strong next bridges already live on the site; see the Paper Bridge and the Sources and Further Reading sections below.
7 Failure Modes
- treating the ELBO as if it were the true posterior objective rather than a surrogate
- forgetting that the variational family itself can impose strong approximation bias
- reading the KL term only as “regularization” and losing its approximation meaning
- confusing a tractable bound with an exact Bayesian answer
- using amortized inference machinery without first identifying what posterior is being approximated
8 Paper Bridge
- CS236 Lecture 5 - First pass: useful once latent-variable models and variational inference become concrete modeling tools. Checked 2026-04-26.
- CS236 Lecture 6 - Bridge to modern generative models: useful once ELBO terms start to be read as design choices rather than just formulas. Checked 2026-04-26.
9 Sources and Further Reading
- CS236 / Deep Generative Models - First pass: official Stanford course anchor for latent-variable modeling and variational inference. Checked 2026-04-26.
- CS236 Lecture 5 - First pass: official Stanford slides on variational inference and the ELBO. Checked 2026-04-26.
- CS236 Lecture 6 - Second pass: official Stanford slides on deeper variational-modeling choices. Checked 2026-04-26.
- Auto-Encoding Variational Bayes - Primary source: foundational paper for the VAE and ELBO training story. Checked 2026-04-26.
- Deep Variational Information Bottleneck - Bridge to representation learning: useful for the compression-style interpretation of KL penalties. Checked 2026-04-26.