ERM, Population Risk, and Hypothesis Classes
empirical risk minimization, population risk, hypothesis class, generalization, loss function
1 Role
This is the first page of the Learning Theory module.
Its job is to formalize the learning problem before the module starts proving guarantees.
A lot of confusion in theory-heavy ML comes from mixing together:
- the sample you observed
- the distribution you actually care about
- the loss you are minimizing
- the function class you allowed yourself to search over
This page separates those pieces cleanly.
2 First-Pass Promise
Read this page first in the module.
If you stop here, you should still understand:
- what supervised learning is as a mathematical object
- what a hypothesis class is
- what empirical risk and population risk are
- why ERM is the natural starting point for generalization theory
- why low training error alone is not enough
3 Why It Matters
Machine learning practice often starts with:
- choose a model family
- choose a loss
- fit on data
- evaluate on held-out data
Learning theory asks what that process means mathematically.
The central question is not:
did optimization reduce the training loss?
The central question is:
did the predictor we found achieve low risk under the data-generating distribution?
That single change of question forces the whole theory stack to appear:
- probability because the sample is random
- statistics because the empirical risk is only an estimate of the unknown population risk
- optimization because ERM is an optimization problem
- complexity theory of function classes because too much flexibility breaks naive guarantees
4 Prerequisite Recall
- probability gives you a data-generating distribution and random samples
- statistics already introduced estimators, validation, and the difference between sample quantities and population quantities
- optimization already framed fitting as minimizing an objective over a feasible set
5 Intuition
5.1 The Data Comes From A Distribution
In learning theory, examples are usually modeled as i.i.d. draws
\[ (X,Y) \sim P \]
from an unknown distribution \(P\) over inputs and labels.
You do not directly optimize against \(P\), because you do not know it. You only see a finite i.i.d. sample from it.
5.2 Hypothesis Class
A hypothesis class is the family of predictors you allow yourself to choose from.
Examples:
- all threshold classifiers on the line
- all linear classifiers in \(\mathbb{R}^d\)
- all predictors representable by some neural-network architecture
The class matters because learning is not just about one predictor. It is about selecting one predictor from a family using finite data.
5.3 Empirical Risk vs Population Risk
The population risk is what you actually care about:
\[ R(h)=\mathbb{E}[\ell(h(X),Y)]. \]
The empirical risk is what you can compute from data:
\[ \widehat{R}_n(h)=\frac{1}{n}\sum_{i=1}^n \ell(h(X_i),Y_i). \]
Learning theory begins by studying when minimizing \(\widehat{R}_n(h)\) is a good proxy for minimizing \(R(h)\).
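To make the distinction concrete, here is a minimal Python sketch. The toy distribution, the 10% label noise, the fixed threshold predictor, and the sample sizes are all illustrative assumptions for this page, not part of the definitions above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Draw n i.i.d. pairs: X ~ Uniform(0, 1), Y = 1{X >= 0.5} with 10% label noise (toy assumption)."""
    x = rng.uniform(0.0, 1.0, size=n)
    y_clean = (x >= 0.5).astype(int)
    flip = rng.uniform(size=n) < 0.1
    return x, np.where(flip, 1 - y_clean, y_clean)

def h(x):
    """A fixed threshold predictor h(x) = 1{x >= 0.6}, chosen arbitrarily for illustration."""
    return (x >= 0.6).astype(int)

# Population risk R(h) under zero-one loss, worked out exactly for this toy distribution:
# P(X < 0.5) * 0.1 + P(0.5 <= X < 0.6) * 0.9 + P(X >= 0.6) * 0.1 = 0.18
population_risk = 0.5 * 0.1 + 0.1 * 0.9 + 0.4 * 0.1

for n in [10, 100, 10_000]:
    x, y = sample(n)
    empirical_risk = np.mean(h(x) != y)  # \widehat{R}_n(h): an average of n i.i.d. losses
    print(f"n={n:>6}  empirical risk = {empirical_risk:.3f}   population risk = {population_risk:.3f}")
```

The empirical risk is itself random: rerunning with a different seed gives different values for small \(n\), while for large \(n\) it concentrates around \(R(h)\).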
6 Formal Core
Definition 1 (Definition: Learning Problem) Fix:
- an input space \(\mathcal{X}\)
- an output space \(\mathcal{Y}\)
- a distribution \(P\) on \(\mathcal{X}\times\mathcal{Y}\)
- a loss function \(\ell : \mathcal{Y}\times\mathcal{Y}\to\mathbb{R}_{\ge 0}\)
- a hypothesis class \(\mathcal{H}\) of predictors \(h : \mathcal{X}\to\mathcal{Y}\)
The goal is to find \(h \in \mathcal{H}\) with low population risk
\[ R(h)=\mathbb{E}_{(X,Y)\sim P}[\ell(h(X),Y)]. \]
Definition 2 (Definition: Empirical Risk Minimization) Given an i.i.d. sample
\[ S=\{(X_1,Y_1),\dots,(X_n,Y_n)\}, \]
with
\[ ((X_1,Y_1),\dots,(X_n,Y_n)) \sim P^n, \]
the empirical risk of \(h\) is
\[ \widehat{R}_n(h)=\frac{1}{n}\sum_{i=1}^n \ell(h(X_i),Y_i). \]
When the minimum is attained, an empirical risk minimizer is any hypothesis \(\hat{h}_n \in \mathcal{H}\) satisfying
\[ \hat{h}_n \in \arg\min_{h\in\mathcal{H}} \widehat{R}_n(h). \]
When no exact minimizer exists, one typically works with the infimum or an approximate minimizer instead.
This is the simplest mathematical model of fitting a model to the training sample.
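For a finite hypothesis class, the definition can be written as a direct search. The sketch below is illustrative only; the helper names empirical_risk and erm are invented for this page rather than taken from any library.

```python
import numpy as np

def empirical_risk(h, loss, xs, ys):
    """\widehat{R}_n(h): average loss of h over the sample."""
    return np.mean([loss(h(x), y) for x, y in zip(xs, ys)])

def erm(hypotheses, loss, xs, ys):
    """Return a hypothesis in the (finite) class with minimal empirical risk."""
    return min(hypotheses, key=lambda h: empirical_risk(h, loss, xs, ys))

# Example usage with zero-one loss and two constant classifiers (purely illustrative).
zero_one = lambda prediction, label: float(prediction != label)
hypotheses = [lambda x: 0, lambda x: 1]
xs, ys = [0.2, 0.7, 0.9], [0, 1, 1]
best = erm(hypotheses, zero_one, xs, ys)  # picks the constant-1 classifier on this sample
```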
Definition 3 (Definition: Best-In-Class Predictor) When a minimizer exists, the population benchmark inside the class is
\[ h^\ast_{\mathcal{H}} \in \arg\min_{h\in\mathcal{H}} R(h). \]
Even if \(\mathcal{H}\) does not contain the true rule, this is the best predictor available inside the chosen class.
When no minimizer exists, one instead compares to the infimum of \(R(h)\) over \(h \in \mathcal{H}\).
The learning-theory problem is often to compare the risk \(R(\hat{h}_n)\) of the ERM predictor with the best-in-class risk \(R(h^\ast_{\mathcal{H}})\).
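One standard way to set up that comparison (a routine decomposition in statistical learning theory, stated here as context rather than proved) splits the excess risk of an exact empirical risk minimizer into three pieces:
\[
R(\hat{h}_n) - R(h^\ast_{\mathcal{H}})
= \bigl(R(\hat{h}_n)-\widehat{R}_n(\hat{h}_n)\bigr)
+ \bigl(\widehat{R}_n(\hat{h}_n)-\widehat{R}_n(h^\ast_{\mathcal{H}})\bigr)
+ \bigl(\widehat{R}_n(h^\ast_{\mathcal{H}})-R(h^\ast_{\mathcal{H}})\bigr)
\;\le\; 2\,\sup_{h\in\mathcal{H}}\bigl|\widehat{R}_n(h)-R(h)\bigr|,
\]
since the middle term is at most zero by the definition of \(\hat{h}_n\). Bounding the worst-case gap between empirical and population risk over the class therefore controls how far ERM can fall behind the best-in-class predictor, which is exactly the question generalization theory studies.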
7 Worked Example
Consider binary classification on the real line with threshold predictors
\[ h_t(x)=\mathbf{1}\{x\ge t\}. \]
Here the hypothesis class is
\[ \mathcal{H}=\{h_t : t \in \mathbb{R}\}. \]
Suppose the loss is zero-one loss:
\[ \ell(h(x),y)=\mathbf{1}\{h(x)\ne y\}. \]
Then the population risk is
\[ R(h_t)=\mathbb{P}(h_t(X)\ne Y), \]
which is the true classification error under the unknown distribution.
From a sample
\[ S=\{(X_1,Y_1),\dots,(X_n,Y_n)\}, \]
the empirical risk is
\[ \widehat{R}_n(h_t)=\frac{1}{n}\sum_{i=1}^n \mathbf{1}\{h_t(X_i)\ne Y_i\}, \]
which is the training error.
ERM chooses the threshold that minimizes the observed training error over all thresholds.
This example already shows the central tension:
- you can minimize training error exactly over the sample
- but the real question is whether the chosen threshold also has low error under the distribution
That gap between empirical and population performance is where generalization theory starts.
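A minimal simulation of this worked example makes the tension visible; the distribution, the 10% label noise, and the sample size are illustrative assumptions. The threshold chosen by ERM has training error at least as low as any other threshold on the sample, but its population error is a separate quantity.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy distribution: X ~ Uniform(0, 1), Y = 1{X >= 0.5} with 10% label noise (assumed for illustration).
def sample(n):
    x = rng.uniform(0.0, 1.0, size=n)
    y_clean = (x >= 0.5).astype(int)
    flip = rng.uniform(size=n) < 0.1
    return x, np.where(flip, 1 - y_clean, y_clean)

def empirical_error(t, x, y):
    """Training error of the threshold classifier h_t(x) = 1{x >= t} on the sample."""
    return np.mean((x >= t).astype(int) != y)

def population_error(t):
    """Exact R(h_t) under zero-one loss for this toy distribution, for 0 <= t <= 1."""
    if t <= 0.5:
        return t * 0.1 + (0.5 - t) * 0.9 + 0.5 * 0.1
    return 0.5 * 0.1 + (t - 0.5) * 0.9 + (1 - t) * 0.1

x, y = sample(30)

# ERM over thresholds: the training error only changes at observed points, so it is
# enough to check thresholds at the sample points plus the endpoints.
candidates = np.concatenate(([0.0], np.sort(x), [1.0]))
t_hat = min(candidates, key=lambda t: empirical_error(t, x, y))

print(f"ERM threshold             : {t_hat:.3f}")
print(f"training error at t_hat   : {empirical_error(t_hat, x, y):.3f}")
print(f"population error at t_hat : {population_error(t_hat):.3f}")
print(f"best possible in class    : {population_error(0.5):.3f}")
```

On small samples the training error at the chosen threshold is typically below its population error, which is the gap the next pages study.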
8 Computation Lens
ERM is not only a theoretical definition. It is the abstract form behind many training procedures:
- least squares minimizes empirical squared loss
- logistic regression minimizes empirical log loss
- SVM-style methods minimize empirical surrogate losses plus regularization
- neural-network training usually minimizes empirical loss over a huge hypothesis class
So learning theory does not replace optimization. It asks when optimization on data produces a predictor that is actually useful beyond that data.
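To connect the first bullet above to the definitions, here is a minimal sketch in which ordinary least squares is ERM with the class of linear predictors and squared loss; the synthetic data, dimensions, and noise level are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic regression data: y = <w_true, x> + noise (all values illustrative).
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Hypothesis class: h_w(x) = <w, x>. Empirical risk: mean squared loss on the sample.
def empirical_risk(w):
    return np.mean((X @ w - y) ** 2)

# ERM in closed form: least squares minimizes the empirical squared loss over all w.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print("empirical risk at w_hat :", empirical_risk(w_hat))
print("empirical risk at w_true:", empirical_risk(w_true))  # never lower, since w_hat is the empirical minimizer
```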
9 Application Lens
9.1 Generalization
This page gives the main vocabulary needed to understand generalization results:
- sample
- distribution
- class
- empirical risk
- population risk
Every later theorem is really a refinement of how those pieces relate.
9.2 Regularization And Model Class Design
Regularization can be read as changing the effective search problem:
- shrink the class
- penalize complexity inside the class
- trade training fit for better population behavior
That is why model complexity and regularization naturally belong in the same theory conversation.
9.3 Modern ML
In modern overparameterized ML, the class may be huge and the optimizer may find interpolating solutions. Learning theory then asks why the resulting predictor can still have low population risk.
10 Stop Here For First Pass
If you can now explain:
- what a hypothesis class is
- why empirical risk is different from population risk
- why ERM is the clean first mathematical model of fitting
- why low training error alone does not prove low test or population error
then this page has done its first-pass job.
11 Go Deeper
The next pages in the Learning Theory module build on this setup. For now, the best live next steps are the sources listed below.
12 Sources and Further Reading
- Stanford STATS214 / CS229M: Machine Learning Theory - First pass - current official course page showing the modern theory arc from generalization to deep-learning theory. Checked 2026-04-25.
- Stanford CS229T Notes - First pass - official notes that formalize the learning setup, ERM, and generalization questions very clearly. Checked 2026-04-25.
- Stanford CS229T / STATS231: Statistical Learning Theory - First pass - official archived course page with a clean statistical learning theory backbone. Checked 2026-04-25.
- MIT 6.036: Introduction to Machine Learning - Second pass - official course page that gives a practical bridge from fitting objectives to learning-theory language. Checked 2026-04-25.
- Berkeley CS281B / STAT241B: Statistical Learning Theory - Second pass - strong official course page for deeper statistical learning theory. Checked 2026-04-25.