PAC Learning, Sample Complexity, and the Learning Setup
PAC learning, sample complexity, realizable case, agnostic learning, confidence
1 Role
This is the second page of the Learning Theory module.
The first page defined the learning problem through data distributions, losses, empirical risk, and hypothesis classes.
This page asks the next natural question:
what does it mean to learn reliably from finitely many samples?
That is the job of the PAC viewpoint and the language of sample complexity.
2 First-Pass Promise
Read this page after ERM, Population Risk, and Hypothesis Classes.
If you stop here, you should still understand:
- what "probably approximately correct" is trying to formalize
- what sample complexity measures
- why \(\varepsilon\) and \(\delta\) are the two main knobs in many learning guarantees
- why finite-sample guarantees depend on both the class and the learning setup
3 Why It Matters
Learning theory is not satisfied with saying:
this method often works in practice
It wants a statement of the form:
with high probability over the training sample, the learned predictor has small population error
That sentence has three important pieces:
- small error: the approximation target
- high probability: the confidence level
- over the training sample: the source of randomness in the guarantee
PAC learning packages those pieces into a clean framework.
This matters because once you understand PAC language, many theorem statements in ML become much less intimidating. They are often variations on the same core question:
how many samples are enough to guarantee good out-of-sample performance?
4 Prerequisite Recall
- empirical risk lives on the observed sample
- population risk lives under the unknown data distribution
- the training sample is usually modeled as i.i.d.
- ERM chooses a predictor by minimizing empirical risk over a class
5 Intuition
5.1 The Two Guarantee Knobs
PAC-style guarantees are built around two parameters:
- \(\varepsilon\): how accurate you want the predictor to be
- \(\delta\): how much failure probability you are willing to tolerate
Smaller \(\varepsilon\) means a tighter accuracy target. Smaller \(\delta\) means a stronger confidence demand.
Both usually require more data.
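To see the two knobs quantitatively in the simplest possible case, consider estimating the population error of a single fixed classifier from \(n\) samples. Hoeffding's inequality (a standard concentration fact, used here as an illustration rather than as part of this page's formal development) says that \(n \ge \log(2/\delta)/(2\varepsilon^2)\) suffices to estimate that error within \(\varepsilon\), with confidence \(1-\delta\). The sketch below tabulates that threshold; the helper name hoeffding_n is ours.

```python
import math

def hoeffding_n(eps: float, delta: float) -> int:
    """Samples sufficient (by Hoeffding's inequality) to estimate one fixed
    classifier's population error within eps, with probability >= 1 - delta."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

# Tightening either knob increases the data requirement.
for eps in (0.10, 0.05, 0.01):
    for delta in (0.10, 0.01):
        print(f"eps={eps:.2f}  delta={delta:.2f}  n={hoeffding_n(eps, delta)}")
```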
5.2 Sample Complexity
The sample-complexity question asks:
how large must \(n\) be so that learning succeeds at accuracy \(\varepsilon\) and confidence \(1-\delta\)?
It is not one universal number. It depends on:
- the hypothesis class
- the loss/setup
- the algorithm
- whether the world is realizable or agnostic
5.3 Realizable vs Agnostic
In the realizable setting, you assume the class contains a perfect predictor for the task.
In the agnostic setting, you do not. Then the goal is to compete with the best predictor in the class, not with perfection.
This distinction is one of the first big forks in learning theory.
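To make the fork concrete, here is a toy data-generating sketch (our hypothetical setup, not from the page: threshold classifiers \(h_t(x)=\mathbb{1}[x>t]\) on \([0,1]\)). With no label noise the problem is realizable for this class; with noise, no classifier is perfect and the agnostic comparison target applies.

```python
import random

random.seed(0)

def sample(n: int, noise: float):
    """Draw n points x ~ Uniform[0, 1] labeled y = 1[x > 0.5].
    noise = 0.0: realizable for threshold classifiers (h_{0.5} is perfect).
    noise > 0.0: each label flips independently, so no classifier is perfect
    and the best-in-class error is roughly the noise rate."""
    data = []
    for _ in range(n):
        x = random.random()
        y = int(x > 0.5)
        if random.random() < noise:
            y = 1 - y
        data.append((x, y))
    return data

realizable_data = sample(1000, noise=0.0)
agnostic_data = sample(1000, noise=0.1)
```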
6 Formal Core
To keep the first pass concrete, the formal statements below specialize to binary classification with zero-one loss. That is the cleanest setting for first learning the PAC language.
Definition 1 (Definition Idea: PAC Guarantee In The Realizable Setting) A learner is probably approximately correct in the realizable setting if, for every accuracy \(\varepsilon \in (0,1)\), every confidence parameter \(\delta \in (0,1)\), and every distribution \(P\) that is realizable by the class \(\mathcal{H}\), it returns a predictor \(\hat h\) such that, with probability at least \(1-\delta\) over the i.i.d. training sample,
\[ R(\hat h)\le \varepsilon, \]
once the sample size exceeds a threshold that may depend on \(\varepsilon\) and \(\delta\) but not on \(P\).
Here:
- "probably" refers to confidence at least \(1-\delta\)
- "approximately correct" refers to error at most \(\varepsilon\)
- the randomness is over the training sample
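The definition can also be checked empirically. The simulation below (our construction, reusing the threshold-class toy setup sketched in Section 5.3) runs ERM on many fresh realizable samples and estimates the failure probability \(\Pr[R(\hat h) > \varepsilon]\), which a PAC guarantee promises will fall below \(\delta\) once \(n\) is large enough.

```python
import random

random.seed(1)

THRESHOLDS = [i / 10 for i in range(11)]   # finite class: h_t(x) = 1[x > t]
TRUE_T = 0.5                               # realizable: h_{0.5} has zero error

def population_risk(t: float) -> float:
    """Zero-one error of h_t when x ~ Uniform[0, 1] and y = 1[x > 0.5]."""
    return abs(t - TRUE_T)

def erm(data):
    """Threshold with the smallest empirical zero-one error (ties -> smallest t)."""
    def empirical_risk(t):
        return sum(int(x > t) != y for x, y in data) / len(data)
    return min(THRESHOLDS, key=empirical_risk)

def failure_rate(n: int, eps: float, trials: int = 2000) -> float:
    """Monte Carlo estimate of P(R(ERM output) > eps) over samples of size n."""
    fails = 0
    for _ in range(trials):
        data = [(x, int(x > TRUE_T)) for x in (random.random() for _ in range(n))]
        if population_risk(erm(data)) > eps:
            fails += 1
    return fails / trials

print(failure_rate(n=20, eps=0.05))    # small n: failures are common
print(failure_rate(n=200, eps=0.05))   # larger n: failure probability collapses
```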
Definition 2 (Agnostic Comparison Target) In the agnostic setting, perfection is not assumed.
Instead, the usual target is to return a predictor \(\hat h\) such that, with high probability,
\[ R(\hat h)\le \inf_{h\in\mathcal H}R(h) + \varepsilon. \]
So the learner competes with the best predictor available inside the class, up to an additive tolerance \(\varepsilon\).
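For later comparison with Theorem 1: in this agnostic setting, the standard finite-class analysis (Hoeffding's inequality plus a union bound over \(\mathcal H\), a classical fact not restated on this page) gives a sample complexity of order
\[ \frac{\log|\mathcal H| + \log(1/\delta)}{\varepsilon^{2}}, \]
so competing with the best-in-class predictor costs a \(1/\varepsilon^{2}\) rate where the realizable case pays only \(1/\varepsilon\).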
Definition 3 (Definition: Sample Complexity) The sample complexity is the number of training examples needed so that the PAC-style guarantee holds for chosen values of \(\varepsilon\) and \(\delta\).
So sample complexity is the first formal answer to:
how much data do I need?
Definition 4 (Realizable vs Agnostic Setup)
- Realizable: in binary classification with zero-one loss, there exists some \(h \in \mathcal{H}\) with zero population classification error.
- Agnostic: no perfect hypothesis is assumed; the goal is to compete with the best predictor in \(\mathcal{H}\).
Theorem 1 (Theorem Idea: Finite-Class Sample Complexity) For a finite hypothesis class \(\mathcal{H}\) in realizable binary classification with zero-one loss, ERM returns a predictor with population error at most \(\varepsilon\), with probability at least \(1-\delta\), once the sample size is on the order of
\[ \frac{\log |\mathcal{H}| + \log(1/\delta)}{\varepsilon}. \]
This theorem says that:
- larger classes need more data
- stronger confidence needs more data
- tighter error tolerance needs more data
The exact constants are less important on a first pass than the structural dependence.
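A back-of-the-envelope calculator makes the structural dependence tangible. The explicit constant below comes from the classical realizable argument \(|\mathcal H|(1-\varepsilon)^n \le |\mathcal H| e^{-\varepsilon n} \le \delta\), which rearranges to \(n \ge (\ln|\mathcal H| + \ln(1/\delta))/\varepsilon\); treat it as a sketch of the bound's shape, not a sharp constant.

```python
import math

def realizable_finite_class_n(class_size: int, eps: float, delta: float) -> int:
    """Sample size from |H| * exp(-eps * n) <= delta, i.e.
    n >= (ln|H| + ln(1/delta)) / eps (classical realizable finite-class bound)."""
    return math.ceil((math.log(class_size) + math.log(1 / delta)) / eps)

print(realizable_finite_class_n(4, eps=0.10, delta=0.05))          # baseline
print(realizable_finite_class_n(4_000_000, eps=0.10, delta=0.05))  # larger class
print(realizable_finite_class_n(4, eps=0.01, delta=0.05))          # tighter eps
print(realizable_finite_class_n(4, eps=0.10, delta=0.001))         # smaller delta
```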
7 Worked Example
Suppose your hypothesis class contains only four candidate classifiers:
\[ \mathcal{H}=\{h_1,h_2,h_3,h_4\}. \]
Assume the realizable setting: one of these classifiers has zero population error.
Then a standard finite-class PAC bound says that ERM needs roughly
\[ n \gtrsim \frac{\log |\mathcal{H}| + \log(1/\delta)}{\varepsilon} \]
samples.
Since \(|\mathcal{H}|=4\),
\[ \log |\mathcal{H}| = \log 4, \]
so the dependence on class size is mild here.
But if the class had size \(4{,}000{,}000\), then \(\log |\mathcal{H}|\) would be much larger, and the needed sample size would increase accordingly.
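Plugging in the numbers (natural log here; the base only shifts constants):

```python
import math

print(math.log(4))          # ~1.39
print(math.log(4_000_000))  # ~15.2, roughly 11x the small-class term
```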
The qualitative lesson is more important than the exact arithmetic:
- a richer class gives you more expressive power
- but that expressive power costs sample complexity
This is the first clean form of the bias-complexity tension that later pages make sharper.
8 Computation Lens
PAC theory separates two questions that practice often mixes together:
- can we optimize the training objective?
- if we can, should we expect the learned predictor to generalize?
Optimization answers the first question. Learning theory answers the second.
That is why a model can be easy to fit and still hard to justify statistically.
9 Application Lens
9.1 Generalization And Data Requirements
PAC language is the cleanest first answer to:
- how much data is enough?
- how should confidence be reported?
- why does a richer class demand stronger evidence?
9.2 Model Selection
When practitioners say that a model class is too flexible, PAC/sample-complexity language is one of the first formal ways to explain what that means.
9.3 Modern ML
Modern deep learning often stretches or escapes classical PAC intuitions, but the PAC framework is still one of the main starting points for understanding what later theory is trying to revise, refine, or replace.
10 Stop Here For First Pass
If you can now explain:
- what "probably" and "approximately" refer to in a PAC statement
- what sample complexity is measuring
- why \(\varepsilon\) and \(\delta\) both matter
- why finite-class guarantees get worse as the class gets larger
- why realizable and agnostic settings are not the same problem
then this page has done its first-pass job.
11 Go Deeper
The next natural module step continues the classical PAC/VC route toward capacity-based bounds. For now, the best live next steps are the sources listed below.
12 Sources and Further Reading
- Stanford STATS214 / CS229M: Machine Learning Theory (First pass): current official theory course page that frames generalization and modern learning-theory questions clearly. Checked 2026-04-25.
- Stanford CS229T Notes (First pass): official notes with PAC definitions, finite-class sample complexity, and the realizable/agnostic split. Checked 2026-04-25.
- Stanford CS229T / STATS231: Statistical Learning Theory (First pass): official archived course page supporting the classical PAC/VC route. Checked 2026-04-25.
- Berkeley CS281B / STAT241B: Statistical Learning Theory (Second pass): strong official course page for deeper sample-complexity and capacity arguments. Checked 2026-04-25.