Generalization in Modern Regimes

How interpolation, overparameterization, implicit bias, and data geometry changed the way learning theory talks about generalization.
Modified: April 26, 2026

Keywords

interpolation, double descent, benign overfitting, implicit bias, overparameterization

1 Role

This is the seventh page of the Learning Theory module.

The earlier pages built the classical first-pass spine:

  • ERM and risk
  • PAC language
  • VC and Rademacher capacity
  • stability and regularization

This page explains what changed when modern ML started routinely operating in regimes with:

  • zero or near-zero training error
  • many more parameters than samples
  • strong optimization effects
  • generalization behavior that classical worst-case stories do not fully explain by themselves

2 First-Pass Promise

Read this page after Algorithmic Stability and Regularization.

If you stop here, you should still understand:

  • what people mean by modern regimes
  • why interpolation and overparameterization became central
  • what double descent, benign overfitting, and implicit bias are trying to explain
  • why the modern picture uses classical theory as a base, but not as the whole story

3 Why It Matters

Classical learning theory often studies the question:

how can a not-too-rich class generalize from finite data?

Modern practice created a sharper puzzle:

how can highly overparameterized models fit the training data almost perfectly and still predict well?

That question pushed theory toward:

  • interpolation instead of small training error
  • optimizer-dependent bias instead of class capacity alone
  • data geometry and effective dimension instead of raw parameter count
  • phenomena like double descent and benign overfitting

This page is not about replacing classical theory. It is about seeing how the conversation widened.

4 Prerequisite Recall

  • ERM chooses a predictor from data
  • VC/Rademacher bounds control richness of classes
  • stability studies sensitivity of the learning procedure
  • optimization can create implicit preferences even without explicit penalties
  • regularization can be explicit or algorithmic

5 Intuition

5.1 Interpolation

In many modern settings, the learned model fits the training data almost exactly:

\[ \widehat R_n(\hat h)\approx 0. \]

Classically, this would sound dangerous, especially under noise.

Yet in overparameterized models, perfect or near-perfect fitting can still coexist with respectable test performance.
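Here is a minimal sketch of that coexistence. Everything in it (the Gaussian data, the sample size, the noise level, the use of a minimum-norm least-squares fit) is an illustrative assumption rather than anything the theory prescribes; the point is only that the training error is numerically zero while the test error stays finite.

```python
# Sketch: an overparameterized linear model that exactly fits noisy data.
# All numbers below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 500                                  # far more parameters than samples
w_star = rng.normal(size=d) / np.sqrt(d)        # ground-truth linear target

X_train = rng.normal(size=(n, d))
y_train = X_train @ w_star + 0.1 * rng.normal(size=n)    # noisy labels
X_test = rng.normal(size=(2000, d))
y_test = X_test @ w_star

# For an underdetermined system, lstsq returns the minimum-norm interpolator.
w_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

print("train MSE:", np.mean((X_train @ w_hat - y_train) ** 2))   # ~ 0: interpolation
print("test  MSE:", np.mean((X_test @ w_hat - y_test) ** 2))     # finite, not exploding
print("trivial baseline (predict 0):", np.mean(y_test ** 2))
```

How far below the trivial baseline the test error lands depends on the data geometry, which is exactly the subject of the next subsection.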

5.2 Overparameterization

Having more parameters than samples does not automatically imply bad generalization.

What matters more is often:

  • which solution among many interpolating ones is selected
  • how the data are distributed in feature space
  • what geometry or norm the optimizer implicitly favors
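The first point in this list can be seen directly in a toy comparison (a sketch with assumed synthetic data, not a result from any paper): two solutions that both fit the training set exactly can behave very differently on fresh data, because one carries a large component the training data cannot see.

```python
# Sketch: two interpolating solutions, same training fit, different test behavior.
import numpy as np

rng = np.random.default_rng(1)
n, d = 40, 300
w_star = rng.normal(size=d) / np.sqrt(d)
X_train = rng.normal(size=(n, d))
y_train = X_train @ w_star
X_test = rng.normal(size=(2000, d))
y_test = X_test @ w_star

# Interpolator 1: the minimum-norm solution.
w_min, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Interpolator 2: add a large component from the null space of X_train.
# It still fits the training data exactly, but it is a very different function.
_, _, Vt = np.linalg.svd(X_train, full_matrices=True)
w_big = w_min + 10.0 * Vt[-1]          # Vt[-1] is orthogonal to every training row

for name, w in [("min-norm  ", w_min), ("large-norm", w_big)]:
    train_mse = np.mean((X_train @ w - y_train) ** 2)
    test_mse = np.mean((X_test @ w - y_test) ** 2)
    print(name, f"train MSE {train_mse:.1e}   test MSE {test_mse:.2f}")
```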

5.3 No Single Modern Theorem

The modern-regimes picture is not one theorem replacing VC theory.

It is a cluster of ideas:

  • double descent: test error can rise near interpolation and then fall again
  • benign overfitting: exact fitting of noisy data can still generalize well
  • implicit bias: optimization selects special solutions among many possibilities
  • effective complexity: parameter count alone is too crude

6 Formal Core

This page stays deliberately first-pass and language-level rather than turning into a catalog of sharp modern theorems.

Definition 1 (Interpolation) A learning procedure interpolates the training data if it achieves essentially zero training error, often exactly

\[ \widehat R_n(\hat h)=0 \]

in the chosen loss.

Definition 2 (Double Descent) Double descent refers to the phenomenon that test error can follow a curve with:

  1. an initial classical descent as model size grows
  2. a peak near the interpolation threshold
  3. a second descent in more overparameterized regimes
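A common way to probe this curve is a model-size sweep with random features and a (minimum-norm) least-squares fit, sketched below. The data, noise level, and feature map are assumptions, and the exact shape of the curve depends on all of them; in many such runs, though, the test error spikes near \(p \approx n\) and then falls again.

```python
# Sketch: a model-size sweep that often produces a double-descent-shaped curve.
import numpy as np

rng = np.random.default_rng(2)
n, n_test, d = 100, 2000, 20
X = rng.normal(size=(n, d))
X_test = rng.normal(size=(n_test, d))
w_star = rng.normal(size=d) / np.sqrt(d)
y = X @ w_star + 0.2 * rng.normal(size=n)        # noisy training labels
y_test = X_test @ w_star

def relu_features(A, W):
    return np.maximum(A @ W, 0.0)                # fixed random first layer + ReLU

for p in [10, 50, 90, 100, 110, 200, 500, 2000]:  # number of random features
    W = rng.normal(size=(d, p)) / np.sqrt(d)
    Phi, Phi_test = relu_features(X, W), relu_features(X_test, W)
    # least-squares fit; minimum-norm once p >= n (past the interpolation threshold)
    c, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    print(f"p = {p:5d}   test MSE = {np.mean((Phi_test @ c - y_test) ** 2):.3f}")
```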

Definition 3 (Benign Overfitting) Benign overfitting means that a model can fit noisy training data exactly or nearly exactly while still achieving low test error.

Definition 4 (Implicit Bias) Implicit bias is the preference induced by the optimization method itself, even when no explicit regularizer is written into the objective.
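A minimal sketch of this, assuming underdetermined least squares, full-batch gradient descent, a zero initialization, and illustrative sizes: the objective below contains no penalty on \(\lVert w\rVert\), yet the iterates converge to the minimum-norm interpolating solution.

```python
# Sketch: gradient descent with no explicit regularizer still lands on the
# minimum-norm interpolator when started from zero (underdetermined least squares).
import numpy as np

rng = np.random.default_rng(3)
n, d = 30, 200
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

w = np.zeros(d)                          # zero initialization
lr = 0.5 / np.linalg.norm(X, 2) ** 2     # conservative step size for squared loss
for _ in range(5000):
    w -= lr * X.T @ (X @ w - y)          # plain gradient descent, no penalty term

w_min_norm = np.linalg.pinv(X) @ y       # the explicit minimum-norm interpolator

print("training residual:", np.linalg.norm(X @ w - y))                   # ~ 0
print("distance to min-norm solution:", np.linalg.norm(w - w_min_norm))  # ~ 0
```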

Theorem 1 (Theorem-Level Message) The first-pass modern lesson is not a single universal bound. It is that generalization can depend on an interaction among:

  • the hypothesis class
  • the data distribution
  • the optimization procedure
  • the specific interpolating or near-interpolating solution selected

So the right explanatory object is often more refined than parameter count alone.

7 Worked Example

Consider overparameterized linear regression with more parameters than samples.

There are often infinitely many vectors \(w\) that interpolate the training data:

\[ Xw = y. \]

So the real question is not just:

does interpolation happen?

but rather:

which interpolating solution is chosen?

Gradient descent started from zero tends to select the minimum-norm interpolating solution. In some data regimes, that selected solution can still predict well on fresh data.
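Why zero-initialized gradient descent lands on that particular solution can be sketched in two lines (assuming squared loss, a suitably small step size, and \(X\) with full row rank). The update

\[ w_{t+1} = w_t - \eta\, X^{\top}(X w_t - y) \]

keeps every iterate in the row space of \(X\) when \(w_0 = 0\). The solution set of \(Xw = y\) meets that row space in exactly one point,

\[ w_{\infty} = X^{\top}(X X^{\top})^{-1} y = \arg\min\{\lVert w\rVert_2 : Xw = y\}, \]

which is the minimum-norm interpolator.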

That simple example already shows the modern shift:

  • the hypothesis space is huge
  • exact fitting is possible
  • the optimizer chooses one special solution
  • generalization depends on that solution’s geometry and the data covariance, not just on raw model size

This is the kind of story behind many modern-regime results.

8 Computation Lens

Modern regimes force theory to pay attention to the training algorithm itself.

The questions become:

  • what solution does gradient descent find?
  • how do initialization and step size matter?
  • what norm, margin, or geometry is implicitly favored?
  • what happens once interpolation is reached?
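The second of these questions can already be made concrete in the underdetermined linear setting from the worked example (a sketch with assumed synthetic sizes): gradient descent converges to the interpolating solution closest to its initialization, so moving the starting point changes which interpolator gets selected.

```python
# Sketch: the initialization of gradient descent selects which interpolator it finds.
import numpy as np

rng = np.random.default_rng(4)
n, d = 30, 200
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
lr = 0.5 / np.linalg.norm(X, 2) ** 2

def run_gd(w0, steps=5000):
    w = w0.copy()
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y)
    return w

w_from_zero = run_gd(np.zeros(d))
w0 = 3.0 * rng.normal(size=d)
w_from_w0 = run_gd(w0)

# The interpolator closest to w0 in Euclidean norm: w0 + pinv(X) (y - X w0).
closest_to_w0 = w0 + np.linalg.pinv(X) @ (y - X @ w0)

print("both interpolate:",
      np.linalg.norm(X @ w_from_zero - y), np.linalg.norm(X @ w_from_w0 - y))
print("the two runs find different solutions:",
      np.linalg.norm(w_from_zero - w_from_w0))
print("GD from w0 matches its closest interpolator:",
      np.linalg.norm(w_from_w0 - closest_to_w0))
```

In this linear case, any sufficiently small step size reaches the same limit; the step size mainly controls speed and stability, while the initialization and the geometry of \(X\) govern which solution is selected.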

This is why modern generalization theory often sits at the intersection of:

  • optimization
  • probability
  • linear algebra
  • classical statistical learning theory

9 Application Lens

9.1 Double Descent

Double descent changed the public story around capacity. It showed that test error can behave non-monotonically across the interpolation threshold, so the old “bigger model means worse generalization” slogan is too crude.

9.2 Benign Overfitting

Benign overfitting tells us that fitting noisy data is not automatically catastrophic. Whether it is harmful depends on the selected interpolating solution and the data geometry.

9.3 Implicit Bias

Implicit bias explains why optimization is not just a computational detail. The optimizer can act like a hidden regularizer.

9.4 Reading Modern Theory Papers

When a paper in this area claims a generalization result, good first questions are:

  • what exactly is overparameterized?
  • what is interpolating?
  • what solution does the algorithm converge to?
  • what notion of effective complexity replaces raw parameter count?

10 Stop Here For First Pass

If you can now explain:

  • why modern regimes created a new generalization puzzle
  • what double descent, benign overfitting, and implicit bias are trying to capture
  • why parameter count alone is often too crude
  • why optimization and data geometry now sit inside the generalization story

then this page has done its job.

11 Go Deeper

After this page, the next natural directions are the adjacent pages of this module.

