Generalization in Modern Regimes

How interpolation, overparameterization, implicit bias, and data geometry changed the way learning theory talks about generalization.
Modified: April 26, 2026

Keywords

interpolation, double descent, benign overfitting, implicit bias, overparameterization

1 Role

This is the seventh page of the Learning Theory module.

The earlier pages built the classical first-pass spine:

  • ERM and risk
  • PAC language
  • VC and Rademacher capacity
  • stability and regularization

This page explains what changed when modern ML started routinely operating in regimes with:

  • zero or near-zero training error
  • many more parameters than samples
  • strong optimization effects
  • generalization behavior that classical worst-case stories do not fully explain by themselves

2 First-Pass Promise

Read this page after Algorithmic Stability and Regularization.

If you stop here, you should still understand:

  • what people mean by modern regimes
  • why interpolation and overparameterization became central
  • what double descent, benign overfitting, and implicit bias are trying to explain
  • why the modern picture uses classical theory as a base, but not as the whole story

3 Why It Matters

Classical learning theory often studies the question:

how can a not-too-rich class generalize from finite data?

Modern practice created a sharper puzzle:

how can highly overparameterized models fit the training data almost perfectly and still predict well?

That question pushed theory toward:

  • interpolation instead of small training error
  • optimizer-dependent bias instead of class capacity alone
  • data geometry and effective dimension instead of raw parameter count
  • phenomena like double descent and benign overfitting

This page is not about replacing classical theory. It is about seeing how the conversation widened.

4 Prerequisite Recall

  • ERM chooses a predictor from data
  • VC/Rademacher bounds control richness of classes
  • stability studies sensitivity of the learning procedure
  • optimization can create implicit preferences even without explicit penalties
  • regularization can be explicit or algorithmic

5 Intuition

5.1 Interpolation

In many modern settings, the learned model fits the training data almost exactly:

\[ \widehat R_n(\hat h)\approx 0. \]

Classically, this would sound dangerous, especially under noise.

Yet in overparameterized models, perfect or near-perfect fitting can still coexist with respectable test performance.
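Here is a minimal sketch of that coexistence. Everything in it (the Gaussian data, the sample size, the noise level, the use of a minimum-norm least-squares fit) is an illustrative assumption rather than anything the theory prescribes; the point is only that the training error is numerically zero while the test error stays finite.

```python
# Sketch: an overparameterized linear model that exactly fits noisy data.
# All numbers below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 500                                  # far more parameters than samples
w_star = rng.normal(size=d) / np.sqrt(d)        # ground-truth linear target

X_train = rng.normal(size=(n, d))
y_train = X_train @ w_star + 0.1 * rng.normal(size=n)    # noisy labels
X_test = rng.normal(size=(2000, d))
y_test = X_test @ w_star

# For an underdetermined system, lstsq returns the minimum-norm interpolator.
w_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

print("train MSE:", np.mean((X_train @ w_hat - y_train) ** 2))   # ~ 0: interpolation
print("test  MSE:", np.mean((X_test @ w_hat - y_test) ** 2))     # finite, not exploding
print("trivial baseline (predict 0):", np.mean(y_test ** 2))
```

How far below the trivial baseline the test error lands depends on the data geometry, which is exactly the subject of the next subsection.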

5.2 Overparameterization

Having more parameters than samples does not automatically imply bad generalization.

What matters more is often:

  • which solution among many interpolating ones is selected
  • how the data are distributed in feature space
  • what geometry or norm the optimizer implicitly favors
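The first point in this list can be seen directly in a toy comparison (a sketch with assumed synthetic data, not a result from any paper): two solutions that both fit the training set exactly can behave very differently on fresh data, because one carries a large component the training data cannot see.

```python
# Sketch: two interpolating solutions, same training fit, different test behavior.
import numpy as np

rng = np.random.default_rng(1)
n, d = 40, 300
w_star = rng.normal(size=d) / np.sqrt(d)
X_train = rng.normal(size=(n, d))
y_train = X_train @ w_star
X_test = rng.normal(size=(2000, d))
y_test = X_test @ w_star

# Interpolator 1: the minimum-norm solution.
w_min, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Interpolator 2: add a large component from the null space of X_train.
# It still fits the training data exactly, but it is a very different function.
_, _, Vt = np.linalg.svd(X_train, full_matrices=True)
w_big = w_min + 10.0 * Vt[-1]          # Vt[-1] is orthogonal to every training row

for name, w in [("min-norm  ", w_min), ("large-norm", w_big)]:
    train_mse = np.mean((X_train @ w - y_train) ** 2)
    test_mse = np.mean((X_test @ w - y_test) ** 2)
    print(name, f"train MSE {train_mse:.1e}   test MSE {test_mse:.2f}")
```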

5.3 No Single Modern Theorem

The modern-regimes picture is not one theorem replacing VC theory.

It is a cluster of ideas:

  • double descent: test error can rise near interpolation and then fall again
  • benign overfitting: exact fitting of noisy data can still generalize well
  • implicit bias: optimization selects special solutions among many possibilities
  • effective complexity: parameter count alone is too crude

6 Formal Core

This page stays deliberately first-pass and language-level rather than turning into a catalog of sharp modern theorems.

Definition 1 (Interpolation) A learning procedure interpolates the training data if it achieves essentially zero training error, often exactly

\[ \widehat R_n(\hat h)=0 \]

in the chosen loss.

Definition 2 (Double Descent) Double descent refers to the phenomenon that test error can follow a curve with:

  1. an initial classical descent as model size grows
  2. a peak near the interpolation threshold
  3. a second descent in more overparameterized regimes
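A common way to probe this curve is a model-size sweep with random features and a (minimum-norm) least-squares fit, sketched below. The data, noise level, and feature map are assumptions, and the exact shape of the curve depends on all of them; in many such runs, though, the test error spikes near \(p \approx n\) and then falls again.

```python
# Sketch: a model-size sweep that often produces a double-descent-shaped curve.
import numpy as np

rng = np.random.default_rng(2)
n, n_test, d = 100, 2000, 20
X = rng.normal(size=(n, d))
X_test = rng.normal(size=(n_test, d))
w_star = rng.normal(size=d) / np.sqrt(d)
y = X @ w_star + 0.2 * rng.normal(size=n)        # noisy training labels
y_test = X_test @ w_star

def relu_features(A, W):
    return np.maximum(A @ W, 0.0)                # fixed random first layer + ReLU

for p in [10, 50, 90, 100, 110, 200, 500, 2000]:  # number of random features
    W = rng.normal(size=(d, p)) / np.sqrt(d)
    Phi, Phi_test = relu_features(X, W), relu_features(X_test, W)
    # least-squares fit; minimum-norm once p >= n (past the interpolation threshold)
    c, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    print(f"p = {p:5d}   test MSE = {np.mean((Phi_test @ c - y_test) ** 2):.3f}")
```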

Definition 3 (Benign Overfitting) Benign overfitting means that a model can fit noisy training data exactly or nearly exactly while still achieving low test error.

Definition 4 (Implicit Bias) Implicit bias is the preference induced by the optimization method itself, even when no explicit regularizer is written into the objective.
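A minimal sketch of this, assuming underdetermined least squares, full-batch gradient descent, a zero initialization, and illustrative sizes: the objective below contains no penalty on \(\lVert w\rVert\), yet the iterates converge to the minimum-norm interpolating solution.

```python
# Sketch: gradient descent with no explicit regularizer still lands on the
# minimum-norm interpolator when started from zero (underdetermined least squares).
import numpy as np

rng = np.random.default_rng(3)
n, d = 30, 200
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

w = np.zeros(d)                          # zero initialization
lr = 0.5 / np.linalg.norm(X, 2) ** 2     # conservative step size for squared loss
for _ in range(5000):
    w -= lr * X.T @ (X @ w - y)          # plain gradient descent, no penalty term

w_min_norm = np.linalg.pinv(X) @ y       # the explicit minimum-norm interpolator

print("training residual:", np.linalg.norm(X @ w - y))                   # ~ 0
print("distance to min-norm solution:", np.linalg.norm(w - w_min_norm))  # ~ 0
```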

Theorem 1 (Theorem-Level Message) The first-pass modern lesson is not a single universal bound. It is that generalization can depend on an interaction among:

  • the hypothesis class
  • the data distribution
  • the optimization procedure
  • the specific interpolating or near-interpolating solution selected

So the right explanatory object is often more refined than parameter count alone.

7 Worked Example

Consider overparameterized linear regression with more parameters than samples.

There are often infinitely many vectors \(w\) that interpolate the training data:

\[ Xw = y. \]

So the real question is not just:

does interpolation happen?

but rather:

which interpolating solution is chosen?

Gradient descent started from zero tends to select the minimum-norm interpolating solution. In some data regimes, that selected solution can still predict well on fresh data.
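Why zero-initialized gradient descent lands on that particular solution can be sketched in two lines (assuming squared loss, a suitably small step size, and \(X\) with full row rank). The update

\[ w_{t+1} = w_t - \eta\, X^{\top}(X w_t - y) \]

keeps every iterate in the row space of \(X\) when \(w_0 = 0\). The solution set of \(Xw = y\) meets that row space in exactly one point,

\[ w_{\infty} = X^{\top}(X X^{\top})^{-1} y = \arg\min\{\lVert w\rVert_2 : Xw = y\}, \]

which is the minimum-norm interpolator.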

That simple example already shows the modern shift:

  • the hypothesis space is huge
  • exact fitting is possible
  • the optimizer chooses one special solution
  • generalization depends on that solution’s geometry and the data covariance, not just on raw model size

This is the kind of story behind many modern-regime results.

8 Computation Lens

Modern regimes force theory to pay attention to the training algorithm itself.

The questions become:

  • what solution does gradient descent find?
  • how do initialization and step size matter?
  • what norm, margin, or geometry is implicitly favored?
  • what happens once interpolation is reached?
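The second of these questions can already be made concrete in the underdetermined linear setting from the worked example (a sketch with assumed synthetic sizes): gradient descent converges to the interpolating solution closest to its initialization, so moving the starting point changes which interpolator gets selected.

```python
# Sketch: the initialization of gradient descent selects which interpolator it finds.
import numpy as np

rng = np.random.default_rng(4)
n, d = 30, 200
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
lr = 0.5 / np.linalg.norm(X, 2) ** 2

def run_gd(w0, steps=5000):
    w = w0.copy()
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y)
    return w

w_from_zero = run_gd(np.zeros(d))
w0 = 3.0 * rng.normal(size=d)
w_from_w0 = run_gd(w0)

# The interpolator closest to w0 in Euclidean norm: w0 + pinv(X) (y - X w0).
closest_to_w0 = w0 + np.linalg.pinv(X) @ (y - X @ w0)

print("both interpolate:",
      np.linalg.norm(X @ w_from_zero - y), np.linalg.norm(X @ w_from_w0 - y))
print("the two runs find different solutions:",
      np.linalg.norm(w_from_zero - w_from_w0))
print("GD from w0 matches its closest interpolator:",
      np.linalg.norm(w_from_w0 - closest_to_w0))
```

In this linear case, any sufficiently small step size reaches the same limit; the step size mainly controls speed and stability, while the initialization and the geometry of \(X\) govern which solution is selected.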

This is why modern generalization theory often sits at the intersection of:

  • optimization
  • probability
  • linear algebra
  • classical statistical learning theory

9 Application Lens

9.1 Double Descent

Double descent changed the public story around capacity. It showed that test error can behave non-monotonically across the interpolation threshold, so the old “bigger model means worse generalization” slogan is too crude.

9.2 Benign Overfitting

Benign overfitting tells us that fitting noisy data is not automatically catastrophic. Whether it is harmful depends on the selected interpolating solution and the data geometry.

9.3 Implicit Bias

Implicit bias explains why optimization is not just a computational detail. The optimizer can act like a hidden regularizer.

9.4 Reading Modern Theory Papers

When a paper in this area claims a generalization result, good first questions are:

  • what exactly is overparameterized?
  • what is interpolating?
  • what solution does the algorithm converge to?
  • what notion of effective complexity replaces raw parameter count?

10 Stop Here For First Pass

If you can now explain:

  • why modern regimes created a new generalization puzzle
  • what double descent, benign overfitting, and implicit bias are trying to capture
  • why parameter count alone is often too crude
  • why optimization and data geometry now sit inside the generalization story

then this page has done its job.

11 Go Deeper

After this page, the next natural directions are the adjacent pages of this module.

