Numerical Least Squares and Regularization

How overdetermined fitting problems become projection problems, why QR and SVD are safer computational viewpoints than blindly squaring the condition number through the normal equations, and how regularization stabilizes ill-posed least-squares problems.
Modified

April 26, 2026

Keywords

least squares, QR, SVD, regularization, Tikhonov

1 Role

This is the fourth page of the Numerical Methods module.

Its job is to explain how fitting problems of the form

\[ \min_x \|Ax-b\|_2 \]

turn into projection and factorization problems, and why regularization becomes necessary when the fitting problem is ill-conditioned or ill-posed.

2 First-Pass Promise

Read this page after Iterative Methods and Preconditioning.

If you stop here, you should still understand:

  • why least squares is about minimizing residual norm, not solving Ax=b exactly
  • why QR and SVD are the right main computational viewpoints
  • why the normal equations can make conditioning worse
  • why regularization stabilizes least-squares problems with weak or ambiguous information

3 Why It Matters

Least squares is one of the central computational forms in modern applied math.

It appears in:

  • linear regression
  • data fitting and inverse problems
  • overdetermined models from experiments
  • local quadratic approximations in optimization
  • many statistical and machine-learning pipelines

But the computational story is not just:

write down the normal equations and solve

That route can be fragile when the columns of A are nearly dependent or the data is noisy.

The numerical-methods viewpoint asks better questions:

  • what geometry does the residual minimization problem have?
  • what factorization preserves that geometry best?
  • how bad is the conditioning?
  • when should we regularize instead of chasing an unstable exact fit?

4 Prerequisite Recall

  • QR factorization is the orthogonality-friendly factorization from the previous linear-systems page
  • SVD gives a spectral view of matrix action and near-rank-deficiency
  • conditioning explains why small perturbations can create large changes in the fitted coefficients
  • regularization already appeared elsewhere on the site as a structural or stabilizing idea

5 Intuition

5.1 Least Squares As Projection

If the system is overdetermined, the vector b usually does not lie exactly in the column space of A.

So instead of solving Ax=b exactly, we look for the point Ax in the column space that is closest to b.

That is why least squares is a projection problem.

5.2 Why QR Is Better Than Blindly Squaring The Problem

The normal equations are

\[ A^TAx=A^Tb. \]

They are mathematically correct, but numerically they can be dangerous because the condition number is effectively squared.

QR avoids that by keeping the geometry orthogonal instead of collapsing it into A^TA.

5.3 Why SVD Matters

SVD shows which directions in the data are strong and which are weak.

Small singular values correspond to directions where coefficient estimates become unstable.

That is the numerical heart of ill-posedness in least squares.
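
A minimal NumPy sketch of this effect, using an illustrative matrix whose two columns are nearly dependent (the matrix, noise levels, and seed are assumptions for demonstration, not data from this page):

import numpy as np

# Illustrative setup: two columns that point in almost the same direction.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 50)
A = np.column_stack([t, t + 1e-6 * rng.standard_normal(50)])
b = A @ np.array([1.0, 1.0]) + 1e-4 * rng.standard_normal(50)

# The SVD exposes one strong and one weak direction.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print("singular values:", s)

# Least-squares solution through the SVD: x = V diag(1/s) U^T b.
# The 1/s factor is what amplifies noise along the weak direction.
x = Vt.T @ ((U.T @ b) / s)
print("coefficients:", x)   # typically far from the underlying [1, 1]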

5.4 Why Regularization Enters

If many coefficient vectors fit almost equally well, then small perturbations in the data can cause large swings in the solution.

Regularization says:

do not only fit the data; also prefer solutions with controlled size or structure

This trades a little bias for a lot of stability.

6 Formal Core

Definition 1 (Definition: Linear Least Squares) Given A \in \mathbb R^{m\times n} and b \in \mathbb R^m, the least-squares problem is

\[ \min_x \|Ax-b\|_2^2. \]

The goal is to minimize residual norm, not necessarily to make the residual zero.

Theorem 1 (Theorem Idea: Least Squares Means Orthogonal Projection) At a least-squares solution, the residual is orthogonal to the column space of A.

Equivalently,

\[ A^T(Ax-b)=0. \]

These are the normal equations.
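
One way to see this algebraically: the gradient of the squared residual is

\[ \nabla_x \|Ax-b\|_2^2 = 2A^T(Ax-b), \]

so setting the gradient to zero gives exactly the orthogonality condition above.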

Theorem 2 (Theorem Idea: Normal Equations Are Exact But Can Be Numerically Risky) The normal equations reduce least squares to a square linear system in A^TA, but this can worsen conditioning because

\[ \kappa_2(A^TA)=\kappa_2(A)^2 \]

in the Euclidean norm whenever A has full column rank.

So the normal equations are often a poor first computational method when conditioning is already a concern.
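
A small NumPy check of this squaring effect (the monomial design matrix below is an illustrative choice, not the page's running example):

import numpy as np

# A mildly ill-conditioned design matrix: monomials 1, t, ..., t^5 on [0, 1].
t = np.linspace(0.0, 1.0, 20)
A = np.vander(t, 6, increasing=True)

print("cond(A)     =", np.linalg.cond(A))
print("cond(A^T A) =", np.linalg.cond(A.T @ A))   # close to cond(A)**2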

Definition 2 (Definition: QR View Of Least Squares) If

\[ A=QR \]

with Q having orthonormal columns and R upper triangular, then the least-squares problem becomes

\[ \min_x \|Rx-Q^Tb\|_2, \]

since the part of b outside the column space of A only adds a constant to the residual. The minimizer solves the triangular system Rx=Q^Tb, and this is why QR is a standard stable route for dense least-squares problems.
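
A minimal sketch of the QR route in NumPy (the small line-fitting matrix is an illustrative assumption; a production solver would use a dedicated triangular solve):

import numpy as np

# Illustrative overdetermined system: fit c0 + c1 * t to three data points.
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])

# Thin QR: Q has orthonormal columns, R is upper triangular.
Q, R = np.linalg.qr(A)
x = np.linalg.solve(R, Q.T @ b)   # in practice, back substitution on R
print(x)

# Cross-check against the library least-squares solver.
print(np.linalg.lstsq(A, b, rcond=None)[0])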

Definition 3 (Definition: Tikhonov Or Ridge Regularization) A standard regularized least-squares problem is

\[ \min_x \|Ax-b\|_2^2 + \lambda \|x\|_2^2, \]

where \lambda > 0 controls the strength of regularization.

Regularization damps unstable directions and prefers smaller-norm solutions.
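
Setting the gradient of the regularized objective to zero gives (A^TA+\lambda I)x=A^Tb, and one stable way to compute the same solution is the augmented least-squares form sketched below (the function name ridge and the stacking construction are illustrative, not the only implementation):

import numpy as np

def ridge(A, b, lam):
    # Minimize ||Ax - b||^2 + lam * ||x||^2 by stacking sqrt(lam) * I under A,
    # which keeps everything in ordinary least-squares form and avoids
    # explicitly forming A^T A.
    n = A.shape[1]
    A_aug = np.vstack([A, np.sqrt(lam) * np.eye(n)])
    b_aug = np.concatenate([b, np.zeros(n)])
    return np.linalg.lstsq(A_aug, b_aug, rcond=None)[0]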

7 A Small Worked Example

Consider

\[ A= \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}, \qquad b= \begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix}. \]

This is an overdetermined system: one parameter, three equations.

We seek the scalar x minimizing

\[ \|Ax-b\|_2^2 = (x-1)^2+(2x-2)^2+(3x-2)^2. \]

The normal equation is

\[ A^TAx=A^Tb. \]

Here,

\[ A^TA = 1^2+2^2+3^2 = 14, \qquad A^Tb = 1\cdot 1 + 2\cdot 2 + 3\cdot 2 = 11. \]

So the least-squares solution is

\[ x=\frac{11}{14}. \]

The fitted vector is

\[ Ax= \begin{bmatrix} 11/14 \\ 22/14 \\ 33/14 \end{bmatrix}, \]

which is the projection of b onto the line spanned by the column of A.
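
A quick numerical check of this example with NumPy's built-in least-squares solver:

import numpy as np

A = np.array([[1.0], [2.0], [3.0]])
b = np.array([1.0, 2.0, 2.0])

x, residual, rank, sv = np.linalg.lstsq(A, b, rcond=None)
print(x[0], 11 / 14)   # both approximately 0.785714
print(A @ x)           # the projection of b onto the column of A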

Now imagine the columns of A were nearly dependent in a multivariate problem. Then small perturbations in b or A could move the coefficient vector much more dramatically, even if the fitted residual stays small. That is the setting where QR, SVD, and regularization become essential rather than optional.
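
A small sketch of that sensitivity, assuming an illustrative two-column matrix with nearly dependent columns and a tiny perturbation of b:

import numpy as np

rng = np.random.default_rng(2)
t = np.linspace(0.0, 1.0, 40)
A = np.column_stack([t, t + 1e-7 * rng.standard_normal(40)])   # nearly dependent columns
b = t.copy()

x1 = np.linalg.lstsq(A, b, rcond=None)[0]
x2 = np.linalg.lstsq(A, b + 1e-6 * rng.standard_normal(40), rcond=None)[0]

print("change in fitted values:", np.linalg.norm(A @ (x1 - x2)))   # stays tiny
print("change in coefficients: ", np.linalg.norm(x1 - x2))         # can be far larger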

8 Computation Lens

When you face a least-squares problem, ask:

  1. is this problem well-conditioned, or are the columns nearly dependent?
  2. should I think in terms of projection geometry, QR, or SVD rather than normal equations?
  3. do I care about a stable coefficient vector, a good prediction fit, or both?
  4. is regularization stabilizing an ambiguous problem, or encoding real prior structure I want to keep?

Those questions usually matter more than whether the objective function looks simple on paper.

9 Application Lens

9.1 Statistics And Regression

Least squares is the computational core behind linear regression, but numerical conditioning decides whether coefficient estimates are trustworthy.

9.2 Inverse Problems

Many inverse problems are least-squares problems with weak information in some directions, which is exactly why regularization becomes central.

9.3 Optimization

Quadratic models, Gauss-Newton methods, and many local approximation schemes repeatedly solve least-squares subproblems.

10 Stop Here For First Pass

If you can now explain:

  • why least squares is a projection problem
  • why QR or SVD is usually a better numerical viewpoint than blindly using the normal equations
  • why weak singular directions create instability
  • why regularization trades a little bias for a more stable and meaningful solution

then this page has done its job.

11 Go Deeper

After this page, the next natural step is:

The strongest adjacent pages are:

12 Optional Deeper Reading After First Pass

The strongest current references connected to this page are:

13 Sources and Further Reading

  • Cornell CS4220: least squares and regularization - First pass - official current notes on ill-posedness and regularization in least squares. Checked 2026-04-25.
  • Cornell CS4220 schedule - First pass - official current schedule placing least squares and regularization in the core numerical-analysis progression. Checked 2026-04-25.
  • MIT 18.085 least squares lecture - First pass - official MIT lecture notes for QR-based least-squares computation. Checked 2026-04-25.
  • Cornell CS6210 least squares notes - Second pass - official current notes connecting least squares to matrix-computation and statistical viewpoints. Checked 2026-04-25.
  • Stanford CS137 syllabus - Second pass - official current syllabus connecting linear least squares, SVD, and regularization inside a scientific-computing sequence. Checked 2026-04-25.
  • MIT 18.335J resource index - Second pass - official resource map showing least squares and generalized SVD as part of the course arc. Checked 2026-04-25.