Numerical Least Squares and Regularization
least squares, QR, SVD, regularization, Tikhonov
1 Role
This is the fourth page of the Numerical Methods module.
Its job is to explain how fitting problems of the form
\[ \min_x \|Ax-b\|_2 \]
turn into projection and factorization problems, and why regularization becomes necessary when the fitting problem is ill-conditioned or ill-posed.
2 First-Pass Promise
Read this page after Iterative Methods and Preconditioning.
If you stop here, you should still understand:
- why least squares is about minimizing residual norm, not solving Ax=b exactly
- why QR and SVD are the right computational viewpoints
- why normal equations can make conditioning worse
- why regularization stabilizes least-squares problems with weak or ambiguous information
3 Why It Matters
Least squares is one of the central computational forms in modern applied math.
It appears in:
- linear regression
- data fitting and inverse problems
- overdetermined models from experiments
- local quadratic approximations in optimization
- many statistical and machine-learning pipelines
But the computational story is not just:
write down the normal equations and solve
That route can be fragile when the columns of A are nearly dependent or the data is noisy.
Numerical methods asks better questions:
- what geometry does the residual minimization problem have?
- what factorization preserves that geometry best?
- how bad is the conditioning?
- when should we regularize instead of chasing an unstable exact fit?
4 Prerequisite Recall
- QR factorization is the orthogonality-friendly factorization from the previous linear-systems page
- SVD gives a spectral view of matrix action and near-rank-deficiency
- conditioning explains why small perturbations can create large changes in the fitted coefficients
- regularization already appeared elsewhere on the site as a structural or stabilizing idea
5 Intuition
5.1 Least Squares As Projection
If the system is overdetermined, the vector b usually does not lie exactly in the column space of A.
So instead of solving Ax=b exactly, we look for the point Ax in the column space that is closest to b.
That is why least squares is a projection problem.
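To make the projection picture concrete, here is a minimal NumPy sketch (the small matrix and right-hand side are made up for illustration): the least-squares fit Ax and the explicit orthogonal projection of b onto the column space coincide.

import numpy as np

# Small overdetermined system: 3 equations, 2 unknowns (illustrative data).
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 2.0, 2.0])

# Least-squares fit: A @ x_ls is the closest point to b inside col(A).
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)

# The same point, written as an explicit orthogonal projection of b.
P = A @ np.linalg.pinv(A)              # projector onto the column space of A
print(np.allclose(A @ x_ls, P @ b))    # True: least squares = projection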
5.2 Why QR Is Better Than Blindly Squaring The Problem
The normal equations are
\[ A^TAx=A^Tb. \]
They are mathematically correct, but numerically they can be dangerous because the condition number is effectively squared.
QR avoids that by keeping the geometry orthogonal instead of collapsing it into A^TA.
5.3 Why SVD Matters
SVD shows which directions in the data are strong and which are weak.
Small singular values correspond to directions where coefficient estimates become unstable.
That is the numerical heart of ill-posedness in least squares.
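A small experiment makes the instability visible. This is a hedged sketch with synthetic data (NumPy assumed): a matrix with one tiny singular value turns noise of size about 1e-8 in b into a coefficient error that is orders of magnitude larger.

import numpy as np

rng = np.random.default_rng(1)

# Build a matrix with one very weak direction (smallest singular value 1e-6).
U, _ = np.linalg.qr(rng.standard_normal((100, 3)))
V, _ = np.linalg.qr(rng.standard_normal((3, 3)))
A = U @ np.diag([1.0, 0.5, 1e-6]) @ V.T

x_true = np.array([1.0, 1.0, 1.0])
b = A @ x_true

# Tiny noise in b is amplified along the weak singular direction.
b_noisy = b + 1e-8 * rng.standard_normal(100)
x_noisy, *_ = np.linalg.lstsq(A, b_noisy, rcond=None)
print(np.linalg.norm(x_noisy - x_true))   # large compared to the noise level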
5.4 Why Regularization Enters
If many coefficient vectors fit almost equally well, then small perturbations in the data can cause large swings in the solution.
Regularization says:
do not only fit the data; also prefer solutions with controlled size or structure
This trades a little bias for a lot of stability.
6 Formal Core
Definition 1 (Definition: Linear Least Squares) Given A \in \mathbb R^{m\times n} and b \in \mathbb R^m, the least-squares problem is
\[ \min_x \|Ax-b\|_2^2. \]
The goal is to minimize residual norm, not necessarily to make the residual zero.
Theorem 1 (Theorem Idea: Least Squares Means Orthogonal Projection) At a least-squares solution, the residual is orthogonal to the column space of A.
Equivalently,
\[ A^T(Ax-b)=0. \]
These are the normal equations.
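As a quick numerical check (a sketch with random data, NumPy assumed): the residual returned by a least-squares solver is orthogonal to the columns of A, and for a well-conditioned A the normal equations reproduce the same solution.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 3))
b = rng.standard_normal(50)

x, *_ = np.linalg.lstsq(A, b, rcond=None)

# Orthogonality of the residual to col(A), i.e. the normal equations hold.
print(np.allclose(A.T @ (A @ x - b), 0.0, atol=1e-10))   # True

# Solving the normal equations directly gives the same x here,
# because this random A is well-conditioned.
x_ne = np.linalg.solve(A.T @ A, A.T @ b)
print(np.allclose(x, x_ne))   # True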
Theorem 2 (Theorem Idea: Normal Equations Are Exact But Can Be Numerically Risky) The normal equations reduce least squares to a square linear system in A^TA, but this can worsen conditioning because
\[ \kappa(A^TA)\approx \kappa(A)^2 \]
in the Euclidean setting.
So the normal equations are often a poor first computational method when conditioning is already a concern.
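The squaring effect is easy to observe directly. In this small sketch (illustrative nearly dependent columns, NumPy assumed), the condition number of A^TA comes out roughly the square of the condition number of A.

import numpy as np

# Two nearly parallel columns make A ill-conditioned.
eps = 1e-6
A = np.array([[1.0, 1.0],
              [1.0, 1.0 + eps],
              [1.0, 1.0 - eps]])

kappa_A   = np.linalg.cond(A)
kappa_AtA = np.linalg.cond(A.T @ A)
print(kappa_A, kappa_AtA)   # cond(A.T @ A) is roughly cond(A)**2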
Definition 2 (Definition: QR View Of Least Squares) If
\[ A=QR \]
with Q having orthonormal columns, then the least-squares problem becomes
\[ \min_x \|Rx-Q^Tb\|_2. \]
The two problems share the same minimizer, because the component of b orthogonal to the column space cannot be reduced by any choice of x. This is why QR is a standard stable route for dense least-squares problems.
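A minimal sketch of the QR route follows (NumPy for the factorization, SciPy assumed available for the triangular solve; the data is random and purely illustrative).

import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(2)
A = rng.standard_normal((20, 4))
b = rng.standard_normal(20)

# Thin QR factorization: A = QR with Q (20 x 4) having orthonormal columns.
Q, R = np.linalg.qr(A)

# Least squares reduces to the small upper-triangular system R x = Q^T b.
x_qr = solve_triangular(R, Q.T @ b)

# Same answer as NumPy's reference least-squares solver.
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x_qr, x_ref))   # True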
Definition 3 (Definition: Tikhonov Or Ridge Regularization) A standard regularized least-squares problem is
\[ \min_x \|Ax-b\|_2^2 + \lambda \|x\|_2^2, \]
where \lambda > 0 controls the strength of regularization.
Regularization damps unstable directions and prefers smaller-norm solutions.
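One standard way to compute the Tikhonov solution without ever forming A^TA is to rewrite the regularized problem as an ordinary least-squares problem with A stacked on top of sqrt(lambda) times the identity. Below is a hedged NumPy sketch of that idea; the function name ridge and the test data are illustrative, not a fixed API.

import numpy as np

def ridge(A, b, lam):
    """Minimize ||Ax - b||^2 + lam * ||x||^2 via an augmented least-squares problem."""
    n = A.shape[1]
    A_aug = np.vstack([A, np.sqrt(lam) * np.eye(n)])
    b_aug = np.concatenate([b, np.zeros(n)])
    x, *_ = np.linalg.lstsq(A_aug, b_aug, rcond=None)
    return x

rng = np.random.default_rng(3)
A = rng.standard_normal((30, 5))
b = rng.standard_normal(30)
for lam in [0.0, 0.1, 10.0]:
    # Larger lam prefers smaller-norm coefficient vectors.
    print(lam, np.linalg.norm(ridge(A, b, lam)))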
7 A Small Worked Example
Consider
\[ A= \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}, \qquad b= \begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix}. \]
This is an overdetermined system: one parameter, three equations.
We seek the scalar x minimizing
\[ \|Ax-b\|_2^2 = (x-1)^2+(2x-2)^2+(3x-2)^2. \]
The normal equation is
\[ A^TAx=A^Tb. \]
Here,
\[ A^TA = 1^2+2^2+3^2 = 14, \qquad A^Tb = 1\cdot 1 + 2\cdot 2 + 3\cdot 2 = 11. \]
So the least-squares solution is
\[ x=\frac{11}{14}. \]
The fitted vector is
\[ Ax= \begin{bmatrix} 11/14 \\ 22/14 \\ 33/14 \end{bmatrix}, \]
which is the projection of b onto the line spanned by the column of A.
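The same numbers can be checked in a few lines of NumPy (a sketch, assuming NumPy is available): the solver returns 11/14 and the fitted vector matches the projection computed above.

import numpy as np

A = np.array([[1.0], [2.0], [3.0]])
b = np.array([1.0, 2.0, 2.0])

x, *_ = np.linalg.lstsq(A, b, rcond=None)
print(x)         # [0.7857...] = 11/14
print(A @ x)     # [11/14, 22/14, 33/14], the projection of b onto col(A)
print(11 / 14)   # 0.7857142857142857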
Now imagine the columns of A were nearly dependent in a multivariate problem. Then small perturbations in b or A could move the coefficient vector much more dramatically, even if the fitted residual stays small. That is the setting where QR, SVD, and regularization become essential rather than optional.
8 Computation Lens
When you face a least-squares problem, ask:
- is this problem well-conditioned, or are the columns nearly dependent?
- should I think in terms of projection geometry, QR, or SVD rather than normal equations?
- do I care about a stable coefficient vector, a good prediction fit, or both?
- is regularization stabilizing an ambiguous problem, or encoding real prior structure I want to keep?
Those questions usually matter more than whether the objective function looks simple on paper.
9 Application Lens
9.1 Statistics And Regression
Least squares is the computational core behind linear regression, but numerical conditioning decides whether coefficient estimates are trustworthy.
9.2 Inverse Problems
Many inverse problems are least-squares problems with weak information in some directions, which is exactly why regularization becomes central.
9.3 Optimization
Quadratic models, Gauss-Newton methods, and many local approximation schemes repeatedly solve least-squares subproblems.
10 Stop Here For First Pass
If you can now explain:
- why least squares is a projection problem
- why QR or SVD is usually a better numerical viewpoint than blindly using normal equations
- why weak singular directions create instability
- why regularization trades a little bias for a more stable and meaningful solution
then this page has done its job.
11 Go Deeper
After this page, the natural next step is the optional deeper reading collected below.
12 Optional Deeper Reading After First Pass
The strongest current references connected to this page are:
- Cornell CS4220: least squares and regularization - official current notes on ill-posedness and regularization in least squares. Checked 2026-04-25.
- Cornell CS4220 schedule - official current schedule showing the least-squares to regularization progression. Checked 2026-04-25.
- MIT 18.085 least squares lecture - official MIT lecture notes with QR-based least-squares computation. Checked 2026-04-25.
- Cornell CS6210 least squares notes - official current matrix-computations notes connecting least squares, residual geometry, and factorization choices. Checked 2026-04-25.
- Stanford CS137 syllabus - official current syllabus showing the standard numerical route through linear least squares, SVD, and regularization. Checked 2026-04-25.
- MIT 18.335J resource index - official MIT course resource map showing linear regression and generalized SVD as part of the numerical-methods storyline. Checked 2026-04-25.
13 Sources and Further Reading
- Cornell CS4220: least squares and regularization - First pass - official current notes on ill-posedness and regularization in least squares. Checked 2026-04-25.
- Cornell CS4220 schedule - First pass - official current schedule placing least squares and regularization in the core numerical-analysis progression. Checked 2026-04-25.
- MIT 18.085 least squares lecture - First pass - official MIT lecture notes for QR-based least-squares computation. Checked 2026-04-25.
- Cornell CS6210 least squares notes - Second pass - official current notes connecting least squares to matrix-computation and statistical viewpoints. Checked 2026-04-25.
- Stanford CS137 syllabus - Second pass - official current syllabus connecting linear least squares, SVD, and regularization inside a scientific-computing sequence. Checked 2026-04-25.
- MIT 18.335J resource index - Second pass - official resource map showing least squares and generalized SVD as part of the course arc. Checked 2026-04-25.