Orthogonality and Least Squares

How orthogonality turns approximation into projection and makes linear regression, residual analysis, and numerical least squares fit together.
Modified: April 26, 2026

Keywords

orthogonality, projection, least squares, regression, normal equations

1 Role

This page is the bridge from geometric linear algebra to estimation, optimization, and numerical computation.

Orthogonality tells you what it means to be the best approximation inside a subspace. Least squares turns that geometric idea into one of the most reused tools in statistics, machine learning, signal processing, and scientific computing.

2 First-Pass Promise

Read this page after Subspaces, Basis, and Dimension.

If you stop here, you should still understand:

  • why least squares is a projection problem
  • what the normal equations say
  • how one complete worked example behaves
  • how the topic connects to regression and numerical computation

3 Why It Matters

This topic matters because it is one of the first places where the same mathematical object supports all of these at once:

  • a geometric statement: \(\text{best approximation} = \text{orthogonal projection}\)
  • a computational problem: solve an overdetermined system stably
  • a modeling problem: fit parameters to data
  • a research bridge: understand sketching, inverse problems, and overparameterized regression

4 Prerequisite Recall

  • the column space of a matrix \(A\) is the set of vectors of the form \(Ax\)
  • a vector \(r\) is orthogonal to a subspace \(W\) if it is orthogonal to every vector in \(W\)
  • projecting \(b\) onto a subspace means splitting it into \(b = p + r\), where \(p\) lies in the subspace and \(r\) is orthogonal to it

5 Intuition

When \(Ax = b\) has no exact solution, the vectors you can actually produce are only the vectors in the column space of \(A\).

So the right question is no longer:

Can I hit \(b\) exactly?

It becomes:

Which vector in the column space of \(A\) gets closest to \(b\)?

That closest vector is the orthogonal projection of \(b\) onto \(\operatorname{col}(A)\).

The residual

\[ r = b - A\hat{x} \]

is not arbitrary noise. At the optimum it points exactly in the direction that the column space cannot explain. That is why the residual is orthogonal to every column of \(A\).
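
This orthogonality is easy to check numerically. Below is a minimal sketch, assuming NumPy, that solves a random overdetermined system and confirms that the residual is orthogonal to every column of \(A\):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 3))   # tall matrix: more equations than unknowns
b = rng.standard_normal(100)        # a target that col(A) cannot hit exactly

x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
r = b - A @ x_hat                   # residual b - A x_hat

# A^T r should vanish up to floating-point roundoff.
print(np.abs(A.T @ r).max())        # ~1e-14
```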

6 Formal Core

Definition 1 (Linear Least-Squares Problem) For a matrix \(A \in \mathbb{R}^{m \times n}\) and a data vector \(b \in \mathbb{R}^m\), the linear least-squares problem is

\[ \hat{x} \in \arg\min_{x \in \mathbb{R}^n} \|Ax - b\|_2^2. \]

The fitted vector is \(A\hat{x}\) and the residual is \(r = b - A\hat{x}\).

The key geometric fact is that \(A\hat{x}\) must be the orthogonal projection of \(b\) onto \(\operatorname{col}(A)\).

Proposition 1 (Projection Principle) A vector \(\hat{x}\) solves the least-squares problem if and only if the residual \(r = b - A\hat{x}\) is orthogonal to the column space of \(A\).

Equivalently,

\[ A^\top (b - A\hat{x}) = 0. \]

Rearranging gives the normal equations:

\[ A^\top A \hat{x} = A^\top b. \tag{1}\]

If the columns of \(A\) are linearly independent, then \(A^\top A\) is invertible and the least-squares solution is unique:

\[ \hat{x} = (A^\top A)^{-1} A^\top b. \]

This closed form is conceptually useful, but in practice it is usually better to solve least squares through QR or SVD, not by explicitly forming \((A^\top A)^{-1}\).
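
As a sanity check on the algebra, here is a short sketch (assuming NumPy) that computes \(\hat{x}\) three ways on a well-conditioned, full-column-rank problem: via the normal equations, via QR, and via the library least-squares routine. All three agree:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 4))   # full column rank with probability 1
b = rng.standard_normal(50)

# Route 1: normal equations (solve the system; avoid forming the inverse).
x_ne = np.linalg.solve(A.T @ A, A.T @ b)

# Route 2: reduced QR, A = QR, so R x = Q^T b.
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)

# Route 3: library routine (SVD-based in NumPy).
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_ne, x_qr), np.allclose(x_ne, x_ls))  # True True
```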

7 Worked Example

Fit a line \(y \approx \beta_0 + \beta_1 t\) through the three points \((0,1)\), \((1,2)\), and \((2,2)\).

Write this as \(A\beta \approx b\), with

\[ A = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{bmatrix}, \qquad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}, \qquad b = \begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix}. \]

Then

\[ A^\top A = \begin{bmatrix} 3 & 3 \\ 3 & 5 \end{bmatrix}, \qquad A^\top b = \begin{bmatrix} 5 \\ 6 \end{bmatrix}. \]

Solving Equation 1 gives

\[ \hat{\beta} = \begin{bmatrix} 7/6 \\ 1/2 \end{bmatrix}. \]

So the least-squares line is

\[ \hat{y}(t) = \frac{7}{6} + \frac{1}{2} t. \]

The fitted values are

\[ A\hat{\beta} = \begin{bmatrix} 7/6 \\ 5/3 \\ 13/6 \end{bmatrix}, \qquad r = b - A\hat{\beta} = \begin{bmatrix} -1/6 \\ 1/3 \\ -1/6 \end{bmatrix}. \]

Now check the orthogonality conditions:

\[ \mathbf{1}^\top r = 0, \qquad \begin{bmatrix} 0 & 1 & 2 \end{bmatrix} r = 0. \]

The residual is orthogonal to both columns of \(A\), so it is orthogonal to the whole column space. That is exactly the projection principle at work.
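
The whole example fits in a few lines of NumPy, if you want to reproduce it:

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 2.0, 2.0])

# Solve the normal equations A^T A beta = A^T b.
beta_hat = np.linalg.solve(A.T @ A, A.T @ b)
print(beta_hat)            # [1.16666667 0.5], i.e. [7/6, 1/2]

r = b - A @ beta_hat
print(r)                   # [-1/6, 1/3, -1/6] up to roundoff
print(A.T @ r)             # ~[0, 0]: orthogonal to both columns
```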

8 Computation Lens

The mathematics says \(A^\top A \hat{x} = A^\top b\), but numerical linear algebra immediately asks a second question:

What is a numerically stable way to solve it?

Three levels matter:

  • normal equations: conceptually clean, but can square the condition number
  • QR factorization: the standard stable method for many full-rank problems
  • SVD: more expensive, but the safest when rank deficiency or near-rank deficiency matters

This is why least squares is not only a modeling tool. It is also a gateway to numerical thinking: conditioning, orthogonalization, and low-rank structure all show up here.
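
To see the conditioning point concretely, here is a small sketch (NumPy assumed) with two nearly collinear columns. Forming \(A^\top A\) roughly squares the condition number, which is exactly what QR avoids by working with \(A\) directly:

```python
import numpy as np

t = np.linspace(0.0, 1.0, 50)
eps = 1e-4
A = np.column_stack([t, t + eps * np.sin(t)])  # two almost-identical columns

print(np.linalg.cond(A))        # large
print(np.linalg.cond(A.T @ A))  # roughly the square of the above
```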

9 Application Lens

In linear regression, the design matrix \(X\) plays the role of \(A\), the coefficient vector \(\beta\) plays the role of \(x\), and the target vector \(y\) plays the role of \(b\).

The fitted prediction \(X\hat{\beta}\) is the orthogonal projection of \(y\) onto the column space of \(X\).

That one sentence explains several practical facts:

  • residuals are orthogonal to every feature column
  • with an intercept column, residuals sum to zero
  • near-collinearity can make coefficient recovery and normal-equation solves unstable
  • overparameterized models push attention toward minimum-norm or regularized solutions
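
The first two facts are easy to verify numerically. Here is a quick check, assuming NumPy and a made-up design matrix with an intercept column and two random features:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X = np.column_stack([np.ones(n),               # intercept column
                     rng.standard_normal(n),   # feature 1
                     rng.standard_normal(n)])  # feature 2
y = 1.0 + 2.0 * X[:, 1] - 0.5 * X[:, 2] + rng.standard_normal(n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
r = y - X @ beta_hat

print(np.abs(X.T @ r).max())  # ~0: orthogonal to every feature column
print(r.sum())                # ~0: the intercept column forces zero-sum residuals
```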

For a fuller application walkthrough, see Linear Regression Through Projection. For an interactive picture, see Computation Lab: Projection Geometry and Regression Residuals.

10 Common Mistakes

  • thinking the normal equations are automatically the best numerical algorithm
  • forgetting that \(A^\top A\) is invertible only when the columns of \(A\) are linearly independent
  • confusing projection onto the column space with projection onto the row space
  • treating the residual as an arbitrary error vector instead of a vector forced into the orthogonal complement
  • using least squares as a default model choice even when the data are highly nonlinear, badly contaminated, or structurally constrained

11 Exercises

  1. Show that if the columns of \(A\) are orthonormal, then the least-squares solution is \(\hat{x} = A^\top b\).
  2. In the worked example above, verify directly that the residual is orthogonal to every vector in \(\operatorname{col}(A)\).
  3. Explain why an intercept term in linear regression forces the residuals to sum to zero.

12 Stop Here For First Pass

If you can now explain:

  • why best approximation becomes orthogonal projection
  • why the normal equations encode residual orthogonality
  • how the worked example connects to regression

then this page has done its main job.

13 Go Deeper

Open one of these only if you want a different learning mode:

  1. Projection Theorem and Normal Equations for the full derivation
  2. Linear Regression Through Projection for the cleanest application framing
  3. Computation Lab: Projection Geometry and Regression Residuals for visual and numerical intuition

14 Sources and Further Reading

Sources checked online on 2026-04-24:

  • MIT 18.06SC Linear Algebra resource index
  • MIT 18.06 Linear Algebra syllabus
  • Stanford Math 51
  • Hefferon, Linear Algebra
  • MIT 2.086 Unit 3 notes
  • JMLR 2016 randomized sketching paper
  • JMLR 2016 iterative Hessian sketch paper
  • JMLR 2023 benign overfitting paper