Orthogonality and Least Squares

How orthogonality turns approximation into projection and makes linear regression, residual analysis, and numerical least squares fit together.
Modified: April 26, 2026

Keywords

orthogonality, projection, least squares, regression, normal equations

1 Role

This page is the bridge from geometric linear algebra to estimation, optimization, and numerical computation.

Orthogonality tells you what it means to be the best approximation inside a subspace. Least squares turns that geometric idea into one of the most reused tools in statistics, machine learning, signal processing, and scientific computing.

2 First-Pass Promise

Read this page after Subspaces, Basis, and Dimension.

If you stop here, you should still understand:

  • why least squares is a projection problem
  • what the normal equations say
  • how one complete worked example behaves
  • how the topic connects to regression and numerical computation

3 Why It Matters

This topic matters because it is one of the first places where the same mathematical object supports all of these at once:

  • a geometric statement: \(\text{best approximation} = \text{orthogonal projection}\)
  • a computational problem: solve an overdetermined system stably
  • a modeling problem: fit parameters to data
  • a research bridge: understand sketching, inverse problems, and overparameterized regression

4 Prerequisite Recall

  • the column space of a matrix \(A\) is the set of vectors of the form \(Ax\)
  • a vector \(r\) is orthogonal to a subspace \(W\) if it is orthogonal to every vector in \(W\)
  • projecting \(b\) onto a subspace means splitting it into \(b = p + r\), where \(p\) lies in the subspace and \(r\) is orthogonal to it

5 Intuition

When \(Ax = b\) has no exact solution, the vectors you can actually produce are only the vectors in the column space of \(A\).

So the right question is no longer:

Can I hit \(b\) exactly?

It becomes:

Which vector in the column space of \(A\) gets closest to \(b\)?

That closest vector is the orthogonal projection of \(b\) onto \(\operatorname{col}(A)\).

The residual

\[ r = b - A\hat{x} \]

is not arbitrary noise. At the optimum it points exactly in the direction that the column space cannot explain. That is why the residual is orthogonal to every column of \(A\).
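
This orthogonality is easy to check numerically. Below is a minimal sketch, assuming NumPy, that solves a random overdetermined system and confirms that the residual is orthogonal to every column of \(A\):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 3))   # tall matrix: more equations than unknowns
b = rng.standard_normal(100)        # a target that col(A) cannot hit exactly

x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
r = b - A @ x_hat                   # residual b - A x_hat

# A^T r should vanish up to floating-point roundoff.
print(np.abs(A.T @ r).max())        # ~1e-14
```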

6 Formal Core

Definition 1 (Linear Least-Squares Problem) For a matrix \(A \in \mathbb{R}^{m \times n}\) and a data vector \(b \in \mathbb{R}^m\), the linear least-squares problem is

\[ \hat{x} \in \arg\min_{x \in \mathbb{R}^n} \|Ax - b\|_2^2. \]

The fitted vector is \(A\hat{x}\) and the residual is \(r = b - A\hat{x}\).

The key geometric fact is that \(A\hat{x}\) must be the orthogonal projection of \(b\) onto \(\operatorname{col}(A)\).

Proposition 1 (Projection Principle) A vector \(\hat{x}\) solves the least-squares problem if and only if the residual \(r = b - A\hat{x}\) is orthogonal to the column space of \(A\).

Equivalently,

\[ A^\top (b - A\hat{x}) = 0. \]

Rearranging gives the normal equations:

\[ A^\top A \hat{x} = A^\top b. \tag{1}\]

If the columns of \(A\) are linearly independent, then \(A^\top A\) is invertible and the least-squares solution is unique:

\[ \hat{x} = (A^\top A)^{-1} A^\top b. \]

This closed form is conceptually useful, but in practice it is usually better to solve least squares through QR or SVD, not by explicitly forming \((A^\top A)^{-1}\).
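
As a sanity check on the algebra, here is a short sketch (assuming NumPy) that computes \(\hat{x}\) three ways on a well-conditioned, full-column-rank problem: via the normal equations, via QR, and via the library least-squares routine. All three agree:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 4))   # full column rank with probability 1
b = rng.standard_normal(50)

# Route 1: normal equations (solve the system; avoid forming the inverse).
x_ne = np.linalg.solve(A.T @ A, A.T @ b)

# Route 2: reduced QR, A = QR, so R x = Q^T b.
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)

# Route 3: library routine (SVD-based in NumPy).
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_ne, x_qr), np.allclose(x_ne, x_ls))  # True True
```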

7 Worked Example

Fit a line \(y \approx \beta_0 + \beta_1 t\) through the three points \((0,1)\), \((1,2)\), and \((2,2)\).

Write this as \(A\beta \approx b\), with

\[ A = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{bmatrix}, \qquad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}, \qquad b = \begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix}. \]

Then

\[ A^\top A = \begin{bmatrix} 3 & 3 \\ 3 & 5 \end{bmatrix}, \qquad A^\top b = \begin{bmatrix} 5 \\ 6 \end{bmatrix}. \]

Solving Equation 1 gives

\[ \hat{\beta} = \begin{bmatrix} 7/6 \\ 1/2 \end{bmatrix}. \]

So the least-squares line is

\[ \hat{y}(t) = \frac{7}{6} + \frac{1}{2} t. \]

The fitted values are

\[ A\hat{\beta} = \begin{bmatrix} 7/6 \\ 5/3 \\ 13/6 \end{bmatrix}, \qquad r = b - A\hat{\beta} = \begin{bmatrix} -1/6 \\ 1/3 \\ -1/6 \end{bmatrix}. \]

Now check the orthogonality conditions:

\[ \mathbf{1}^\top r = 0, \qquad \begin{bmatrix} 0 & 1 & 2 \end{bmatrix} r = 0. \]

The residual is orthogonal to both columns of \(A\), so it is orthogonal to the whole column space. That is exactly the projection principle at work.
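
The whole example fits in a few lines of NumPy, if you want to reproduce it:

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 2.0, 2.0])

# Solve the normal equations A^T A beta = A^T b.
beta_hat = np.linalg.solve(A.T @ A, A.T @ b)
print(beta_hat)            # [1.16666667 0.5], i.e. [7/6, 1/2]

r = b - A @ beta_hat
print(r)                   # [-1/6, 1/3, -1/6] up to roundoff
print(A.T @ r)             # ~[0, 0]: orthogonal to both columns
```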

8 Computation Lens

The mathematics says \(A^\top A \hat{x} = A^\top b\), but numerical linear algebra immediately asks a second question:

What is a numerically stable way to solve it?

Three levels matter:

  • normal equations: conceptually clean, but can square the condition number
  • QR factorization: the standard stable method for many full-rank problems
  • SVD: more expensive, but the safest when rank deficiency or near-rank deficiency matters

This is why least squares is not only a modeling tool. It is also a gateway to numerical thinking: conditioning, orthogonalization, and low-rank structure all show up here.
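
To see the conditioning point concretely, here is a small sketch (NumPy assumed) with two nearly collinear columns. Forming \(A^\top A\) roughly squares the condition number, which is exactly what QR avoids by working with \(A\) directly:

```python
import numpy as np

t = np.linspace(0.0, 1.0, 50)
eps = 1e-4
A = np.column_stack([t, t + eps * np.sin(t)])  # two almost-identical columns

print(np.linalg.cond(A))        # large
print(np.linalg.cond(A.T @ A))  # roughly the square of the above
```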

9 Application Lens

In linear regression, the design matrix \(X\) plays the role of \(A\), the coefficient vector \(\beta\) plays the role of \(x\), and the target vector \(y\) plays the role of \(b\).

The fitted prediction \(X\hat{\beta}\) is the orthogonal projection of \(y\) onto the column space of \(X\).

That one sentence explains several practical facts:

  • residuals are orthogonal to every feature column
  • with an intercept column, residuals sum to zero
  • near-collinearity can make coefficient recovery and normal-equation solves unstable
  • overparameterized models push attention toward minimum-norm or regularized solutions
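
The first two facts are easy to verify numerically. Here is a quick check, assuming NumPy and a made-up design matrix with an intercept column and two random features:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X = np.column_stack([np.ones(n),               # intercept column
                     rng.standard_normal(n),   # feature 1
                     rng.standard_normal(n)])  # feature 2
y = 1.0 + 2.0 * X[:, 1] - 0.5 * X[:, 2] + rng.standard_normal(n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
r = y - X @ beta_hat

print(np.abs(X.T @ r).max())  # ~0: orthogonal to every feature column
print(r.sum())                # ~0: the intercept column forces zero-sum residuals
```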

For a fuller application walkthrough, see Linear Regression Through Projection. For an interactive picture, see Computation Lab: Projection Geometry and Regression Residuals.

10 Common Mistakes

  • thinking the normal equations are automatically the best numerical algorithm
  • forgetting that \(A^\top A\) is invertible only when the columns of \(A\) are linearly independent
  • confusing projection onto the column space with projection onto the row space
  • treating the residual as an arbitrary error vector instead of a vector forced into the orthogonal complement
  • using least squares as a default model choice even when the data are highly nonlinear, badly contaminated, or structurally constrained

11 Exercises

  1. Show that if the columns of \(A\) are orthonormal, then the least-squares solution is \(\hat{x} = A^\top b\).
  2. In the worked example above, verify directly that the residual is orthogonal to every vector in \(\operatorname{col}(A)\).
  3. Explain why an intercept term in linear regression forces the residuals to sum to zero.

12 Stop Here For First Pass

If you can now explain:

  • why best approximation becomes orthogonal projection
  • why the normal equations encode residual orthogonality
  • how the worked example connects to regression

then this page has done its main job.

13 Go Deeper

Open one of these only if you want a different learning mode:

  1. Projection Theorem and Normal Equations for the full derivation
  2. Linear Regression Through Projection for the cleanest application framing
  3. Computation Lab: Projection Geometry and Regression Residuals for visual and numerical intuition

14 Sources and Further Reading

Sources checked online on 2026-04-24:

  • MIT 18.06SC Linear Algebra resource index
  • MIT 18.06 Linear Algebra syllabus
  • Stanford Math 51
  • Hefferon, Linear Algebra
  • MIT 2.086 Unit 3 notes
  • JMLR 2016 randomized sketching paper
  • JMLR 2016 iterative Hessian sketch paper
  • JMLR 2023 benign overfitting paper