Projection Theorem and Normal Equations
proof, least squares, projection, normal equations
1 Claim
We want the exact theorem behind the slogan
\(\text{best approximation} = \text{orthogonal projection}\).
Theorem 1 (Projection Theorem and Normal Equations). Let \(A \in \mathbb{R}^{m \times n}\) and \(b \in \mathbb{R}^m\). For a vector \(\hat{x} \in \mathbb{R}^n\), the following are equivalent:
1. \(\hat{x}\) minimizes \(\|Ax - b\|_2^2\) over \(x \in \mathbb{R}^n\).
2. The residual \(r = b - A\hat{x}\) is orthogonal to \(\operatorname{col}(A)\).
3. \(A^\top A \hat{x} = A^\top b\) (the normal equations).
If the columns of \(A\) are linearly independent, then the minimizer is unique.
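Before the proof, the equivalence itself is easy to sanity-check numerically. Below is a minimal sketch, not part of the argument, assuming NumPy and a small random instance whose columns are (almost surely) independent:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 3))   # tall matrix, independent columns
b = rng.standard_normal(8)

# Condition 1: a minimizer of ||Ax - b||_2^2 from a library routine.
x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)

# Condition 2: the residual is orthogonal to col(A), i.e. A^T r = 0.
r = b - A @ x_hat
print(np.allclose(A.T @ r, 0))                   # True up to rounding

# Condition 3: the normal equations A^T A x_hat = A^T b hold.
print(np.allclose(A.T @ A @ x_hat, A.T @ b))     # True up to rounding
```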
2 Why This Proof Matters
This proof is worth learning because it gets reused everywhere:
- linear regression
- projection methods
- constrained optimization with quadratic objectives
- QR-based numerical methods
- randomized and sketched least-squares arguments
The reusable pattern is:
optimality can be checked by verifying that no feasible first-order perturbation decreases the objective.
3 Dependencies
- the column space of \(A\) is the set of all vectors \(Az\) with \(z \in \mathbb{R}^n\)
- orthogonality to a subspace means orthogonality to every vector in it
- \(\|u\|_2^2 = u^\top u\)
4 Strategy Before Details
The proof has two directions.
For \((2) \Rightarrow (1)\), assume the residual is already orthogonal to the column space and show that no perturbation can make the objective smaller.
For \((1) \Rightarrow (3)\), assume \(\hat{x}\) minimizes the objective and probe it along arbitrary directions \(\hat{x} + tz\).
That gives the normal equations, and the normal equations are exactly the matrix form of residual orthogonality.
5 Full Proof
Proof. Let
\[ f(x) = \|Ax - b\|_2^2. \]
First prove \((2) \Rightarrow (1)\).
Assume \(r = b - A\hat{x}\) is orthogonal to \(\operatorname{col}(A)\). Then for any \(h \in \mathbb{R}^n\),
\[ A(\hat{x} + h) - b = A\hat{x} - b + Ah = -r + Ah. \]
Hence
\[ \|A(\hat{x}+h)-b\|_2^2 = \|-r + Ah\|_2^2 = \|r\|_2^2 + \|Ah\|_2^2 - 2 r^\top Ah. \]
Because \(Ah \in \operatorname{col}(A)\) and \(r\) is orthogonal to \(\operatorname{col}(A)\), we have \(r^\top Ah = 0\). Therefore
\[ \|A(\hat{x}+h)-b\|_2^2 = \|r\|_2^2 + \|Ah\|_2^2 \ge \|r\|_2^2 = \|A\hat{x}-b\|_2^2. \]
So \(\hat{x}\) is a least-squares minimizer.
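The inequality hinges on the identity \(\|A(\hat{x}+h)-b\|_2^2 = \|r\|_2^2 + \|Ah\|_2^2\), which can be verified on a random instance. A minimal sketch, assuming NumPy and using a library least-squares routine to produce \(\hat{x}\):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((10, 4))
b = rng.standard_normal(10)
x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
r = b - A @ x_hat

h = rng.standard_normal(4)                         # arbitrary perturbation direction
lhs = np.linalg.norm(A @ (x_hat + h) - b) ** 2
rhs = np.linalg.norm(r) ** 2 + np.linalg.norm(A @ h) ** 2
print(np.isclose(lhs, rhs))                        # True: the cross term vanishes
```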
Now prove \((1) \Rightarrow (3)\).
Assume \(\hat{x}\) minimizes \(f\). Let \(z \in \mathbb{R}^n\) be arbitrary and consider
\[ g(t) = f(\hat{x} + tz) = \|A(\hat{x}+tz)-b\|_2^2. \]
Since \(t = 0\) is a minimizer of the scalar function \(g\), we must have \(g'(0)=0\).
Expand:
\[ g(t) = (A\hat{x}-b + tAz)^\top (A\hat{x}-b + tAz). \]
Differentiating and evaluating at \(t=0\) gives
\[ g'(0) = 2 z^\top A^\top(A\hat{x}-b) = 0. \]
Because this holds for every \(z\) (take, for example, \(z = A^\top(A\hat{x}-b)\)), we conclude
\[ A^\top(A\hat{x}-b)=0, \]
which is equivalent to
\[ A^\top A\hat{x} = A^\top b. \]
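As a numerical counterpart of this direction (a sketch, assuming NumPy and full column rank), the normal equations can be solved directly and compared against a library least-squares routine; the gradient \(2A^\top(A\hat{x}-b)\) is then zero up to rounding:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((12, 5))
b = rng.standard_normal(12)

# Solve the normal equations A^T A x = A^T b (A^T A is invertible here).
x_normal = np.linalg.solve(A.T @ A, A.T @ b)
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_normal, x_lstsq))                 # same minimizer
print(np.allclose(2 * A.T @ (A @ x_normal - b), 0))   # g'(0) = 0 in every direction
```

In floating-point practice one usually prefers a QR- or SVD-based routine such as np.linalg.lstsq over forming \(A^\top A\) explicitly, since squaring the condition number can cost accuracy; the explicit solve above is only for illustration.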
Finally, \((3) \Leftrightarrow (2)\) is immediate because
\[ A^\top(b-A\hat{x}) = 0 \]
means the residual has zero inner product with every column of \(A\), hence is orthogonal to \(\operatorname{col}(A)\).
For the uniqueness claim, suppose the columns of \(A\) are linearly independent. If \(v \neq 0\), then \(Av \neq 0\), so
\[ v^\top A^\top A v = \|Av\|_2^2 > 0. \]
Thus \(A^\top A\) is positive definite, hence invertible, so the normal equations have exactly one solution and the minimizer is unique.
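A quick numerical counterpart of the uniqueness argument (again a sketch, assuming NumPy): with independent columns the smallest eigenvalue of \(A^\top A\) is strictly positive, while duplicating a column forces a zero eigenvalue and a singular normal-equation matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
A_full = rng.standard_normal((10, 3))             # independent columns
A_def = np.column_stack([A_full, A_full[:, 0]])   # repeated column: rank deficient

# eigvalsh applies because A^T A is symmetric.
print(np.linalg.eigvalsh(A_full.T @ A_full).min() > 1e-10)   # True: positive definite
print(np.linalg.eigvalsh(A_def.T @ A_def).min() < 1e-10)     # True: singular
```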
6 Step Annotations
- The perturbation \(Ah\) stays inside the column space. That is the geometry hiding inside the algebra.
- The cross term disappears only because the residual is orthogonal to the whole column space, not just to one chosen vector.
- The derivative argument works because every direction \(z\) is allowed. That turns a scalar minimization fact into a vector optimality condition.
- The matrix \(A^\top A\) appears because the objective uses Euclidean norm and the feasible set is built from \(A\).
7 Why The Assumptions Matter
- We used the Euclidean norm. Different norms lead to different optimality conditions.
- We did not need full column rank for existence of a minimizer, only for uniqueness.
- The orthogonality statement depends on the standard inner product. In a weighted least-squares problem, the residual is instead orthogonal under a weighted geometry, as sketched below.
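For the weighted case in the last bullet, here is a minimal sketch, assuming NumPy and a diagonal weight matrix \(W\) with positive entries (both invented for illustration): minimizing \(\|W^{1/2}(Ax-b)\|_2^2\) leads to the weighted normal equations \(A^\top W A\hat{x} = A^\top W b\), and the residual is then orthogonal to \(\operatorname{col}(A)\) in the \(W\)-inner product rather than the standard one.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((9, 3))
b = rng.standard_normal(9)
w = rng.uniform(0.5, 2.0, size=9)        # positive weights, chosen arbitrarily
W = np.diag(w)

# Weighted normal equations: apply the ordinary theorem to W^{1/2} A and W^{1/2} b.
x_hat = np.linalg.solve(A.T @ W @ A, A.T @ W @ b)

r = b - A @ x_hat
print(np.allclose(A.T @ W @ r, 0))       # True: orthogonal in the W-inner product
print(np.allclose(A.T @ r, 0))           # typically False: standard orthogonality fails
```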
8 Common Failure Modes
- proving only \(A^\top r = 0\) without explaining why that means \(r \perp \operatorname{col}(A)\)
- assuming invertibility of \(A^\top A\) too early
- forgetting that \(\hat{x}\) can fail to be unique when the columns are dependent (see the sketch after this list)
- treating the derivative calculation as formal manipulation without justifying the arbitrary direction \(z\)
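The non-uniqueness failure mode is easy to see concretely. A minimal sketch, assuming NumPy: build a rank-deficient \(A\), take the minimizer returned by np.linalg.lstsq, and shift it by a null-space vector; both points achieve the same residual norm.

```python
import numpy as np

rng = np.random.default_rng(5)
B = rng.standard_normal((8, 2))
A = np.column_stack([B, B[:, 0] + B[:, 1]])   # third column is a sum of the first two
b = rng.standard_normal(8)

# lstsq still returns a minimizer (the minimum-norm one) even though A^T A is singular.
x1, *_ = np.linalg.lstsq(A, b, rcond=None)

# Shifting by a null-space direction gives a different minimizer with the same residual.
n = np.array([1.0, 1.0, -1.0])                # A @ n = 0 by construction
x2 = x1 + n
print(np.allclose(A @ n, 0))                                               # True
print(np.isclose(np.linalg.norm(A @ x1 - b), np.linalg.norm(A @ x2 - b)))  # True
```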
9 Reusable Pattern
This proof teaches a pattern that appears again and again:
- write the objective in a way that isolates the residual
- perturb the candidate solution in an arbitrary feasible direction
- use orthogonality or vanishing derivative to kill the cross term
- conclude optimality
10 Where This Shows Up Again
You will see the same structure in:
- projection theorems in Hilbert spaces
- quadratic optimization
- KKT systems for constrained least squares
- sketching analyses that compare exact and approximate minimizers
11 Exercises
- Adapt the proof to the special case where the columns of \(A\) are orthonormal.
- Show that if \(A\) has dependent columns, then two different minimizers must differ by a vector in \(\operatorname{null}(A)\).
12 Sources and Further Reading
- MIT 18.06SC Linear Algebra resource index - first pass: official lecture and problem flow for projections and least squares. Checked 2026-04-24.
- Hefferon, Linear Algebra - second pass: especially good for proof structure and exercises. Checked 2026-04-24.
- MIT 2.086 Unit 3 notes - paper bridge: useful for seeing how the same proof ideas lead into computational regression workflows. Checked 2026-04-24.