Projection Theorem and Normal Equations
proof, least squares, projection, normal equations
1 Claim
We want the exact theorem behind the slogan
\(\text{best approximation} = \text{orthogonal projection}\).
Theorem 1 (Projection Theorem and Normal Equations). Let \(A \in \mathbb{R}^{m \times n}\) and \(b \in \mathbb{R}^m\). For a vector \(\hat{x} \in \mathbb{R}^n\), the following are equivalent:
1. \(\hat{x}\) minimizes \(\|Ax - b\|_2^2\) over \(x \in \mathbb{R}^n\).
2. The residual \(r = b - A\hat{x}\) is orthogonal to \(\operatorname{col}(A)\).
3. \(A^\top A \hat{x} = A^\top b\) (the normal equations).
If the columns of \(A\) are linearly independent, then the minimizer is unique.
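Before the proof, the equivalence itself is easy to sanity-check numerically. Below is a minimal sketch, not part of the argument, assuming NumPy and a small random instance whose columns are (almost surely) independent:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 3))   # tall matrix, independent columns
b = rng.standard_normal(8)

# Condition 1: a minimizer of ||Ax - b||_2^2 from a library routine.
x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)

# Condition 2: the residual is orthogonal to col(A), i.e. A^T r = 0.
r = b - A @ x_hat
print(np.allclose(A.T @ r, 0))                   # True up to rounding

# Condition 3: the normal equations A^T A x_hat = A^T b hold.
print(np.allclose(A.T @ A @ x_hat, A.T @ b))     # True up to rounding
```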
2 Why This Proof Matters
This proof is worth learning because it gets reused everywhere:
- linear regression
- projection methods
- constrained optimization with quadratic objectives
- QR-based numerical methods
- randomized and sketched least-squares arguments
The reusable pattern is:
optimality can be checked by verifying that no feasible first-order perturbation decreases the objective.
3 Dependencies
- the column space of \(A\) is the set of all vectors \(Az\) with \(z \in \mathbb{R}^n\)
- orthogonality to a subspace means orthogonality to every vector in it
- \(\|u\|_2^2 = u^\top u\)
4 Strategy Before Details
The proof has two directions.
For \((2) \Rightarrow (1)\), assume the residual is already orthogonal to the column space and show that no perturbation can make the objective smaller.
For \((1) \Rightarrow (3)\), assume \(\hat{x}\) minimizes the objective and probe it along arbitrary directions \(\hat{x} + tz\).
That gives the normal equations, and the normal equations are exactly the matrix form of residual orthogonality.
5 Full Proof
Proof. Let
\[ f(x) = \|Ax - b\|_2^2. \]
First prove \((2) \Rightarrow (1)\).
Assume \(r = b - A\hat{x}\) is orthogonal to \(\operatorname{col}(A)\). Then for any \(h \in \mathbb{R}^n\),
\[ A(\hat{x} + h) - b = A\hat{x} - b + Ah = -r + Ah. \]
Hence
\[ \|A(\hat{x}+h)-b\|_2^2 = \|-r + Ah\|_2^2 = \|r\|_2^2 + \|Ah\|_2^2 - 2 r^\top Ah. \]
Because \(Ah \in \operatorname{col}(A)\) and \(r\) is orthogonal to \(\operatorname{col}(A)\), we have \(r^\top Ah = 0\). Therefore
\[ \|A(\hat{x}+h)-b\|_2^2 = \|r\|_2^2 + \|Ah\|_2^2 \ge \|r\|_2^2 = \|A\hat{x}-b\|_2^2. \]
So \(\hat{x}\) is a least-squares minimizer.
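The inequality hinges on the identity \(\|A(\hat{x}+h)-b\|_2^2 = \|r\|_2^2 + \|Ah\|_2^2\), which can be verified on a random instance. A minimal sketch, assuming NumPy and using a library least-squares routine to produce \(\hat{x}\):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((10, 4))
b = rng.standard_normal(10)
x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
r = b - A @ x_hat

h = rng.standard_normal(4)                         # arbitrary perturbation direction
lhs = np.linalg.norm(A @ (x_hat + h) - b) ** 2
rhs = np.linalg.norm(r) ** 2 + np.linalg.norm(A @ h) ** 2
print(np.isclose(lhs, rhs))                        # True: the cross term vanishes
```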
Now prove \((1) \Rightarrow (3)\).
Assume \(\hat{x}\) minimizes \(f\). Let \(z \in \mathbb{R}^n\) be arbitrary and consider
\[ g(t) = f(\hat{x} + tz) = \|A(\hat{x}+tz)-b\|_2^2. \]
Since \(t = 0\) is a minimizer of the scalar function \(g\), we must have \(g'(0)=0\).
Expand:
\[ g(t) = (A\hat{x}-b + tAz)^\top (A\hat{x}-b + tAz). \]
Differentiating and evaluating at \(t=0\) gives
\[ g'(0) = 2 z^\top A^\top(A\hat{x}-b) = 0. \]
Because this holds for every \(z\) (take, for example, \(z = A^\top(A\hat{x}-b)\)), we conclude
\[ A^\top(A\hat{x}-b)=0, \]
which is equivalent to
\[ A^\top A\hat{x} = A^\top b. \]
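As a numerical counterpart of this direction (a sketch, assuming NumPy and full column rank), the normal equations can be solved directly and compared against a library least-squares routine; the gradient \(2A^\top(A\hat{x}-b)\) is then zero up to rounding:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((12, 5))
b = rng.standard_normal(12)

# Solve the normal equations A^T A x = A^T b (A^T A is invertible here).
x_normal = np.linalg.solve(A.T @ A, A.T @ b)
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_normal, x_lstsq))                 # same minimizer
print(np.allclose(2 * A.T @ (A @ x_normal - b), 0))   # g'(0) = 0 in every direction
```

In floating-point practice one usually prefers a QR- or SVD-based routine such as np.linalg.lstsq over forming \(A^\top A\) explicitly, since squaring the condition number can cost accuracy; the explicit solve above is only for illustration.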
Finally, \((3) \Leftrightarrow (2)\) is immediate because
\[ A^\top(b-A\hat{x}) = 0 \]
means the residual has zero inner product with every column of \(A\), hence is orthogonal to \(\operatorname{col}(A)\).
For the uniqueness claim, suppose the columns of \(A\) are linearly independent. If \(v \neq 0\), then \(Av \neq 0\), so
\[ v^\top A^\top A v = \|Av\|_2^2 > 0. \]
Thus \(A^\top A\) is positive definite, hence invertible, so the normal equations have exactly one solution and the minimizer is unique.
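A quick numerical counterpart of the uniqueness argument (again a sketch, assuming NumPy): with independent columns the smallest eigenvalue of \(A^\top A\) is strictly positive, while duplicating a column forces a zero eigenvalue and a singular normal-equation matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
A_full = rng.standard_normal((10, 3))             # independent columns
A_def = np.column_stack([A_full, A_full[:, 0]])   # repeated column: rank deficient

# eigvalsh applies because A^T A is symmetric.
print(np.linalg.eigvalsh(A_full.T @ A_full).min() > 1e-10)   # True: positive definite
print(np.linalg.eigvalsh(A_def.T @ A_def).min() < 1e-10)     # True: singular
```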
6 Step Annotations
- The perturbation \(Ah\) stays inside the column space. That is the geometry hiding inside the algebra.
- The cross term disappears only because the residual is orthogonal to the whole column space, not just to one chosen vector.
- The derivative argument works because every direction \(z\) is allowed. That turns a scalar minimization fact into a vector optimality condition.
- The matrix \(A^\top A\) appears because the objective uses Euclidean norm and the feasible set is built from \(A\).
7 Why The Assumptions Matter
- We used the Euclidean norm. Different norms lead to different optimality conditions.
- We did not need full column rank for existence of a minimizer, only for uniqueness.
- The orthogonality statement depends on the standard inner product. In a weighted least-squares problem, the residual is instead orthogonal under a weighted geometry, as sketched below.
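For the weighted case in the last bullet, here is a minimal sketch, assuming NumPy and a diagonal weight matrix \(W\) with positive entries (both invented for illustration): minimizing \(\|W^{1/2}(Ax-b)\|_2^2\) leads to the weighted normal equations \(A^\top W A\hat{x} = A^\top W b\), and the residual is then orthogonal to \(\operatorname{col}(A)\) in the \(W\)-inner product rather than the standard one.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((9, 3))
b = rng.standard_normal(9)
w = rng.uniform(0.5, 2.0, size=9)        # positive weights, chosen arbitrarily
W = np.diag(w)

# Weighted normal equations: apply the ordinary theorem to W^{1/2} A and W^{1/2} b.
x_hat = np.linalg.solve(A.T @ W @ A, A.T @ W @ b)

r = b - A @ x_hat
print(np.allclose(A.T @ W @ r, 0))       # True: orthogonal in the W-inner product
print(np.allclose(A.T @ r, 0))           # typically False: standard orthogonality fails
```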
8 Common Failure Modes
- proving only \(A^\top r = 0\) without explaining why that means \(r \perp \operatorname{col}(A)\)
- assuming invertibility of \(A^\top A\) too early
- forgetting that \(\hat{x}\) can fail to be unique when the columns are dependent (see the sketch after this list)
- treating the derivative calculation as formal manipulation without justifying the arbitrary direction \(z\)
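The non-uniqueness failure mode is easy to see concretely. A minimal sketch, assuming NumPy: build a rank-deficient \(A\), take the minimizer returned by np.linalg.lstsq, and shift it by a null-space vector; both points achieve the same residual norm.

```python
import numpy as np

rng = np.random.default_rng(5)
B = rng.standard_normal((8, 2))
A = np.column_stack([B, B[:, 0] + B[:, 1]])   # third column is a sum of the first two
b = rng.standard_normal(8)

# lstsq still returns a minimizer (the minimum-norm one) even though A^T A is singular.
x1, *_ = np.linalg.lstsq(A, b, rcond=None)

# Shifting by a null-space direction gives a different minimizer with the same residual.
n = np.array([1.0, 1.0, -1.0])                # A @ n = 0 by construction
x2 = x1 + n
print(np.allclose(A @ n, 0))                                               # True
print(np.isclose(np.linalg.norm(A @ x1 - b), np.linalg.norm(A @ x2 - b)))  # True
```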
9 Reusable Pattern
This proof teaches a pattern that appears again and again:
- write the objective in a way that isolates the residual
- perturb the candidate solution in an arbitrary feasible direction
- use orthogonality or vanishing derivative to kill the cross term
- conclude optimality
10 Where This Shows Up Again
You will see the same structure in:
- projection theorems in Hilbert spaces
- quadratic optimization
- KKT systems for constrained least squares
- sketching analyses that compare exact and approximate minimizers
11 Exercises
- Adapt the proof to the special case where the columns of \(A\) are orthonormal.
- Show that if \(A\) has dependent columns, then two different minimizers must differ by a vector in \(\operatorname{null}(A)\).
12 Sources and Further Reading
- MIT 18.06SC Linear Algebra resource index - first pass: official lecture and problem flow for projections and least squares. Checked 2026-04-24.
- Hefferon, Linear Algebra - second pass: especially good for proof structure and exercises. Checked 2026-04-24.
- MIT 2.086 Unit 3 notes - paper bridge: useful for seeing how the same proof ideas lead into computational regression workflows. Checked 2026-04-24.