Linear Regression Through Projection

A concrete application page showing how orthogonality and least squares become linear regression.
Modified: April 26, 2026

Keywords: application, regression, projection, least squares

1 Application Snapshot

Linear regression is the cleanest applied face of least squares: the prediction vector is the orthogonal projection of the observed response onto the column space of the design matrix.

That one statement explains fitted values, residual structure, computational choices, and why the same geometry keeps reappearing in modern large-scale and overparameterized regression.

2 Problem Setting

Suppose we observe inputs \(x_1,\dots,x_m\), each a vector of \(p\) features, together with responses \(y_1,\dots,y_m\), and we want a linear model

\[ \hat{y}_i = \beta_0 + \beta_1 x_i^{(1)} + \cdots + \beta_p x_i^{(p)}. \]

Stacking the data gives the matrix equation

\[ X\beta \approx y, \]

where \(X\) is the design matrix, \(\beta\) is the coefficient vector, and \(y\) is the response vector.

When the system is overdetermined or the data are noisy, exact equality usually fails. Linear regression asks for the coefficient vector that minimizes the residual sum of squares:

\[ \hat{\beta} \in \arg\min_\beta \|X\beta - y\|_2^2. \]
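In code, this minimization is a single call. Here is a minimal sketch assuming NumPy, with placeholder data values chosen only for illustration:

```python
import numpy as np

# Placeholder data: m = 4 observations of one raw input t.
t = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 2.0, 4.0])

# Design matrix: an intercept column stacked with the feature column.
X = np.column_stack([np.ones_like(t), t])

# lstsq minimizes ||X beta - y||_2 over beta directly.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
```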

3 Why This Math Appears

The model predictions \(X\beta\) live in the column space of \(X\).

So regression is not just “fit a line” or “fit a hyperplane.” It is:

project the observed response vector \(y\) onto the subspace of responses that the model can express.

Orthogonality then explains the fitted solution:

  • \(X\hat{\beta}\) is the projection of \(y\) onto \(\operatorname{col}(X)\)
  • \(r = y - X\hat{\beta}\) is orthogonal to every feature column
  • the normal equations \(X^\top X \hat{\beta} = X^\top y\) are exactly the matrix form of that orthogonality
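One line of calculus makes the last point explicit. Differentiating the squared residual norm and setting the gradient to zero gives

\[ \nabla_\beta \|X\beta - y\|_2^2 = 2X^\top(X\beta - y) = 0 \quad\Longleftrightarrow\quad X^\top X \hat{\beta} = X^\top y, \]

which is exactly the statement \(X^\top r = 0\): the residual has zero inner product with every column of \(X\).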

4 Math Objects In Use

  • the design matrix \(X\)
  • the column space \(\operatorname{col}(X)\)
  • the residual vector \(r = y - X\hat{\beta}\)
  • the Gram matrix \(X^\top X\)
  • orthogonal projection
  • QR or SVD for stable computation
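To see these objects interact, one can build the projection (hat) matrix explicitly. This is a pedagogical sketch only: forming \((X^\top X)^{-1}\) is exactly what the computation note below advises against in practice. The matrix here is the design matrix from the walkthrough in the next section.

```python
import numpy as np

# Design matrix from the worked walkthrough below.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])

# Hat matrix: the orthogonal projector onto col(X).
# (Pedagogical only: explicit inverses are avoided in practice.)
H = X @ np.linalg.inv(X.T @ X) @ X.T

# Projector sanity checks: idempotent and symmetric.
print(np.allclose(H @ H, H))   # True
print(np.allclose(H.T, H))     # True
```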

5 Worked Walkthrough

Take the three points (0,1), (1,2), (2,2) and fit a line \(y \approx \beta_0 + \beta_1 t\).

Then

\[ X = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{bmatrix}, \qquad y = \begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix}. \]

The normal equations are

\[ X^\top X \hat{\beta} = X^\top y, \]

with

\[ X^\top X = \begin{bmatrix} 3 & 3 \\ 3 & 5 \end{bmatrix}, \qquad X^\top y = \begin{bmatrix} 5 \\ 6 \end{bmatrix}. \]

Solving gives

\[ \hat{\beta} = \begin{bmatrix} 7/6 \\ 1/2 \end{bmatrix}. \]

So the fitted line is

\[ \hat{y}(t) = \frac{7}{6} + \frac{1}{2} t. \]

The residual vector is

\[ r = \begin{bmatrix} -1/6 \\ 1/3 \\ -1/6 \end{bmatrix}. \]

Because the first column of \(X\) is the intercept column, \(\mathbf{1}^\top r = 0\), so the residuals sum to zero.

Because the second column is \((0,1,2)^\top\), we also have

\[ \begin{bmatrix} 0 & 1 & 2 \end{bmatrix} r = 0. \]

So the regression residual is orthogonal to both explanatory directions used by the model.
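These orthogonality checks are easy to reproduce numerically. A minimal NumPy sketch of the walkthrough:

```python
import numpy as np

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 2.0, 2.0])

# Normal equations: X^T X beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)      # [7/6, 1/2] = [1.1666..., 0.5]

# Residual, and its orthogonality to both columns of X.
r = y - X @ beta_hat
print(r)             # [-1/6, 1/3, -1/6]
print(X.T @ r)       # [0, 0] up to floating-point error
```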

That already shows the application-specific payoff of the geometry:

  • the intercept column forces a mean-zero residual
  • the feature column forces a zero weighted residual in the fitted direction
  • the fitted values returned by software are a projection of \(y\) onto \(\operatorname{col}(X)\), not just a pair of coefficients

6 Implementation or Computation Note

The formula

\[ \hat{\beta} = (X^\top X)^{-1}X^\top y \]

is mathematically convenient but only valid when \(X\) has full column rank, so that \(X^\top X\) is invertible.

Even in that case, it is computationally fragile.

In practice:

  • avoid explicit matrix inverses
  • prefer QR for standard full-rank least squares
  • use SVD when rank deficiency or near-collinearity matters
  • scale and center thoughtfully when the feature magnitudes differ a lot

This is why linear regression belongs to both statistics and numerical linear algebra.
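As a sketch of the stability gap, the following compares the normal-equation route with a QR-based solve on an illustrative near-collinear design (the matrix and perturbation size are arbitrary choices for demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two nearly collinear columns: cond(X) is large, cond(X^T X) is its square.
t = rng.standard_normal(100)
X = np.column_stack([t, t + 1e-6 * rng.standard_normal(100)])
y = X @ np.array([1.0, 1.0])   # exact coefficients are [1, 1] by construction

# Route 1: normal equations. Squaring the condition number costs digits.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Route 2: QR. Solve R beta = Q^T y, working at cond(X) rather than cond(X)^2.
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)

print(beta_normal)  # typically off in several digits
print(beta_qr)      # typically much closer to [1, 1]
```

NumPy's own `np.linalg.lstsq` routes through an SVD-based LAPACK driver, which is one reason it degrades gracefully under rank deficiency.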

If you want to inspect these residual conditions directly, use the paired Computation Lab: Projection Geometry and Regression Residuals.

7 Failure Modes

  • collinearity: if columns of \(X\) are nearly dependent, the fitted coefficients can be unstable
  • outliers: squared loss can let a few large residuals dominate the fit
  • model mismatch: projection only finds the best fit inside the chosen model class
  • overparameterization: when \(p \ge m\), there can be many interpolating solutions, and the geometry shifts toward minimum-norm or regularized solutions (a sketch follows this list)
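For the overparameterization point above, here is a minimal sketch of the minimum-norm interpolant via the pseudoinverse (the dimensions are chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Underdetermined: p = 10 coefficients, only m = 4 observations.
X = rng.standard_normal((4, 10))
y = rng.standard_normal(4)

# The pseudoinverse selects the interpolating solution of least 2-norm.
beta_min = np.linalg.pinv(X) @ y

print(X @ beta_min - y)          # ~0: the model interpolates exactly
print(np.linalg.norm(beta_min))  # the smallest norm among all interpolants
```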

8 Try It

  1. Recompute the worked example after replacing the last response value 2 by 4. Track how the slope, intercept, and residual orthogonality checks change.
  2. Add an intercept-free model and compare the residual sum and residual pattern.
  3. Solve the same least-squares problem once with the normal equations and once with a QR factorization in software, then compare what each method is actually computing.
