Linear Regression Through Projection

A concrete application page showing how orthogonality and least squares become linear regression.
Modified: April 26, 2026

Keywords: application, regression, projection, least squares

1 Application Snapshot

Linear regression is the cleanest applied face of least squares: the prediction vector is the orthogonal projection of the observed response onto the column space of the design matrix.

That one statement explains fitted values, residual structure, computational choices, and why the same geometry keeps reappearing in modern large-scale and overparameterized regression.

2 Problem Setting

Suppose we observe inputs \(x_1,\dots,x_m\), each a vector of \(p\) features, together with responses \(y_1,\dots,y_m\), and we want a linear model

\[ \hat{y}_i = \beta_0 + \beta_1 x_i^{(1)} + \cdots + \beta_p x_i^{(p)}. \]

Stacking the data gives the matrix equation

\[ X\beta \approx y, \]

where \(X\) is the design matrix, \(\beta\) is the coefficient vector, and \(y\) is the response vector.

When the system is overdetermined or the data are noisy, exact equality usually fails. Linear regression asks for the coefficient vector that minimizes the residual sum of squares:

\[ \hat{\beta} \in \arg\min_\beta \|X\beta - y\|_2^2. \]
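In code, this minimization is a single call. Here is a minimal sketch assuming NumPy, with placeholder data values chosen only for illustration:

```python
import numpy as np

# Placeholder data: m = 4 observations of one raw input t.
t = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 2.0, 4.0])

# Design matrix: an intercept column stacked with the feature column.
X = np.column_stack([np.ones_like(t), t])

# lstsq minimizes ||X beta - y||_2 over beta directly.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
```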

3 Why This Math Appears

The model predictions \(X\beta\) live in the column space of \(X\).

So regression is not just “fit a line” or “fit a hyperplane.” It is:

project the observed response vector \(y\) onto the subspace of responses that the model can express.

Orthogonality then explains the fitted solution:

  • \(X\hat{\beta}\) is the projection of \(y\) onto \(\operatorname{col}(X)\)
  • \(r = y - X\hat{\beta}\) is orthogonal to every feature column
  • the normal equations \(X^\top X \hat{\beta} = X^\top y\) are exactly the matrix form of that orthogonality
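One line of calculus makes the last point explicit. Differentiating the squared residual norm and setting the gradient to zero gives

\[ \nabla_\beta \|X\beta - y\|_2^2 = 2X^\top(X\beta - y) = 0 \quad\Longleftrightarrow\quad X^\top X \hat{\beta} = X^\top y, \]

which is exactly the statement \(X^\top r = 0\): the residual has zero inner product with every column of \(X\).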

4 Math Objects In Use

  • the design matrix \(X\)
  • the column space \(\operatorname{col}(X)\)
  • the residual vector \(r = y - X\hat{\beta}\)
  • the Gram matrix \(X^\top X\)
  • orthogonal projection
  • QR or SVD for stable computation
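To see these objects interact, one can build the projection (hat) matrix explicitly. This is a pedagogical sketch only: forming \((X^\top X)^{-1}\) is exactly what the computation note below advises against in practice. The matrix here is the design matrix from the walkthrough in the next section.

```python
import numpy as np

# Design matrix from the worked walkthrough below.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])

# Hat matrix: the orthogonal projector onto col(X).
# (Pedagogical only: explicit inverses are avoided in practice.)
H = X @ np.linalg.inv(X.T @ X) @ X.T

# Projector sanity checks: idempotent and symmetric.
print(np.allclose(H @ H, H))   # True
print(np.allclose(H.T, H))     # True
```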

5 Worked Walkthrough

Take the three points (0,1), (1,2), (2,2) and fit a line \(y \approx \beta_0 + \beta_1 t\).

Then

\[ X = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{bmatrix}, \qquad y = \begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix}. \]

The normal equations are

\[ X^\top X \hat{\beta} = X^\top y, \]

with

\[ X^\top X = \begin{bmatrix} 3 & 3 \\ 3 & 5 \end{bmatrix}, \qquad X^\top y = \begin{bmatrix} 5 \\ 6 \end{bmatrix}. \]

Solving gives

\[ \hat{\beta} = \begin{bmatrix} 7/6 \\ 1/2 \end{bmatrix}. \]

So the fitted line is

\[ \hat{y}(t) = \frac{7}{6} + \frac{1}{2} t. \]

The residual vector is

\[ r = \begin{bmatrix} -1/6 \\ 1/3 \\ -1/6 \end{bmatrix}. \]

Because the first column of \(X\) is the intercept column, \(\mathbf{1}^\top r = 0\), so the residuals sum to zero.

Because the second column is \((0,1,2)^\top\), we also have

\[ \begin{bmatrix} 0 & 1 & 2 \end{bmatrix} r = 0. \]

So the regression residual is orthogonal to both explanatory directions used by the model.
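These orthogonality checks are easy to reproduce numerically. A minimal NumPy sketch of the walkthrough:

```python
import numpy as np

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 2.0, 2.0])

# Normal equations: X^T X beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)      # [7/6, 1/2] = [1.1666..., 0.5]

# Residual, and its orthogonality to both columns of X.
r = y - X @ beta_hat
print(r)             # [-1/6, 1/3, -1/6]
print(X.T @ r)       # [0, 0] up to floating-point error
```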

That already shows the application-specific payoff of the geometry:

  • the intercept column forces a mean-zero residual
  • the feature column forces a zero weighted residual in the fitted direction
  • the fitted values returned by software are a projection of \(y\) onto \(\operatorname{col}(X)\), not just a pair of coefficients

6 Implementation or Computation Note

The formula

\[ \hat{\beta} = (X^\top X)^{-1}X^\top y \]

is mathematically convenient but only valid when \(X\) has full column rank, so that \(X^\top X\) is invertible.

Even in that case, it is computationally fragile.

In practice:

  • avoid explicit matrix inverses
  • prefer QR for standard full-rank least squares
  • use SVD when rank deficiency or near-collinearity matters
  • scale and center thoughtfully when the feature magnitudes differ a lot

This is why linear regression belongs to both statistics and numerical linear algebra.
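As a sketch of the stability gap, the following compares the normal-equation route with a QR-based solve on an illustrative near-collinear design (the matrix and perturbation size are arbitrary choices for demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two nearly collinear columns: cond(X) is large, cond(X^T X) is its square.
t = rng.standard_normal(100)
X = np.column_stack([t, t + 1e-6 * rng.standard_normal(100)])
y = X @ np.array([1.0, 1.0])   # exact coefficients are [1, 1] by construction

# Route 1: normal equations. Squaring the condition number costs digits.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Route 2: QR. Solve R beta = Q^T y, working at cond(X) rather than cond(X)^2.
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)

print(beta_normal)  # typically off in several digits
print(beta_qr)      # typically much closer to [1, 1]
```

NumPy's own `np.linalg.lstsq` routes through an SVD-based LAPACK driver, which is one reason it degrades gracefully under rank deficiency.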

If you want to inspect these residual conditions directly, use the paired Computation Lab: Projection Geometry and Regression Residuals.

7 Failure Modes

  • collinearity: if columns of \(X\) are nearly dependent, the fitted coefficients can be unstable
  • outliers: squared loss can let a few large residuals dominate the fit
  • model mismatch: projection only finds the best fit inside the chosen model class
  • overparameterization: when \(p \ge m\), there can be many interpolating solutions, and the geometry shifts toward minimum-norm or regularized solutions (a sketch follows this list)
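For the overparameterization point above, here is a minimal sketch of the minimum-norm interpolant via the pseudoinverse (the dimensions are chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Underdetermined: p = 10 coefficients, only m = 4 observations.
X = rng.standard_normal((4, 10))
y = rng.standard_normal(4)

# The pseudoinverse selects the interpolating solution of least 2-norm.
beta_min = np.linalg.pinv(X) @ y

print(X @ beta_min - y)          # ~0: the model interpolates exactly
print(np.linalg.norm(beta_min))  # the smallest norm among all interpolants
```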

8 Try It

  1. Recompute the worked example after replacing the last response value 2 by 4. Track how the slope, intercept, and residual orthogonality checks change.
  2. Add an intercept-free model and compare the residual sum and residual pattern.
  3. Solve the same least-squares problem once with the normal equations and once with a QR factorization in software, then compare what each method is actually computing.
