Linear Regression Through Projection
application, regression, projection, least squares
1 Application Snapshot
Linear regression is the cleanest applied face of least squares: the prediction vector is the orthogonal projection of the observed response onto the column space of the design matrix.
That one statement explains fitted values, residual structure, computational choices, and why the same geometry keeps reappearing in modern large-scale and overparameterized regression.
2 Problem Setting
Suppose we observe feature vectors \(x_1,\dots,x_m\), each with components \(x_i^{(1)},\dots,x_i^{(p)}\), and responses \(y_1,\dots,y_m\), and we want a linear model
\[ \hat{y}_i = \beta_0 + \beta_1 x_i^{(1)} + \cdots + \beta_p x_i^{(p)}. \]
Stacking the data gives the matrix equation
\[ X\beta \approx y, \]
where \(X\) is the design matrix, \(\beta\) is the coefficient vector, and \(y\) is the response vector.
When the system is overdetermined or noisy, exact equality usually fails. Linear regression asks for the coefficient vector that minimizes the residual sum of squares:
\[ \hat{\beta} \in \arg\min_\beta \|X\beta - y\|_2^2. \]
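A minimal sketch of this objective, assuming NumPy (the data below are made up for illustration; `np.linalg.lstsq` is one standard routine that returns a minimizer):

```python
import numpy as np

# Made-up data: m = 5 observations, one feature plus an intercept column.
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(t), t])    # design matrix
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])      # response vector

# Minimize ||X beta - y||_2^2.
beta_hat, rss, rank, svals = np.linalg.lstsq(X, y, rcond=None)
print("beta_hat:", beta_hat)
print("residual sum of squares:", rss)
```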
3 Why This Math Appears
The model predictions \(X\beta\) live in the column space of \(X\).
So regression is not just “fit a line” or “fit a hyperplane.” It is:
project the observed response vector \(y\) onto the subspace of responses that the model can express.
Orthogonality then explains the fitted solution:
- \(X\hat{\beta}\) is the projection of \(y\) onto \(\operatorname{col}(X)\)
- \(r = y - X\hat{\beta}\) is orthogonal to every feature column
- the normal equations \(X^\top X \hat{\beta} = X^\top y\) are exactly the matrix form of that orthogonality
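These three statements are easy to check numerically; a small sketch with made-up data, assuming NumPy:

```python
import numpy as np

X = np.array([[1.0, 0.2],
              [1.0, 1.1],
              [1.0, 1.9],
              [1.0, 3.0]])
y = np.array([0.5, 1.4, 2.1, 3.3])

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat            # projection of y onto col(X)
r = y - fitted                   # residual vector

print(X.T @ r)                                     # ~ [0, 0]: r is orthogonal to every column
print(np.allclose(X.T @ X @ beta_hat, X.T @ y))    # the normal equations hold
```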
4 Math Objects In Use
- the design matrix \(X\)
- the column space \(\operatorname{col}(X)\)
- the residual vector \(r = y - X\hat{\beta}\)
- the Gram matrix \(X^\top X\)
- orthogonal projection
- QR or SVD for stable computation
5 Worked Walkthrough
Take the three points (0,1), (1,2), (2,2) and fit a line \(y \approx \beta_0 + \beta_1 t\).
Then
\[ X = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{bmatrix}, \qquad y = \begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix}. \]
The normal equations are
\[ X^\top X \hat{\beta} = X^\top y, \]
with
\[ X^\top X = \begin{bmatrix} 3 & 3 \\ 3 & 5 \end{bmatrix}, \qquad X^\top y = \begin{bmatrix} 5 \\ 6 \end{bmatrix}. \]
Solving gives
\[ \hat{\beta} = \begin{bmatrix} 7/6 \\ 1/2 \end{bmatrix}. \]
So the fitted line is
\[ \hat{y}(t) = \frac{7}{6} + \frac{1}{2} t. \]
The residual vector is
\[ r = \begin{bmatrix} -1/6 \\ 1/3 \\ -1/6 \end{bmatrix}. \]
Because the first column of \(X\) is the intercept column, \(\mathbf{1}^\top r = 0\), so the residuals sum to zero.
Because the second column is \((0,1,2)^\top\), we also have
\[ \begin{bmatrix} 0 & 1 & 2 \end{bmatrix} r = 0. \]
So the regression residual is orthogonal to both explanatory directions used by the model.
That already shows the application-specific payoff of the geometry:
- the intercept column forces a mean-zero residual
- the feature column forces a zero weighted residual in the fitted direction
- the model output returned by software is a projection object, not just a pair of coefficients
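A short NumPy sketch that replays the walkthrough and both orthogonality checks (same data as above):

```python
import numpy as np

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 2.0, 2.0])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # normal equations, fine at this size
r = y - X @ beta_hat

print(beta_hat)                        # approximately [7/6, 1/2]
print(r)                               # approximately [-1/6, 1/3, -1/6]
print(np.ones(3) @ r)                  # intercept column: residuals sum to zero
print(np.array([0.0, 1.0, 2.0]) @ r)   # feature column: weighted residual is zero
```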
6 Implementation or Computation Note
The formula
\[ \hat{\beta} = (X^\top X)^{-1}X^\top y \]
is mathematically convenient but only valid when \(X\) has full column rank, so that \(X^\top X\) is invertible.
Even in that case, it is computationally fragile.
In practice:
- avoid explicit matrix inverses
- prefer QR for standard full-rank least squares
- use SVD when rank deficiency or near-collinearity matters
- scale and center thoughtfully when the feature magnitudes differ a lot
This is why linear regression belongs to both statistics and numerical linear algebra.
If you want to inspect these residual conditions directly, use the paired Computation Lab: Projection Geometry and Regression Residuals.
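A minimal sketch comparing the three solution routes, assuming NumPy; on a well-conditioned full-rank problem they agree, and the differences only show up as conditioning degrades:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)

# Normal equations: convenient, but squares the condition number of X.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# QR: the standard stable route for full-column-rank least squares.
Q, R = np.linalg.qr(X)                  # reduced QR factorization
beta_qr = np.linalg.solve(R, Q.T @ y)   # R is upper triangular

# SVD (inside lstsq): handles rank deficiency and near-collinearity.
beta_svd, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_normal, beta_qr), np.allclose(beta_qr, beta_svd))
```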
7 Failure Modes
- collinearity: if columns of \(X\) are nearly dependent, the fitted coefficients can be unstable
- outliers: squared loss can let a few large residuals dominate the fit
- model mismatch: projection only finds the best fit inside the chosen model class
- overparameterization: when \(p \ge m\), there can be many interpolating solutions, and the geometry shifts toward minimum-norm or regularized solutions
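The overparameterized case is the easiest to see in a sketch; with made-up data, the pseudoinverse exposes the minimum-norm interpolating solution directly (assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(2)
m, p = 5, 20                     # more parameters than observations
X = rng.normal(size=(m, p))
y = rng.normal(size=m)

# Many beta satisfy X beta = y exactly; the pseudoinverse picks the
# minimum-norm one, and lstsq returns the same solution.
beta_min_norm = np.linalg.pinv(X) @ y
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(X @ beta_min_norm, y))       # interpolates the data exactly
print(np.allclose(beta_min_norm, beta_lstsq))  # same minimum-norm solution
```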
8 Paper Bridge
- A Statistical Perspective on Randomized Sketching for Ordinary Least-Squares - keeps the same regression objective while shrinking the data through sketching.
- The Implicit Bias of Benign Overfitting - shows how minimum-norm interpolation changes the regression story in high dimensions.
- Benign Overfitting of Constant-Stepsize SGD for Linear Regression - connects linear regression geometry to algorithmic regularization and modern optimization behavior.
9 Try It
- Recompute the worked example after replacing the last response value 2 by 4 (a starter sketch follows this list). Track how the slope, intercept, and residual orthogonality checks change.
- Add an intercept-free model and compare the residual sum and residual pattern.
- Solve the same least-squares problem once with the normal equations and once with a QR factorization in software, then compare what each method is actually computing.
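A starter sketch for the first exercise, assuming NumPy (only the last response value differs from the worked example):

```python
import numpy as np

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 2.0, 4.0])    # last response changed from 2 to 4

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
r = y - X @ beta_hat

print(beta_hat)                                        # compare with [7/6, 1/2]
print(np.ones(3) @ r, np.array([0.0, 1.0, 2.0]) @ r)   # orthogonality checks
```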
10 Sources and Further Reading
- MIT 18.06 Linear Algebra syllabus - First pass - explicitly highlights least squares as "closest line by understanding projections." Checked 2026-04-24.
- Stanford Math 51 - First pass - current official course framing that places least squares and linear regression inside applied linear algebra. Checked 2026-04-24.
- MIT 2.086 Unit 3 notes - Second pass - useful for the regression-plus-numerics viewpoint. Checked 2026-04-24.
- A Statistical Perspective on Randomized Sketching for Ordinary Least-Squares - Paper bridge - a strong next step once the projection view of regression is comfortable. Checked 2026-04-24.