Kernel Methods and Similarity Geometry
Keywords: kernels, Gram matrix, feature map, similarity, kernel trick
1 Application Snapshot
Kernel methods let a model stay linear in a feature space while behaving nonlinearly in the original input space.
The central move is:
replace raw dot products with a similarity function that acts like a dot product after an implicit feature map.
That turns geometry into prediction.
2 Problem Setting
Suppose we want a predictor that depends on how similar a new point \(x\) is to training points \(x_1,\dots,x_n\).
Instead of using only raw coordinates, we define a kernel
\[ K(x,z) = \langle \phi(x), \phi(z) \rangle, \]
where \(\phi\) is a possibly very large feature map that we may never write down explicitly.
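To see the implicit feature map concretely, consider a standard small example (not taken from the text above): for scalar inputs, the polynomial kernel \(K(x,z) = (1+xz)^2\) equals \(\langle \phi(x), \phi(z) \rangle\) with \(\phi(x) = (1, \sqrt{2}\,x, x^2)\). The sketch below, assuming NumPy, checks both sides numerically.

```python
import numpy as np

# Illustrative example: for scalar inputs, the polynomial kernel
# K(x, z) = (1 + x z)^2 equals the dot product of the explicit feature map
# phi(x) = (1, sqrt(2) x, x^2), so the kernel computes <phi(x), phi(z)>
# without ever forming phi.

def poly_kernel(x, z):
    return (1.0 + x * z) ** 2

def phi(x):
    return np.array([1.0, np.sqrt(2.0) * x, x ** 2])

x, z = 0.7, -1.3
print(poly_kernel(x, z))   # kernel evaluation: 0.0081
print(phi(x) @ phi(z))     # explicit feature-space dot product: same value
```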
Then many prediction rules can be expressed in the form
\[ f(x) = \sum_{i=1}^n \alpha_i K(x_i, x). \]
So the prediction is built from weighted similarities to training examples.
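To make the rule concrete, here is a minimal Python sketch, assuming NumPy and the Gaussian kernel used later in this note; the coefficients `alpha` are placeholder values, not fitted by any particular algorithm.

```python
import numpy as np

# Minimal sketch of the prediction rule f(x) = sum_i alpha_i K(x_i, x).
# The coefficients alpha are illustrative placeholders, not fitted values.

def gaussian_kernel(x, z):
    return np.exp(-(x - z) ** 2)

def predict(x, train_x, alpha):
    # weighted sum of kernel similarities to the training points
    return sum(a * gaussian_kernel(xi, x) for a, xi in zip(alpha, train_x))

train_x = np.array([0.0, 1.0, 2.0])
alpha = np.array([0.1, 0.8, 0.3])   # illustrative coefficients
print(predict(1.2, train_x, alpha))
```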
3 Why This Math Appears
Kernel methods sit at a useful meeting point:
- Linear Algebra: the kernel matrix, or Gram matrix, stores all pairwise similarities
- Statistics: predictions are built from sample points, so regularization and model complexity matter
- Optimization: many kernel methods are learned by solving regularized objectives or dual problems
- Probability: smooth kernels often encode locality or similarity assumptions about data
This is why kernels are such a good bridge topic. They turn abstract geometry into a concrete prediction mechanism.
4 Math Objects In Use
- feature map \(\phi(x)\)
- kernel \(K(x,z)\)
- Gram matrix \(G\) with entries \(G_{ij}=K(x_i,x_j)\)
- coefficients \(\alpha_i\)
- prediction rule \(f(x)=\sum_i \alpha_i K(x_i,x)\)
5 A Small Worked Walkthrough
Take three training inputs on the real line:
\[ x_1 = 0, \qquad x_2 = 1, \qquad x_3 = 2, \]
with targets
\[ y_1 = 0, \qquad y_2 = 1, \qquad y_3 = 0.5. \]
Use the Gaussian kernel
\[ K(x,z) = e^{-(x-z)^2}. \]
Now evaluate a new point \(x=1.2\). Its similarities to the training points are
\[ K(1.2,0) \approx e^{-1.44} \approx 0.237, \]
\[ K(1.2,1) \approx e^{-0.04} \approx 0.961, \]
\[ K(1.2,2) \approx e^{-0.64} \approx 0.527. \]
If we normalize these into weights,
\[ w \approx (0.137,\;0.557,\;0.306), \]
then a simple similarity-weighted prediction is
\[ \hat{y} \approx 0.137(0) + 0.557(1) + 0.306(0.5) \approx 0.710. \]
The geometric point is the important one:
- the point at \(x=1\) influences the prediction most because it is most similar
- the farther points still contribute, but less
- the kernel has turned local similarity into a nonlinear prediction rule
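The same arithmetic can be checked in a few lines. A minimal sketch, assuming NumPy; it reproduces the similarities, weights, and prediction above.

```python
import numpy as np

# Reproduces the walkthrough: Gaussian similarities from x = 1.2 to the
# three training points, normalized into weights, then a
# similarity-weighted prediction.

train_x = np.array([0.0, 1.0, 2.0])
y = np.array([0.0, 1.0, 0.5])

x_new = 1.2
sims = np.exp(-(x_new - train_x) ** 2)   # K(1.2, x_i)
print(sims)                              # ~ [0.237, 0.961, 0.527]

weights = sims / sims.sum()              # normalize similarities
print(weights)                           # ~ [0.137, 0.557, 0.306]

y_hat = weights @ y                      # similarity-weighted prediction
print(y_hat)                             # ~ 0.710
```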
6 Implementation or Computation Note
Many kernel algorithms can be written entirely in terms of the Gram matrix
\[ G_{ij} = K(x_i,x_j). \]
That is the practical face of the kernel trick:
- you never need to construct \(\phi(x)\) explicitly
- you only need pairwise kernel evaluations
- a linear algorithm in feature space becomes a nonlinear algorithm in the original input space
The tradeoff is computational: kernel methods often require storing or manipulating an \(n \times n\) matrix, which becomes expensive for large datasets.
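As one concrete instance of a Gram-matrix-only algorithm, kernel ridge regression (named here explicitly; it is not derived above) fits the coefficients by solving \((G + \lambda I)\alpha = y\) and then predicts with the rule from Section 2. A minimal sketch, assuming NumPy; the ridge strength \(\lambda = 0.1\) is an arbitrary illustrative choice.

```python
import numpy as np

# Kernel ridge regression written entirely in terms of the Gram matrix:
# fit alpha by solving (G + lam * I) alpha = y, then predict with
# f(x) = sum_i alpha_i K(x_i, x). lam = 0.1 is an arbitrary
# illustrative regularization strength.

def gaussian_kernel(x, z):
    return np.exp(-(x - z) ** 2)

train_x = np.array([0.0, 1.0, 2.0])
y = np.array([0.0, 1.0, 0.5])
lam = 0.1

# Gram matrix of all pairwise kernel evaluations
G = gaussian_kernel(train_x[:, None], train_x[None, :])

# solve the regularized linear system for the coefficients
alpha = np.linalg.solve(G + lam * np.eye(len(train_x)), y)

# predict at a new point using only kernel evaluations
x_new = 1.2
k_new = gaussian_kernel(train_x, x_new)
print(k_new @ alpha)
```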
7 Failure Modes
- treating any informal similarity score as a valid kernel (a quick sanity check is sketched after this list)
- forgetting that bandwidth or scale choices can completely change behavior
- assuming the kernel trick makes models free, when Gram-matrix costs can be large
- confusing locality or similarity with causal explanation
- using a flexible kernel without enough regularization or validation
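For the first failure mode, one quick sanity check is that a valid kernel must produce a symmetric positive semidefinite Gram matrix on every sample. A minimal sketch, assuming NumPy; the tolerance guards against floating-point noise, and a clearly negative eigenvalue rules the similarity out as a kernel.

```python
import numpy as np

# Sanity check: a valid kernel yields a symmetric positive semidefinite
# Gram matrix on any sample. Negative eigenvalues beyond numerical noise
# mean the "similarity" is not a valid kernel.

def is_psd_gram(G, tol=1e-8):
    if not np.allclose(G, G.T):
        return False
    eigvals = np.linalg.eigvalsh(G)   # eigenvalues of a symmetric matrix
    return bool(eigvals.min() >= -tol)

train_x = np.array([0.0, 1.0, 2.0])
G = np.exp(-(train_x[:, None] - train_x[None, :]) ** 2)
print(is_psd_gram(G))   # True for the Gaussian kernel
```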
8 Paper Bridge
- CS229 Lecture 8: Kernels - First pass: official lecture bridge from linear classification to kernelized nonlinear decision rules. Checked 2026-04-24.
- CS229T Notes - Paper bridge: official theory notes pointing from practical kernel ideas toward RKHS and statistical learning theory. Checked 2026-04-24.
9 Sources and Further Reading
- CS229 Notes on SVMs and Kernel Methods - First pass: official Stanford notes explaining the kernel trick and the feature-space viewpoint. Checked 2026-04-24.
- CS229 Lecture 8: Kernels - First pass: official lecture page for seeing where kernels enter a standard ML curriculum. Checked 2026-04-24.
- CS229T Notes - Second pass: official theory notes that extend kernel ideas toward statistical learning theory. Checked 2026-04-24.
- CS229: Machine Learning - Second pass: current course hub for locating the broader ML context around kernels, SVMs, and generalization. Checked 2026-04-24.