Oversmoothing, Depth, and Graph Sampling
oversmoothing, graph neural networks, message passing, GraphSAGE, graph sampling
1 Application Snapshot
Deeper models help in many areas of ML, but graph neural networks often hit a different regime: more layers can make node representations too similar. That phenomenon is called oversmoothing.
At the same time, deeper message passing expands the effective neighborhood of each node very quickly, so computation and noise both grow. That is one reason graph sampling became a core practical tool.
2 Problem Setting
A simple graph propagation layer can be written as
\[ H^{(\ell+1)} = P H^{(\ell)} W^{(\ell)}, \]
where:
- \(H^{(\ell)}\) is the node-feature matrix at layer \(\ell\)
- \(P\) is a normalized propagation matrix
- \(W^{(\ell)}\) is a learned weight matrix
If we ignore the learned weights for a moment and focus only on propagation, repeated layers look like
\[ H^{(L)} \approx P^L H^{(0)}. \]
So the same question from graph diffusion returns:
what happens when we keep applying the same graph-mixing operator?
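To make the layer concrete, here is a minimal NumPy sketch of one propagation step and of pure repeated propagation. The graph, features, and weights are arbitrary stand-ins (a 3-node path graph with self-loops and random matrices), not taken from any particular dataset or library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Row-normalized propagation matrix P for a 3-node path graph with self-loops.
A = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]], dtype=float)
P = A / A.sum(axis=1, keepdims=True)

H = rng.normal(size=(3, 4))   # node-feature matrix H^{(0)}: 3 nodes, 4 channels
W = rng.normal(size=(4, 4))   # learned weight matrix W^{(0)} (random stand-in)

# One propagation layer: H^{(1)} = P H^{(0)} W^{(0)} (nonlinearity omitted).
H_next = P @ H @ W

# Ignoring the weights, L layers of pure propagation are just matrix powers.
L = 10
H_L = np.linalg.matrix_power(P, L) @ H

print(H_next.shape)
print(H_L.round(3))
```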
3 Why This Math Appears
This page lives directly on top of earlier pages:
- Graph Diffusion and Message Passing: message passing is learned propagation
- Spectral Modes in Consensus and Graphs: repeated graph updates are matrix powers
- Regularization, Implicit Bias, and Model Complexity: depth and architecture change which solutions and geometries the model prefers
So oversmoothing is not an isolated GNN pathology. It is what happens when a propagation operator keeps averaging away distinctions that the task still needs.
4 Math Objects In Use
- graph propagation matrix \(P\)
- repeated powers \(P^L\)
- node-feature matrix \(H\)
- neighborhood radius induced by depth
- sampling or aggregation budget per node
5 A Small Worked Walkthrough
Take the same 3-node line-style propagation matrix used for diffusion intuition:
\[ P = \begin{bmatrix} \tfrac{1}{2} & \tfrac{1}{2} & 0 \\ \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \\ 0 & \tfrac{1}{2} & \tfrac{1}{2} \end{bmatrix}. \]
Start with one scalar feature at each node:
\[ H^{(0)} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}. \]
After one propagation step,
\[ H^{(1)} = P H^{(0)} = \begin{bmatrix} \tfrac{1}{2} \\ \tfrac{1}{3} \\ 0 \end{bmatrix}. \]
After two steps,
\[ H^{(2)} = P H^{(1)} = \begin{bmatrix} \tfrac{5}{12} \\ \tfrac{5}{18} \\ \tfrac{1}{6} \end{bmatrix}. \]
After many steps, the entries keep moving toward a shared smoothed pattern. The graph is mixing information across nodes.
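The two steps above can be checked directly; this is a small NumPy sketch that reproduces the walkthrough numerically.

```python
import numpy as np

# The 3-node propagation matrix and starting feature from the walkthrough.
P = np.array([[1/2, 1/2, 0],
              [1/3, 1/3, 1/3],
              [0,   1/2, 1/2]])
H0 = np.array([1.0, 0.0, 0.0])

H1 = P @ H0   # [0.5,     0.3333, 0.0   ]  = [1/2, 1/3, 0]
H2 = P @ H1   # [0.4167,  0.2778, 0.1667]  = [5/12, 5/18, 1/6]
print(H1, H2)
```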
That mixing is useful at first:
- local noise is reduced
- neighboring structure is incorporated
But if we keep going, node representations can become too similar to distinguish. That is oversmoothing.
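Running the same propagation well past two steps makes the effect visible numerically. The spread measure below (max entry minus min entry) is just one illustrative way to track how distinguishable the nodes remain; it shrinks toward zero as the features collapse onto a shared value.

```python
import numpy as np

P = np.array([[1/2, 1/2, 0],
              [1/3, 1/3, 1/3],
              [0,   1/2, 1/2]])
H = np.array([1.0, 0.0, 0.0])

for step in range(1, 31):
    H = P @ H
    spread = H.max() - H.min()   # how distinguishable the nodes still are
    if step in (1, 2, 5, 10, 30):
        print(f"step {step:2d}: H = {np.round(H, 4)}, spread = {spread:.6f}")
```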
The second practical issue is neighborhood explosion. If each node aggregates from, say, \(d\) neighbors per layer, then \(L\) layers can expose information from roughly \(d^L\) local paths. For large graphs, this quickly becomes expensive.
This is why methods like GraphSAGE sample only part of each neighborhood during training. Instead of aggregating all neighbors at each layer, the model may sample a fixed number such as:
- 25 neighbors at layer 1
- 10 neighbors at layer 2
This does not remove the underlying diffusion logic, but it keeps the computation manageable and the effective receptive field better controlled.
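A minimal sketch of that idea, fixed-size neighbor sampling in the GraphSAGE style, is below. The adjacency lists, budgets, and the `sample_receptive_field` helper are illustrative assumptions; production implementations handle batching, deduplication, and the aggregation pass itself differently.

```python
import random

def sample_receptive_field(adj, targets, budgets, seed=0):
    """Expand a sampled receptive field: at each hop, keep at most
    budgets[l] neighbors per node instead of the full neighborhood."""
    rng = random.Random(seed)
    frontier = list(targets)
    layers = [frontier]
    for budget in budgets:
        next_frontier = []
        for node in frontier:
            neighbors = adj.get(node, [])
            k = min(budget, len(neighbors))
            next_frontier.extend(rng.sample(neighbors, k))
        frontier = next_frontier
        layers.append(frontier)
    return layers

# Tiny illustrative graph as adjacency lists (not a real dataset).
adj = {
    0: [1, 2, 3, 4],
    1: [0, 2, 5],
    2: [0, 1, 6, 7],
    3: [0, 7],
    4: [0, 5, 6],
    5: [1, 4],
    6: [2, 4],
    7: [2, 3],
}

# Two layers with per-layer budgets, mirroring the 25/10 pattern at small scale.
layers = sample_receptive_field(adj, targets=[0], budgets=[3, 2])
for hop, nodes in enumerate(layers):
    print(f"hop {hop}: {len(nodes)} sampled node(s)")
```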
6 Implementation or Computation Note
In practical GNN design, depth is tied to three competing effects:
- more layers let information travel farther
- more layers can wash node features together
- more layers increase memory and neighborhood cost
Common mitigations include:
- residual or skip connections (see the sketch after this list)
- normalization
- jumping-knowledge style architectures
- neighborhood sampling
- restricting depth and using stronger features
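As a concrete shape for the first mitigation, here is a minimal residual propagation layer in NumPy, \(H^{(\ell+1)} = H^{(\ell)} + \sigma(P H^{(\ell)} W^{(\ell)})\). The tanh nonlinearity, the random weights, and the `residual_layer` helper are illustrative choices, not a specific library's layer; the point is only that the skip term keeps a copy of each node's own features, so repeated layers cannot fully average node identities away.

```python
import numpy as np

def residual_layer(P, H, W):
    # H_{l+1} = H_l + tanh(P H_l W_l): smoothed message added to, not
    # substituted for, the current node features.
    return H + np.tanh(P @ H @ W)

rng = np.random.default_rng(0)
P = np.array([[1/2, 1/2, 0],
              [1/3, 1/3, 1/3],
              [0,   1/2, 1/2]])
H = rng.normal(size=(3, 4))
W = rng.normal(size=(4, 4)) * 0.1   # small weights keep tanh unsaturated

for _ in range(8):
    H = residual_layer(P, H, W)

# Node rows remain distinguishable even after 8 layers.
print(np.round(H, 3))
```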
Another response is to change the graph itself. Graph Rewiring, Homophily, and Heterophily looks at that option directly.
It is also important not to confuse this page’s main failure mode with Long-Range Dependence and Oversquashing in Graphs. Oversmoothing is about node states becoming too similar; oversquashing is about too much distant information being forced through too small a bottleneck.
So “make the GNN deeper” is not a neutral choice. It changes both the operator and the geometry of the learned representations.
7 Failure Modes
- assuming deeper message passing is always better
- confusing oversmoothing with every possible GNN failure mode
- ignoring neighborhood explosion and training cost on large graphs
- treating graph sampling as only an implementation trick rather than a modeling choice about locality
- forgetting that recent results show oversmoothing is nuanced and depends on the operator and initialization, not only on layer count
8 Paper Bridge
- Inductive Representation Learning on Large Graphs - First pass: GraphSAGE is the classic bridge for understanding sampled neighborhood aggregation in large graphs. Checked 2026-04-24.
- A Non-Asymptotic Analysis of Oversmoothing in Graph Neural Networks - Paper bridge: theory-facing paper that makes the depth/oversmoothing tradeoff explicit. Checked 2026-04-24.
9 Sources and Further Reading
- CS224W | Machine Learning with Graphs - First pass: official Stanford graph-learning course hub for the broader context around sampling, propagation, and GNN design. Checked 2026-04-24.
- Semi-Supervised Classification with Graph Convolutional Networks - First pass: canonical GCN paper for the basic propagation layer that makes oversmoothing intuitive. Checked 2026-04-24.
- Inductive Representation Learning on Large Graphs - First pass: GraphSAGE paper for sampled message passing on large graphs. Checked 2026-04-24.
- A Non-Asymptotic Analysis of Oversmoothing in Graph Neural Networks - Second pass: precise modern analysis of why oversmoothing appears at finite depth. Checked 2026-04-24.
- Graph Neural Networks Do Not Always Oversmooth - Paper bridge: useful current nuance showing the story depends on the regime and initialization. Checked 2026-04-24.