Oversmoothing, Depth, and Graph Sampling

A bridge page showing why repeated graph propagation can wash node representations together, why deeper GNNs become hard to train, and how graph sampling helps with scale and locality.
Modified: April 26, 2026

Keywords

oversmoothing, graph neural networks, message passing, GraphSAGE, graph sampling

1 Application Snapshot

Deeper models help in many areas of ML, but graph neural networks often hit a different regime:

more layers can make node representations too similar

That phenomenon is called oversmoothing.

At the same time, deeper message passing expands the effective neighborhood of each node very quickly, so computation and noise both grow. That is one reason graph sampling became a core practical tool.

2 Problem Setting

A simple graph propagation layer can be written as

\[ H^{(\ell+1)} = P H^{(\ell)} W^{(\ell)}, \]

where:

  • \(H^{(\ell)}\) is the node-feature matrix at layer \(\ell\)
  • \(P\) is a normalized propagation matrix
  • \(W^{(\ell)}\) is a learned weight matrix

If we ignore the learned weights for a moment and focus only on propagation, repeated layers look like

\[ H^{(L)} \approx P^L H^{(0)}. \]
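As a concrete sketch, one such layer takes only a few lines of numpy. The 4-node graph, features, and weights below are made-up stand-ins for illustration, not taken from any particular model:

```python
import numpy as np

# Hypothetical 4-node path graph, adjacency with self-loops
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)

# Row-normalized (random-walk) propagation matrix P = D^{-1} A
P = A / A.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
H = rng.standard_normal((4, 2))   # node features H^{(l)}, 2 dims per node
W = rng.standard_normal((2, 2))   # weight matrix W^{(l)} (random here, learned in practice)

# One linear propagation layer: H^{(l+1)} = P H^{(l)} W^{(l)}
H_next = P @ H @ W
print(H_next.shape)  # (4, 2)
```

Each row of `P` sums to 1, so the propagation step is a weighted average over a node's neighborhood before the feature transform `W` is applied.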

So the same question from graph diffusion returns:

what happens when we keep applying the same graph-mixing operator?

3 Why This Math Appears

This page builds directly on earlier pages about graph diffusion and repeated application of a propagation operator.

So oversmoothing is not an isolated GNN pathology. It is what happens when a propagation operator keeps averaging away distinctions that the task still needs.

4 Math Objects In Use

  • graph propagation matrix \(P\)
  • repeated powers \(P^L\)
  • node-feature matrix \(H\)
  • neighborhood radius induced by depth
  • sampling or aggregation budget per node

5 A Small Worked Walkthrough

Take the same 3-node line-style propagation matrix used for diffusion intuition:

\[ P = \begin{bmatrix} \tfrac{1}{2} & \tfrac{1}{2} & 0 \\ \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \\ 0 & \tfrac{1}{2} & \tfrac{1}{2} \end{bmatrix}. \]

Start with one scalar feature at each node:

\[ H^{(0)} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}. \]

After one propagation step,

\[ H^{(1)} = P H^{(0)} = \begin{bmatrix} \tfrac{1}{2} \\ \tfrac{1}{3} \\ 0 \end{bmatrix}. \]

After two steps,

\[ H^{(2)} = P H^{(1)} = \begin{bmatrix} \tfrac{5}{12} \\ \tfrac{5}{18} \\ \tfrac{1}{6} \end{bmatrix}. \]

After many steps, the entries keep moving toward a shared smoothed pattern. The graph is mixing information across nodes.
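The iteration above can be checked numerically. This minimal numpy sketch applies the same 3-node matrix repeatedly and tracks the spread (max minus min) of the entries, which shrinks toward zero as all entries approach a common value:

```python
import numpy as np

# The 3-node propagation matrix from the walkthrough
P = np.array([[1/2, 1/2, 0],
              [1/3, 1/3, 1/3],
              [0,   1/2, 1/2]])
H = np.array([1.0, 0.0, 0.0])  # H^{(0)}

for step in range(1, 21):
    H = P @ H
    if step in (1, 2, 20):
        spread = H.max() - H.min()
        print(f"step {step:2d}: H = {np.round(H, 4)}, spread = {spread:.4f}")
```

Steps 1 and 2 reproduce the hand computation, and by step 20 all three entries sit near 2/7 ≈ 0.2857, the stationary weight the random walk assigns to the starting node.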

That mixing is useful at first:

  • local noise is reduced
  • neighboring structure is incorporated

But if we keep going, node representations can become too similar to distinguish. That is oversmoothing.

The second practical issue is neighborhood explosion. If each node aggregates from, say, \(d\) neighbors per layer, then \(L\) layers can expose information from roughly \(d^L\) local paths. For large graphs, this quickly becomes expensive.
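A quick back-of-the-envelope calculation shows how fast \(d^L\) grows; the fan-outs below are illustrative numbers, not measurements from any system:

```python
# Number of length-L aggregation paths with fan-out d per layer: d**L
for d, L in [(10, 2), (10, 4), (25, 2), (25, 4)]:
    print(f"fan-out d={d}, depth L={L}: about {d**L:,} paths per node")
```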

This is why methods like GraphSAGE sample only part of each neighborhood during training. Instead of aggregating all neighbors at each layer, the model may sample a fixed number such as:

  • 25 neighbors at layer 1
  • 10 neighbors at layer 2

This does not remove the underlying diffusion logic, but it keeps the computation manageable and the effective receptive field under control.
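A layer-wise neighbor sampler in that spirit can be sketched as follows. The adjacency list and the small fan-outs are hypothetical, and real implementations such as GraphSAGE batch this over many seed nodes at once:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_neighbors(adj, node, k, rng):
    """Sample up to k neighbors of `node` without replacement (sketch)."""
    nbrs = adj[node]
    if len(nbrs) <= k:
        return list(nbrs)
    return list(rng.choice(nbrs, size=k, replace=False))

# Hypothetical adjacency list; per-layer fan-outs stand in for 25 and 10
adj = {0: [1, 2, 3, 4, 5], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2], 4: [0], 5: [0]}
fanouts = [3, 2]

# Expand outward from a single seed node, layer by layer
frontier = {0}
for k in fanouts:
    nxt = set()
    for v in frontier:
        nxt.update(sample_neighbors(adj, v, k, rng))
    frontier = nxt
print(sorted(frontier))
```

Because each layer caps the fan-out, the sampled computation graph grows at most multiplicatively in the fan-outs rather than in the full node degrees.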

6 Implementation or Computation Note

In practical GNN design, depth is tied to three competing effects:

  • more layers let information travel farther
  • more layers can wash node features together
  • more layers increase memory and neighborhood cost

Common mitigations include:

  • residual or skip connections
  • normalization
  • jumping-knowledge style architectures
  • neighborhood sampling
  • restricting depth and using stronger features
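As one example from the list above, a residual connection re-injects the incoming features so repeated layers cannot fully average them away. A minimal sketch, with made-up shapes and a ReLU nonlinearity added for concreteness:

```python
import numpy as np

def gnn_layer(P, H, W, use_residual=True):
    """One propagation layer, optionally with an identity skip connection.
    Without the skip the output is ReLU(P H W); with it, H is added back,
    so each node's own signal survives the averaging step."""
    out = np.maximum(P @ H @ W, 0.0)  # ReLU(P H W)
    return H + out if use_residual else out

# Shape check only: 3 nodes, 2 feature dims (hypothetical)
P = np.array([[1/2, 1/2, 0],
              [1/3, 1/3, 1/3],
              [0,   1/2, 1/2]])
rng = np.random.default_rng(0)
H = rng.standard_normal((3, 2))
W = rng.standard_normal((2, 2))
print(gnn_layer(P, H, W).shape)  # (3, 2)
```

Note the skip connection requires the input and output feature dimensions to match; otherwise a projection on `H` is needed before the addition.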

Another response is to change the graph itself. Graph Rewiring, Homophily, and Heterophily looks at that option directly.

It is also important not to confuse this page’s main failure mode with Long-Range Dependence and Oversquashing in Graphs. Oversmoothing is about node states becoming too similar; oversquashing is about too much distant information being forced through too small a bottleneck.

So “make the GNN deeper” is not a neutral choice. It changes both the operator and the geometry of the learned representations.

7 Failure Modes

  • assuming deeper message passing is always better
  • confusing oversmoothing with every possible GNN failure mode
  • ignoring neighborhood explosion and training cost on large graphs
  • treating graph sampling as only an implementation trick rather than a modeling choice about locality
  • forgetting that recent analyses show oversmoothing depends on the propagation operator and initialization, not only on layer count

8 Paper Bridge

9 Sources and Further Reading
