Backpropagation and Computation Graphs
Keywords: backpropagation, computation graph, chain rule, neural networks, reverse mode autodiff
1 Application Snapshot
Deep learning became practical not only because we can write large models, but because we can compute their gradients efficiently.
That efficiency comes from two connected ideas:
- represent the model and loss as a computation graph
- apply the chain rule in reverse to reuse intermediate derivative information
That reverse pass is backpropagation.
2 Problem Setting
Suppose a model takes an input \(x\), produces a prediction \(\hat{y}\), and then incurs a loss
\[ L(\hat{y}, y). \]
Even in a small neural network, the prediction is built by many nested operations:
\[ x \;\to\; z_1 \;\to\; a_1 \;\to\; z_2 \;\to\; a_2 \;\to\; L. \]
Training requires gradients of the loss with respect to parameters such as weight matrices and bias vectors.
The naive approach would be to differentiate the loss with respect to each parameter separately, from scratch. Backpropagation avoids that waste by reusing local derivatives along the graph.
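To make the composition concrete, here is a minimal forward-pass sketch in Python. It assumes a two-layer network with a tanh nonlinearity and a squared-error loss; the shapes and the names W1, b1, W2, b2 are illustrative choices, not prescribed by the text above.

```python
import numpy as np

# A minimal sketch of the nested composition x -> z1 -> a1 -> z2 -> a2 -> L.
# Shapes, the tanh nonlinearity, and the names W1, b1, W2, b2 are assumptions
# made for illustration.
rng = np.random.default_rng(0)
x = rng.normal(size=3)                  # input
y = np.array([1.0])                     # target
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)

z1 = W1 @ x + b1                        # affine map
a1 = np.tanh(z1)                        # elementwise nonlinearity
z2 = W2 @ a1 + b2                       # second affine map
a2 = z2                                 # identity output head
L = 0.5 * np.sum((a2 - y) ** 2)         # scalar loss
```

Training needs the gradient of this scalar L with respect to every entry of W1, b1, W2, and b2, which is exactly what the backward pass delivers.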
3 Why This Math Appears
This page sits exactly at the meeting point of several math layers already on the site:
- Linear algebra: layers are built from affine maps and matrix-vector products
- Optimization: training updates need gradients of the objective
- Logic and proofs: the dependency structure matters, and hidden assumptions about composition must be made explicit
- Empirical risk: the loss being differentiated is part of the objective introduced in Supervised Learning, Losses, and Empirical Risk
So backpropagation is not a mysterious deep-learning ritual. It is a structured application of:
- composition of maps
- the chain rule
- efficient bookkeeping on a directed acyclic graph
4 Math Objects In Use
- parameter matrices and bias vectors
- intermediate activations
- scalar loss \(L\)
- local derivatives at each node
- gradients such as \(\nabla_W L\) and \(\nabla_b L\)
5 A Small Worked Walkthrough
Consider the smallest useful scalar network:
\[ u = wx + b, \qquad h = \sigma(u), \qquad L = \frac{1}{2}(h-y)^2. \]
This is already a computation graph:
- affine step: \(u = wx+b\)
- nonlinearity: \(h=\sigma(u)\)
- loss: \(L=\frac{1}{2}(h-y)^2\)
To update \(w\) and \(b\), we need \(\frac{\partial L}{\partial w}\) and \(\frac{\partial L}{\partial b}\).
Backpropagation starts at the loss and moves backward.
First,
\[ \frac{\partial L}{\partial h} = h-y. \]
Then move through the nonlinearity:
\[ \frac{\partial L}{\partial u} = \frac{\partial L}{\partial h}\frac{\partial h}{\partial u} = (h-y)\sigma'(u). \]
Now differentiate the affine step:
\[ \frac{\partial u}{\partial w} = x, \qquad \frac{\partial u}{\partial b} = 1. \]
So
\[ \frac{\partial L}{\partial w} = \frac{\partial L}{\partial u}\frac{\partial u}{\partial w} = (h-y)\sigma'(u)\,x \]
and
\[ \frac{\partial L}{\partial b} = \frac{\partial L}{\partial u}\frac{\partial u}{\partial b} = (h-y)\sigma'(u). \]
The key point is not the formula itself. It is the pattern:
- compute local derivatives once
- propagate the loss sensitivity backward
- reuse the intermediate quantity \(\frac{\partial L}{\partial u}\) for multiple parameters
That reuse is what scales to large networks.
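The walkthrough translates directly into code. The sketch below assumes \(\sigma\) is the logistic sigmoid and checks the analytic gradient for \(w\) against a central finite difference; all numeric values are made up for illustration.

```python
import math

def forward(w, b, x, y):
    """Forward pass: u = w*x + b, h = sigma(u), L = 0.5*(h - y)**2."""
    u = w * x + b
    h = 1.0 / (1.0 + math.exp(-u))      # sigma assumed to be the logistic sigmoid
    return u, h, 0.5 * (h - y) ** 2

def backward(w, b, x, y):
    """Backward pass: compute dL/du once and reuse it for both parameters."""
    u, h, L = forward(w, b, x, y)
    dL_dh = h - y
    dL_du = dL_dh * h * (1.0 - h)       # (h - y) * sigma'(u), computed once
    return dL_du * x, dL_du * 1.0       # dL/dw, dL/db

# Central finite-difference check on dL/dw with made-up values.
w, b, x, y, eps = 0.3, -0.1, 0.8, 1.0, 1e-6
dL_dw, dL_db = backward(w, b, x, y)
num_dw = (forward(w + eps, b, x, y)[2] - forward(w - eps, b, x, y)[2]) / (2 * eps)
print(dL_dw, num_dw)                    # should agree to several decimal places
```

The finite-difference comparison is a cheap sanity check for hand-derived gradients; it scales poorly, which is precisely why backpropagation exists.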
6 Implementation or Computation Note
Modern frameworks usually perform this bookkeeping automatically, but the underlying mechanism is the same.
A computation graph stores:
- forward values
- graph dependencies
- local derivative rules
Then reverse-mode automatic differentiation propagates adjoints backward from the loss.
This is why the cost of computing all parameter gradients is often only a constant factor larger than the cost of the forward evaluation itself, rather than a separate full derivative computation per parameter.
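As a hedged illustration of that mechanism, here is a minimal scalar reverse-mode autodiff sketch, not any real framework's API: each node stores its forward value, its parents, and local derivative rules, and the backward pass pushes adjoints through the graph in reverse topological order.

```python
class Node:
    """A scalar node in a computation graph."""
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value                # forward value
        self.parents = parents            # graph dependencies
        self.local_grads = local_grads    # d(self)/d(parent) for each parent
        self.adjoint = 0.0                # dL/d(self), filled in by backward()

def add(a, b):
    return Node(a.value + b.value, (a, b), (1.0, 1.0))

def mul(a, b):
    return Node(a.value * b.value, (a, b), (b.value, a.value))

def backward(loss):
    """Propagate adjoints from the loss through the graph in reverse order."""
    order, seen = [], set()
    def visit(node):                      # depth-first topological sort
        if id(node) not in seen:
            seen.add(id(node))
            for parent in node.parents:
                visit(parent)
            order.append(node)
    visit(loss)
    loss.adjoint = 1.0
    for node in reversed(order):          # each adjoint is complete before use
        for parent, local in zip(node.parents, node.local_grads):
            parent.adjoint += node.adjoint * local   # chain rule step

# Tiny example with made-up values: L = (w*x + b)^2
w, x, b = Node(2.0), Node(3.0), Node(1.0)
u = add(mul(w, x), b)
L = mul(u, u)
backward(L)
print(w.adjoint, b.adjoint)   # dL/dw = 2*u*x = 42.0, dL/db = 2*u = 14.0
```

One backward sweep fills in the adjoints of every node, so all parameter gradients come out of a single traversal of the graph.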
In practice, this introduces engineering constraints too:
- intermediate activations must often be stored for the backward pass
- memory can become a bottleneck in large models
- unstable derivatives can lead to exploding or vanishing gradients
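As a toy illustration of the last point, assume a chain of weightless sigmoid layers, so every local derivative is \(\sigma'(u) \le 1/4\); the backward product then shrinks roughly geometrically with depth.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

# Hypothetical chain of 30 sigmoid layers with no weights: each backward step
# multiplies by sigma'(u) = h*(1-h) <= 0.25, so the gradient shrinks fast.
h, grad = 0.5, 1.0
for _ in range(30):
    h = sigmoid(h)             # forward through one layer
    grad *= h * (1.0 - h)      # accumulate the local derivative sigma'(u)
print(grad)                    # a vanishingly small number
```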
7 Failure Modes
- treating backpropagation as if it were different from the chain rule rather than an efficient implementation of it
- forgetting that gradients depend on the full composition, not only on one local layer
- numerical instability from exploding or vanishing gradients
- confusing the computation graph of the model with the training loop around it
- assuming autodiff removes the need to understand the objective and parameterization
8 Paper Bridge
- CS229: Additional Notes on Backpropagation - First pass - official focused note on backprop as the practical gradient engine of neural networks. Checked 2026-04-24.
- CS229 Deep Learning Notes - Second pass - official notes connecting backpropagation to larger network structures and autodiff language. Checked 2026-04-24.
9 Sources and Further Reading
- CS229: Additional Notes on Backpropagation - First pass - official Stanford note that explains the algorithm directly. Checked 2026-04-24.
- CS229 Deep Learning Notes - First pass - official notes connecting computation graphs, chain rule structure, and neural-network training. Checked 2026-04-24.
- CS 189 course page - Second pass - official Berkeley ML course showing where neural networks and backprop live in a math-heavy ML curriculum. Checked 2026-04-24.
- Dive into Deep Learning - Second pass - broad open deep learning text useful once the computation-graph picture is clear. Checked 2026-04-24.