Gradient Descent¶
This lane exists as a textbook-breadth advanced lane, not as a claim that competitive programming now needs a full numerical-optimization or deep-learning stack.
The repo's exact first route is intentionally narrow:
- one differentiable loss
- one feature `x`
- one scalar prediction `wx + b`
- one full-batch gradient
- one fixed learning rate
- one fixed number of epochs
That means this page is not:
- a deep-learning page
- a stochastic-optimization page
- a momentum / Adam / RMSProp page
- a line-search or second-order optimization page
- a general unconstrained optimization curriculum
It is the first clean algorithmic route for:
- smooth differentiable loss minimization
- batch gradient descent
- learning-rate sensitivity
- one deterministic update loop for linear regression
with Linear Regression Gradient Descent Benchmark as the flagship benchmark.
This page sits next to:
- Machine Learning Algorithms because this lane is the smooth-loss sibling of the perceptron route
- Optimization And Duality because that page owns benchmark / certificate language rather than iterative descent updates
- Algorithm Engineering when the real problem is scaling, numerics, or systems behavior rather than the first update rule itself
At A Glance¶
- Use when:
    - you want the first honest route for minimizing one smooth loss by repeated gradient steps
    - the benchmark is fixed-step batch gradient descent
    - one-feature linear regression is enough to teach the update rule
    - the real lesson is `gradient -> step -> iterate`, not dual certificates or one closed-form exact solver
- Prerequisites:
    - Reasoning
    - Machine Learning Algorithms only as a compare-point sibling, not a strict prerequisite
- Boundary with nearby pages:
    - use this page when the exact first route is batch gradient descent on squared loss
    - use Machine Learning Algorithms when the route is perceptron on separable data with mistake-driven updates
    - use Optimization And Duality when the missing piece is a benchmark, lower bound, or certificate instead of an update rule
- Strongest cues:
    - `loss`, `learning rate`, `epochs`, `gradient`
    - convergence or divergence depending on step size
- Strongest anti-cues:
    - the task really wants a closed-form exact solver and the update loop teaches nothing
    - the real topic is stochastic gradient descent or modern optimizer variants
    - the loss is nondifferentiable and the route really needs subgradients or another method
    - the lane would have to pretend this is high-ROI contest material
- Success after studying this page:
    - you can write the gradient update rule correctly
    - you know why the sign is `-gradient`, not `+gradient`
    - you know why feature scale and learning rate matter
    - you know why this lane is breadth, not CP-core
Problem Model And Notation¶
The exact starter route in this repo uses one-feature linear regression: the prediction for input `x` is `wx + b`.

The loss is the half mean squared error:

$$L(w, b) = \frac{1}{2n} \sum_{i=1}^{n} \left( w x_i + b - y_i \right)^2$$

Its batch gradients are:

$$\frac{\partial L}{\partial w} = \frac{1}{n} \sum_{i=1}^{n} \left( w x_i + b - y_i \right) x_i,
\qquad
\frac{\partial L}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} \left( w x_i + b - y_i \right)$$

The update step with learning rate `alpha > 0` is:

$$w \leftarrow w - \alpha \frac{\partial L}{\partial w},
\qquad
b \leftarrow b - \alpha \frac{\partial L}{\partial b}$$

The exact repo route starts from `w = 0`, `b = 0`
and repeats the batch update for a fixed number of epochs.
This is deliberate.
Why:
- it keeps the route deterministic
- it keeps the update rule completely explicit
- it avoids pretending the first lane is already about optimizer engineering or nonconvex training
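To make the signs concrete, here is one hand-worked batch update on a single illustrative point `(x, y) = (1, 3)` with `alpha = 0.1` (both are example values, not part of the repo contract):

$$
\begin{aligned}
r &= wx + b - y = 0 \cdot 1 + 0 - 3 = -3 \\
\frac{\partial L}{\partial w} &= r \, x = -3, \qquad \frac{\partial L}{\partial b} = r = -3 \\
w &\leftarrow 0 - 0.1 \cdot (-3) = 0.3, \qquad b \leftarrow 0 - 0.1 \cdot (-3) = 0.3
\end{aligned}
$$

The prediction at `x = 1` moves from `0` to `0.6`, toward the target `3`: stepping against the gradient reduced the loss.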
Why This Exists Next To Machine Learning Algorithms¶
Machine Learning Algorithms teaches:
- hard classification
- mistake-driven updates
- linear separability
- the perceptron
This page teaches a different first lens:
- smooth loss
- full-batch gradients
- learning-rate choice
- descent over a real-valued objective
So the division of labor is:
- machine-learning page: perceptron and separability
- gradient-descent page: smooth loss minimization through deterministic gradient steps
Core Route In This Repo¶
The exact route is:
- initialize `w = 0`, `b = 0`
- for each epoch:
    - compute the full-batch gradients on all examples
    - update `w` and `b` once
- return the final parameters (see the sketch after this list)
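A minimal C++ sketch of that loop (the function name and signature are illustrative, not the repo's exact starter-template contract):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Full-batch gradient descent for one-feature linear regression.
// Minimizes the half mean squared error of the model w*x + b.
// alpha is the fixed learning rate, epochs the fixed iteration count.
std::pair<double, double> fit_linear_1d(const std::vector<double>& xs,
                                        const std::vector<double>& ys,
                                        double alpha, int epochs) {
    const std::size_t n = xs.size();
    double w = 0.0, b = 0.0;  // the exact repo route starts from zero
    for (int epoch = 0; epoch < epochs; ++epoch) {
        // Accumulate the gradient over all examples before touching
        // the parameters: this is the full-batch invariant.
        double grad_w = 0.0, grad_b = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            const double residual = w * xs[i] + b - ys[i];
            grad_w += residual * xs[i];
            grad_b += residual;
        }
        grad_w /= static_cast<double>(n);
        grad_b /= static_cast<double>(n);
        // One update per epoch, stepping against the gradient.
        w -= alpha * grad_w;
        b -= alpha * grad_b;
    }
    return {w, b};
}
```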
The exact starter contract is intentionally narrow:
- one feature only
- one affine model `wx + b`
- one squared loss only
- one fixed learning rate
- one full-batch update per epoch
The exact non-promises matter:
- no stochastic or mini-batch route
- no momentum / Adam / RMSProp
- no logistic regression
- no automatic convergence detection
- no claim that this is the right contest tool for ordinary offline problems
Core Invariants¶
1. The Update Goes Against The Gradient¶
The gradient points toward local increase.
So the step must be:

$$w \leftarrow w - \alpha \frac{\partial L}{\partial w},
\qquad
b \leftarrow b - \alpha \frac{\partial L}{\partial b}$$

Using `+gradient` is ascent, not descent.
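A single first-order Taylor expansion justifies the sign (a standard argument, written here for one parameter):

$$L\big(w - \alpha L'(w)\big) \approx L(w) - \alpha \, L'(w)^2 \le L(w)$$

For small enough `alpha > 0` the step cannot increase the loss to first order, while `+gradient` flips the sign of the middle term and climbs instead.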
2. Batch Gradient Means All Examples Contribute Before Each Step¶
This exact route is not online and not stochastic.
Each epoch:
- scans every training example
- sums their contribution
- performs one update
That full-batch invariant is the whole starter route.
3. The Learning Rate Is Part Of The Algorithm¶
Too small:
- convergence is slow
Too large:
- the loss may oscillate or diverge
The first route must say this out loud because the update rule is meaningless without step-size discipline.
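For this quadratic loss the threshold is even exact (a standard fact about fixed-step gradient descent on quadratics; the one-parameter form below freezes the bias for illustration). With curvature $c = \frac{1}{n}\sum_i x_i^2$, the update on `w` is affine:

$$w \leftarrow w - \alpha \left( c\,w - \frac{1}{n}\sum_i x_i y_i \right) = (1 - \alpha c)\, w + \text{const}$$

This iteration contracts exactly when `|1 - alpha * c| < 1`, i.e. when `0 < alpha < 2 / c`. Larger feature values raise `c` and shrink the range of safe learning rates, which is why feature scale and learning rate are coupled.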
Worked Example: Linear Regression Gradient Descent Benchmark¶
Use Linear Regression Gradient Descent Benchmark.
Take points on the exact line `y = 2x + 1`:

- `(0, 1)`
- `(1, 3)`
- `(2, 5)`
- `(3, 7)`

The ideal parameters are `w = 2`, `b = 1`.
Starting from zero and using a moderate learning rate, batch gradient descent moves the parameters toward that optimum over repeated epochs.
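A minimal driver for the sketch from Core Route In This Repo, assuming that `fit_linear_1d` helper is in scope (`alpha = 0.1` and `epochs = 1000` are illustrative values, not the benchmark's fixed settings):

```cpp
#include <cstdio>
#include <utility>
#include <vector>

// Defined in the Core Route sketch above (illustrative helper).
std::pair<double, double> fit_linear_1d(const std::vector<double>& xs,
                                        const std::vector<double>& ys,
                                        double alpha, int epochs);

int main() {
    // The benchmark points on y = 2x + 1.
    std::vector<double> xs = {0.0, 1.0, 2.0, 3.0};
    std::vector<double> ys = {1.0, 3.0, 5.0, 7.0};
    auto [w, b] = fit_linear_1d(xs, ys, 0.1, 1000);
    std::printf("w = %.6f, b = %.6f\n", w, b);  // settles near w = 2, b = 1
    return 0;
}
```

With these values the step size sits well inside the stable range from the learning-rate invariant above, so the parameters settle near `w = 2`, `b = 1`.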
The benchmark is intentionally about:
- writing the gradient correctly
- updating in the right direction
- seeing stable convergence on one smooth, tiny loss
not about modern large-scale ML practice.
Variant Chooser¶
| Situation | Best first move | Why it fits | Where it fails |
|---|---|---|---|
| one smooth differentiable loss and a deterministic update loop are the whole lesson | use this lane | gradient descent is exactly the intended first route | weak if the real task needs certificates, exact closed forms, or advanced optimizers |
| one hard linear classifier with separable labels is the intended benchmark | use Machine Learning Algorithms | the perceptron is the right first route there | weak if the real structure is smooth loss minimization |
| the proof is benchmark / certificate driven | use Optimization And Duality | that page owns the witness language | weak if the missing piece is still just the update rule |
| the real problem is scaling, numerics, or systems performance | use Algorithm Engineering | that page is better for engineering questions | weak if the route is still first-order optimization basics |
Complexity And Tradeoffs¶
For n samples and T epochs, the exact route is:
- time = `O(Tn)`
- memory = `O(1)` beyond the input arrays
The main tradeoff is:
- very explicit smooth-loss update mechanics versus
- zero pretense that this is already modern optimization practice
That is why the lane belongs in advanced and breadth.
Main Traps¶
- using `+gradient` instead of `-gradient`
- forgetting the bias gradient
- mixing batch gradient descent with stochastic or mini-batch language
- choosing a learning rate that is so large the benchmark diverges
- pretending this first route already covers logistic regression or Adam
Reopen Path¶
- Topic page: Gradient Descent
- Practice ladder: Gradient Descent ladder
- Starter template: gradient-descent-linear-regression-1d.cpp
- Notebook refresher: Gradient Descent hot sheet
- Compare points:
- Machine Learning Algorithms
- Optimization And Duality
- Linear Regression Gradient Descent Benchmark