Unconstrained First-Order Methods
gradient descent, first-order methods, smoothness, strong convexity, stochastic gradient descent
1 Role
This page is the first algorithm page of the optimization module.
It explains how the geometry of convex functions turns into practical iterative methods that use only gradient or subgradient information.
2 First-Pass Promise
Read this page after Convex Functions and Subgradients.
If you stop here, you should still understand:
- the basic gradient descent update
- why step size matters
- what smoothness buys you
- why strong convexity improves convergence
- how SGD and adaptive methods fit into the same first-order family
3 Why It Matters
Optimization becomes useful only when the mathematics can be turned into an algorithm.
First-order methods do exactly that. They use local linear information, such as gradients or subgradients, to produce a sequence of improving iterates. This makes them attractive when:
- the variable dimension is large
- Hessians are too expensive to form or solve with
- data arrive in streams or minibatches
- the loss has simple first-order structure but complicated global geometry
That is why first-order methods dominate modern machine learning and large-scale scientific optimization. Even when second-order or interior-point methods exist, first-order methods are often the only practical baseline at scale.
4 Prerequisite Recall
- for a differentiable convex function, the gradient gives a global affine lower bound
- a subgradient plays the same supporting role at nonsmooth points
- unconstrained optimality for differentiable convex problems is \(\nabla f(x^\star)=0\)
- unconstrained optimality for nonsmooth convex problems is \(0 \in \partial f(x^\star)\)
5 Intuition
If a differentiable convex function has gradient \(\nabla f(x)\) at the current point, then that gradient points in the direction of steepest local increase.
So the simplest idea is:
move a little in the opposite direction
This produces the gradient descent update.
The subtlety is the step size. If the step is too small, progress is slow. If it is too large, the method can oscillate or even diverge. Smoothness controls how inaccurate the local linear model can be, and therefore how aggressively we are allowed to step.
Strong convexity adds another ingredient: curvature that rules out flat valleys. This makes the method contract toward the minimizer at a geometric rate instead of drifting there slowly.
6 Formal Core
Definition 1 (Gradient Descent) For a differentiable objective \(f : \mathbb{R}^n \to \mathbb{R}\), the gradient descent update is
\[ x_{k+1} = x_k - \eta_k \nabla f(x_k), \]
where \(\eta_k > 0\) is the step size, also called the learning rate.
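As a minimal sketch of this update (my own illustration in Python, not code from the cited sources; the quadratic objective and step size are arbitrary choices):

```python
import numpy as np

def gradient_descent(grad, x0, eta=0.1, num_steps=100):
    """Fixed-step gradient descent: x_{k+1} = x_k - eta * grad(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        x = x - eta * grad(x)
    return x

# Illustrative use: minimize f(x) = 0.5 * ||x||_2^2, whose gradient is x.
print(gradient_descent(grad=lambda x: x, x0=[4.0, -2.0]))  # approaches the minimizer [0, 0]
```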
Definition 2 (L-Smoothness) A differentiable function \(f\) is \(L\)-smooth if its gradient is Lipschitz:
\[ \|\nabla f(x) - \nabla f(y)\|_2 \le L \|x-y\|_2 \quad \text{for all } x,y. \]
Intuitively, smoothness bounds how quickly the gradient can change.
For first-pass optimization, the most important consequence of smoothness is the descent lemma.
Theorem 1 (Descent Guarantee) If \(f\) is differentiable and \(L\)-smooth, then for every \(x\) and every step size \(\eta\),
\[ f(x-\eta \nabla f(x)) \le f(x) - \eta \left(1-\frac{L\eta}{2}\right)\|\nabla f(x)\|_2^2. \]
In particular, if \(0<\eta\le 1/L\), then
\[ f(x-\eta \nabla f(x)) \le f(x) - \frac{\eta}{2}\|\nabla f(x)\|_2^2. \]
So the objective decreases whenever the gradient is nonzero.
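A quick numerical sanity check of the \(0<\eta\le 1/L\) case (my own sketch; the random quadratic below is an arbitrary test function whose smoothness constant is the largest Hessian eigenvalue):

```python
import numpy as np

rng = np.random.default_rng(0)

# Test function f(x) = 0.5 * x^T A x with A symmetric positive definite.
# Its gradient is A x and its smoothness constant is L = lambda_max(A).
B = rng.normal(size=(5, 5))
A = B @ B.T + np.eye(5)
L = np.linalg.eigvalsh(A).max()

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = rng.normal(size=5)
eta = 1.0 / L
lhs = f(x - eta * grad(x))
rhs = f(x) - 0.5 * eta * np.linalg.norm(grad(x)) ** 2
print(lhs <= rhs)  # True: the descent guarantee holds for eta <= 1/L
```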
Definition 3 (Strong Convexity) A differentiable function \(f\) is \(\mu\)-strongly convex if for all \(x,y\),
\[ f(y) \ge f(x) + \nabla f(x)^\top (y-x) + \frac{\mu}{2}\|y-x\|_2^2 \]
for some \(\mu>0\).
Strong convexity means the function has uniform curvature from below.
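For twice-differentiable functions, \(L\) upper-bounds and \(\mu\) lower-bounds the Hessian eigenvalues; for a quadratic the tight constants are simply the largest and smallest eigenvalues. A minimal sketch (my own illustration) using the quadratic from the worked example below:

```python
import numpy as np

# Hessian of f(x1, x2) = 0.5 * (x1^2 + 9 * x2^2), the worked example below.
H = np.diag([1.0, 9.0])
eigenvalues = np.linalg.eigvalsh(H)

mu, L = eigenvalues.min(), eigenvalues.max()
print("strong convexity constant mu:", mu)   # 1.0
print("smoothness constant L:", L)           # 9.0
print("condition number L / mu:", L / mu)    # 9.0
```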
Proposition 1 (Convergence Pattern) For smooth convex minimization, gradient descent with step size \(\eta \le 1/L\) achieves a sublinear \(O(1/k)\) rate in function value.
If the objective is both \(L\)-smooth and \(\mu\)-strongly convex, then fixed-step gradient descent with an appropriate step size converges linearly (geometrically) to the unique minimizer.
This is the main first-pass takeaway:
- convex + smooth gives reliable descent
- convex + smooth + strongly convex gives much faster contraction
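A one-dimensional illustration of the gap (my own sketch; the two test functions are arbitrary): \(f(x)=x^2\) is smooth and strongly convex, while \(f(x)=x^4\) is convex and smooth near the start but has no strong convexity at its minimizer, so fixed-step gradient descent slows to a crawl.

```python
import numpy as np

def run(grad, x0=1.0, eta=0.1, steps=50):
    """Fixed-step gradient descent in one dimension, returning the iterate history."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - eta * grad(xs[-1]))
    return np.array(xs)

xs_strong = run(lambda x: 2 * x)        # f(x) = x^2: strongly convex
xs_weak = run(lambda x: 4 * x ** 3)     # f(x) = x^4: not strongly convex at 0

for k in (10, 25, 50):
    print(k, xs_strong[k] ** 2, xs_weak[k] ** 4)
# The x^2 values shrink by the same factor every step (linear/geometric convergence);
# the x^4 values keep decreasing but ever more slowly (sublinear convergence).
```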
7 Worked Example
Consider the quadratic function
\[ f(x_1,x_2) = \frac12(x_1^2 + 9x_2^2). \]
Its gradient is
\[ \nabla f(x_1,x_2) = \begin{bmatrix} x_1 \\ 9x_2 \end{bmatrix}. \]
So gradient descent with fixed step size \(\eta\) becomes
\[ \begin{bmatrix} x_{1,k+1} \\ x_{2,k+1} \end{bmatrix} = \begin{bmatrix} x_{1,k} \\ x_{2,k} \end{bmatrix} - \eta \begin{bmatrix} x_{1,k} \\ 9x_{2,k} \end{bmatrix} = \begin{bmatrix} (1-\eta)x_{1,k} \\ (1-9\eta)x_{2,k} \end{bmatrix}. \]
This makes the dynamics completely visible.
If \(\eta\) is too large, the second coordinate can overshoot badly because the curvature in the \(x_2\) direction is much larger.
Take \(\eta = 0.1\). Then
\[ x_{1,k+1}=0.9x_{1,k}, \qquad x_{2,k+1}=0.1x_{2,k}. \]
So the second coordinate shrinks much faster than the first.
Take \(\eta = 0.25\). Then
\[ x_{1,k+1}=0.75x_{1,k}, \qquad x_{2,k+1}=-1.25x_{2,k}. \]
Now the second coordinate flips sign and grows in magnitude, so the method is unstable in that direction.
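Here the stable range is \(0 < \eta < 2/9 \approx 0.22\), set entirely by the larger curvature. A minimal simulation of the two step sizes (my own sketch):

```python
import numpy as np

def gd_on_example(eta, x0=(1.0, 1.0), steps=20):
    """Gradient descent on f(x1, x2) = 0.5 * (x1^2 + 9 * x2^2) from the worked example."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x = x - eta * np.array([x[0], 9.0 * x[1]])
    return x

print(gd_on_example(eta=0.10))  # both coordinates decay toward (0, 0)
print(gd_on_example(eta=0.25))  # second coordinate grows: |1 - 9 * 0.25| = 1.25 > 1
```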
This is the key lesson:
- anisotropic curvature makes some directions harder than others
- the stable step size is controlled by the largest curvature
- conditioning matters even in very simple convex problems
That is why optimization papers spend so much time talking about smoothness constants, condition numbers, preconditioning, and adaptive steps.
8 Computation Lens
In practice, unconstrained first-order methods form a family rather than one algorithm:
- gradient descent: full deterministic gradient
- stochastic gradient descent: noisy gradient estimates from samples or minibatches
- subgradient methods: used when the objective is convex but nonsmooth
- momentum and accelerated methods: add memory or extrapolation
- adaptive methods such as AdaGrad: rescale coordinates based on past gradients
The unifying theme is the same:
use first-order information cheaply, many times
This is also where the math of the previous page becomes operational:
- smoothness controls stable steps
- strong convexity improves rates
- subgradients keep the method meaningful when corners appear
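As one concrete member of this family, here is a minimal minibatch SGD sketch (my own illustration on an arbitrary least-squares problem, not code from the cited courses):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative least-squares objective: (1/n) * sum_i (a_i^T x - b_i)^2.
n, d = 1000, 10
A = rng.normal(size=(n, d))
x_true = rng.normal(size=d)
b = A @ x_true + 0.01 * rng.normal(size=n)

x = np.zeros(d)
eta, batch_size = 0.05, 32

for step in range(2000):
    idx = rng.integers(0, n, size=batch_size)                # sample a minibatch
    residual = A[idx] @ x - b[idx]
    grad_estimate = 2.0 * A[idx].T @ residual / batch_size   # noisy first-order information
    x = x - eta * grad_estimate                              # the same gradient-descent update

print(np.linalg.norm(x - x_true))  # small: x is close to the least-squares solution
```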
9 Application Lens
Machine learning training loops are the most visible application of first-order methods.
Empirical risk minimization over large datasets rarely permits exact full-batch second-order solves, so practitioners use SGD, minibatch gradients, momentum variants, or adaptive methods. The theory you need to read these methods responsibly begins here:
- what the update is
- what assumptions justify descent
- how curvature affects speed
- why nonsmooth penalties often require subgradient or proximal variants (a minimal subgradient sketch closes this section)
For the broader training-side view, see Optimization for Machine Learning.
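For a feel of the nonsmooth case, here is a minimal subgradient-step sketch on an illustrative lasso-style objective (my own example; the data, penalty weight, and step schedule are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative objective: 0.5 * ||A x - b||_2^2 + lam * ||x||_1.
# The l1 term is nonsmooth at 0; lam * sign(x) is a valid subgradient choice
# (with sign(0) = 0 being a legitimate subgradient at the kink).
n, d = 200, 20
A = rng.normal(size=(n, d))
x_true = np.zeros(d)
x_true[:3] = [2.0, -1.0, 0.5]                        # sparse ground truth
b = A @ x_true + 0.05 * rng.normal(size=n)
lam = 1.0

x = np.zeros(d)
for k in range(1, 3001):
    subgrad = A.T @ (A @ x - b) + lam * np.sign(x)   # a subgradient of the objective
    x = x - (1e-3 / np.sqrt(k)) * subgrad            # diminishing steps, as subgradient methods use

print(np.round(x[:5], 2))  # the first three coordinates come out near 2.0, -1.0, 0.5
```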
10 Stop Here For First Pass
If you can now explain:
- the gradient descent update
- why the step size controls stability and speed
- how smoothness gives a descent guarantee
- why strong convexity leads to faster convergence
- how SGD and subgradient methods fit into the same first-order family
then this page has done its main job.
11 Go Deeper
The strongest next pages are:
- Optimization for Machine Learning for the ML training perspective
- Duality and Certificates for the lower-bound and certificate side of optimization
- Constrained Optimization, KKT, and Lagrangians for the constrained first-order story
12 Optional Paper Bridge
- Adaptive Subgradient Methods for Online Learning and Stochastic Optimization - Paper bridge: the canonical modern bridge from first-order convex analysis to adaptive large-scale learning algorithms. Checked 2026-04-24.
- RSG: Beating Subgradient Method without Smoothness and Strong Convexity - Paper bridge: a nice bridge showing how rate improvements still depend on geometric structure beyond the plain textbook subgradient method. Checked 2026-04-24.
13 Optional After First Pass
If you want more depth after the first pass:
- compare gradient descent on a well-conditioned quadratic and an ill-conditioned quadratic
- read the unconstrained minimization lectures in Stanford SEE EE364A
- look at MIT 6.7220/15.084 Nonlinear Optimization for a more current optimization-course perspective on gradient descent and SGD
14 Common Mistakes
- thinking the gradient always points directly to the minimizer
- using a large step size because it looks faster locally
- forgetting that smoothness assumptions justify the descent guarantee
- assuming subgradients behave identically to gradients at nonsmooth points
- treating stochastic gradients as exact gradients rather than noisy estimators
15 Exercises
- Apply one step of gradient descent with step size \(\eta\) to \(f(x)=x^2\) and write the resulting update rule.
- For the worked example, explain why the second coordinate controls the stable step size.
- In words, explain the difference between sublinear and linear convergence.
16 Sources and Further Reading
- Stanford EE364a: Convex Optimization I - First pass: current official course backbone for gradient methods in convex optimization. Checked 2026-04-24.
- Stanford Engineering Everywhere: EE364A Lecture 16 - First pass: official lecture archive page focused on unconstrained minimization, gradient descent, strong convexity, and Newton methods. Checked 2026-04-24.
- Convex Optimization by Boyd and Vandenberghe - First pass: canonical official text for descent methods and smooth convex minimization. Checked 2026-04-24.
- MIT 6.253: Convex Analysis and Optimization - Second pass: useful when you want more mathematical depth behind rates and convex optimization algorithms. Checked 2026-04-24.
- MIT 6.7220/15.084 Nonlinear Optimization - Second pass: current official optimization course with direct lecture material on gradient descent and stochastic gradient descent. Checked 2026-04-24.
- Adaptive Subgradient Methods for Online Learning and Stochastic Optimization - Paper bridge: a clean route from textbook first-order methods to adaptive large-scale learning. Checked 2026-04-24.
Sources checked online on 2026-04-24:
- Stanford EE364a course page
- Stanford SEE EE364A lecture 16 page
- Boyd and Vandenberghe convex optimization book page
- MIT 6.253 course page
- MIT 6.7220/15.084 course page
- JMLR AdaGrad page