Convex Functions and Subgradients
convex function, subgradient, subdifferential, epigraph, first-order condition
1 Role
This page is the analytic heart of the early optimization module.
It explains what kind of objective functions let local first-order information control global behavior, and how subgradients extend that story to nonsmooth problems.
2 First-Pass Promise
Read this page after Convex Sets and Separation.
If you stop here, you should still understand:
- what a convex function is
- why tangent information gives a global lower bound
- what a subgradient is at a corner or kink
- why these ideas matter for regularization, hinge-style losses, and first-order optimization
3 Why It Matters
Optimization is not just about feasible regions. It is also about objective shape.
Convex functions are special because they have no deceptive valleys. Between any two points on the graph, the graph lies on or below the chord joining them. That geometric statement leads to powerful optimization consequences:
- every local minimum is a global minimum
- tangent information becomes globally meaningful
- many nonsmooth objectives still admit useful first-order certificates
This is exactly why convex losses, penalties, and regularizers are so important in machine learning, statistics, and engineering design. They allow meaningful optimization statements even when the problem is not smooth everywhere.
4 Prerequisite Recall
- a convex set contains every line segment between its points
- the epigraph of a function is the set of points lying on or above its graph
- a hyperplane can support or separate convex sets
- gradients describe first-order change when a function is differentiable
5 Intuition
For sets, convexity meant no inward dents.
For functions, convexity means the graph bends upward rather than downward. Evaluating at an average of inputs never does worse than averaging the outputs:
function of an average <= average of function values
That single inequality is why convex objectives behave so differently from arbitrary ones.
If the function is smooth, every tangent line or tangent plane sits below the graph. So the gradient at one point already gives a global lower-bounding linear model.
If the function has a corner, there may be many supporting lines instead of one tangent line. Each valid supporting slope is a subgradient. The whole collection is the subdifferential.
So the progression is:
- convex set -> supporting hyperplanes
- convex function -> supporting affine lower bounds
- nonsmooth convex function -> subgradients and subdifferentials
6 Formal Core
Definition 1 (Definition) A function \(f : \mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}\) is convex if its domain is convex and for all \(x,y\) in the domain and all \(\theta \in [0,1]\),
\[ f(\theta x + (1-\theta)y) \le \theta f(x) + (1-\theta)f(y). \]
This is the defining Jensen-style inequality for convexity.
One very important geometric equivalence is:
- \(f\) is convex if and only if its epigraph
\[ \operatorname{epi}(f) = \{(x,t) : t \ge f(x)\} \]
is a convex set.
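The defining inequality is easy to probe numerically. The sketch below picks an illustrative convex function (a quadratic plus an \(\ell_1\) term; the choice is ours, not canonical) and checks the Jensen-style inequality at random point pairs and mixing weights:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # An illustrative convex function: quadratic plus l1 penalty
    return x @ x + np.abs(x).sum()

# Spot-check f(t*x + (1-t)*y) <= t*f(x) + (1-t)*f(y) on random samples
for _ in range(1000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    t = rng.uniform()
    lhs = f(t * x + (1 - t) * y)
    rhs = t * f(x) + (1 - t) * f(y)
    assert lhs <= rhs + 1e-12, "convexity inequality violated"
print("Jensen-style inequality held on all random samples")
```

A finite sample is evidence, not a proof, but it is a useful habit when you are unsure whether a hand-built objective is convex.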
Theorem 1 (First-Order Characterization) A differentiable function \(f\) with convex domain is convex if and only if for all \(x,y\) in the domain,
\[ f(y) \ge f(x) + \nabla f(x)^\top (y-x). \]
So the first-order Taylor model at \(x\) is a global affine lower bound for the function.
This is one of the most important inequalities in optimization. It says the gradient of a convex function is not merely local information.
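The global lower-bound property can be checked directly. The sketch below uses log-sum-exp, a standard smooth convex function whose gradient is the softmax map, and verifies the inequality at random point pairs (an illustrative check, not a proof):

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    # log-sum-exp: a smooth convex function
    return np.log(np.sum(np.exp(x)))

def grad_f(x):
    # Its gradient is the softmax map (computed in a stabilized form)
    e = np.exp(x - x.max())
    return e / e.sum()

# Check f(y) >= f(x) + grad_f(x)^T (y - x) at random point pairs
for _ in range(1000):
    x, y = rng.normal(size=4), rng.normal(size=4)
    assert f(y) >= f(x) + grad_f(x) @ (y - x) - 1e-10
print("first-order lower bound held at all sampled pairs")
```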
Definition 2 (Definition) A vector \(g\) is a subgradient of a convex function \(f\) at \(x\) if for all \(y\),
\[ f(y) \ge f(x) + g^\top (y-x). \]
The set of all subgradients at \(x\) is the subdifferential, written \(\partial f(x)\).
When \(f\) is differentiable at \(x\), the subdifferential collapses to one point:
\[ \partial f(x)=\{\nabla f(x)\}. \]
Proposition 1 (Subgradient Optimality Condition) For an unconstrained convex problem
\[ \min_x f(x), \]
a point \(x^\star\) is globally optimal if and only if
\[ 0 \in \partial f(x^\star). \]
For differentiable convex functions, this reduces to the familiar condition \(\nabla f(x^\star)=0\).
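For a concrete smooth instance, take \(f(x)=\lVert x-a\rVert^2\): the gradient vanishes exactly at \(x^\star=a\), and the first-order bound then certifies global optimality. A small illustrative check (the vector \(a\) here is arbitrary):

```python
import numpy as np

a = np.array([1.0, -2.0, 0.5])

def f(x):
    # Smooth convex function with minimizer x* = a
    return np.sum((x - a) ** 2)

def grad_f(x):
    return 2.0 * (x - a)

x_star = a
assert np.allclose(grad_f(x_star), 0.0)  # stationarity: gradient is zero

# With zero gradient, the first-order bound reads f(y) >= f(x*) for all y,
# which is exactly global optimality; sample to illustrate.
rng = np.random.default_rng(2)
for _ in range(1000):
    y = rng.normal(size=3)
    assert f(y) >= f(x_star)
print("gradient-zero point verified as globally optimal on samples")
```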
7 Worked Example
Take the function
\[ f(x)=|x|. \]
This is one of the simplest convex but nonsmooth functions.
First, why is it convex? One reason is geometric: its graph is a V shape, always bending upward. Another reason is analytic: it satisfies the convexity inequality, and its epigraph is the convex cone
\[ \{(x,t): t\ge |x|\}. \]
Now examine its subgradients.
For \(x>0\), the function is differentiable and
\[ \partial f(x)=\{1\}. \]
For \(x<0\), it is also differentiable and
\[ \partial f(x)=\{-1\}. \]
At the kink \(x=0\), there is no ordinary derivative, but there are many supporting lines. A number \(g\) is a subgradient at \(0\) if for all \(y\),
\[ |y| \ge g y. \]
This holds exactly when
\[ g \in [-1,1]. \]
So
\[ \partial f(0) = [-1,1]. \]
Now check optimality. Since
\[ 0 \in [-1,1] = \partial f(0), \]
the point \(x^\star=0\) is a global minimizer of \(|x|\).
This tiny example contains the main picture:
- convexity gives global structure
- smooth points have one supporting slope
- corners have many supporting slopes
- minimization means finding a point whose subdifferential contains zero
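The subgradient condition at the kink can also be tested on a grid: a slope \(g\) supports \(|x|\) at \(0\) exactly when \(|y| \ge g y\) for every test point \(y\). A finite-grid sketch (evidence on sampled points, not a proof):

```python
import numpy as np

ys = np.linspace(-5.0, 5.0, 1001)

def is_subgradient_at_zero(g):
    # g is a subgradient of |.| at 0 iff |y| >= g*y for all y (grid check)
    return bool(np.all(np.abs(ys) >= g * ys - 1e-12))

# Slopes inside [-1, 1] support the graph; slopes outside do not
assert is_subgradient_at_zero(0.0)
assert is_subgradient_at_zero(1.0)
assert is_subgradient_at_zero(-1.0)
assert not is_subgradient_at_zero(1.1)
assert not is_subgradient_at_zero(-1.1)
print("subdifferential of |x| at 0 matches [-1, 1] on the tested grid")
```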
8 Computation Lens
Convexity and subgradients immediately affect algorithms.
If the objective is smooth and convex, gradient descent makes sense because each gradient defines a global affine underestimator of the objective, so local steps carry global information.
If the objective is convex but nonsmooth, we often use:
- subgradient methods
- proximal methods
- splitting methods
- smoothing or reformulation tricks
This is why functions like:
- \(\ell_1\) penalties
- hinge loss
- max-type objectives
- norm objectives
still belong naturally to the optimization toolbox even when ordinary derivatives fail at some points.
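To make the algorithmic point concrete, here is a minimal subgradient method on the convex nonsmooth function \(f(x)=|x| + (x-2)^2\), whose minimizer \(x^\star = 1.5\) satisfies \(0 \in \partial f(x^\star)\) with \(f(x^\star)=1.75\). The objective and the diminishing step-size rule are illustrative choices, not a canonical recipe:

```python
import numpy as np

def f(x):
    # Convex but nonsmooth at x = 0; minimizer is x* = 1.5
    return abs(x) + (x - 2.0) ** 2

def subgrad(x):
    # One valid element of the subdifferential at x;
    # at the kink x = 0, np.sign picks 0 from [-1, 1]
    return np.sign(x) + 2.0 * (x - 2.0)

x = 10.0
best = f(x)
for k in range(1, 5001):
    x = x - (1.0 / k) * subgrad(x)  # diminishing step sizes 1/k
    best = min(best, f(x))          # track the best value seen

# Subgradient steps are not descent steps, so we report the best iterate
assert abs(best - 1.75) < 1e-2
print(f"best value found: {best:.4f}")
```

Note the design point: a subgradient step can increase the objective, which is why the method tracks the best iterate rather than trusting the last one.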
9 Application Lens
Many of the most important ML and statistics objectives are built from convex functions and subgradients.
Examples include:
- \(\ell_1\) regularization, where the absolute value promotes sparsity
- hinge loss in margin-based classification
- norms and max-functions in robust or structured optimization
The reason these models remain workable is that convexity gives global geometry, and subgradients preserve first-order reasoning at nonsmooth points.
For a bridge to model complexity and regularization, see Regularization, Implicit Bias, and Model Complexity.
10 Stop Here For First Pass
If you can now explain:
- what convexity means for a function
- why differentiable convex functions admit global affine lower bounds
- what a subgradient is at a kink
- why \(0 \in \partial f(x^\star)\) is the right nonsmooth optimality condition
then this page has done its main job.
11 Go Deeper
The strongest next pages are:
- Unconstrained First-Order Methods for the algorithmic version of the first-order story
- Constrained Optimization, KKT, and Lagrangians for the constrained first-order version
- Regularization, Implicit Bias, and Model Complexity for a machine-learning bridge
12 Optional Paper Bridge
- Adaptive Subgradient Methods for Online Learning and Stochastic Optimization - Paper bridge - watch how subgradient ideas lead to practical large-scale learning algorithms like AdaGrad. Checked 2026-04-24.
- 1-norm Support Vector Machines - Paper bridge - a clean bridge from convex nonsmooth objectives to sparse margin-based learning. Checked 2026-04-24.
13 Optional After First Pass
If you want more depth after the first pass:
- compare ordinary gradients and subgradients on absolute value, hinge loss, and max-functions
- read the convex-functions lecture in Stanford SEE EE364A and watch how epigraphs and supporting hyperplanes enter the proofs
- look at Stanford EE364m if you want a more mathematical convex-analysis perspective
14 Common Mistakes
- confusing convexity of the domain with convexity of the function
- assuming every convex function must be differentiable
- treating the gradient inequality as local rather than global
- thinking the subdifferential at a nonsmooth point is an arbitrary set rather than the set of supporting slopes
- forgetting that \(0 \in \partial f(x^\star)\) is sufficient for global optimality only because the problem is convex
15 Exercises
- Show that \(f(x)=x^2\) is convex by checking the first-order inequality.
- Compute the subdifferential of \(f(x)=|x|\) at \(x=0\) directly from the definition.
- Explain why the function \(f(x)=\max\{x,0\}\) is convex and identify \(\partial f(0)\).
16 Sources and Further Reading
- Stanford EE364a: Convex Optimization I - First pass - current official course backbone for convex functions, optimality, and modeling. Checked 2026-04-24.
- Stanford Engineering Everywhere: EE364A - First pass - official lecture archive with direct material on convex functions, epigraphs, and first-order conditions. Checked 2026-04-24.
- Convex Optimization by Boyd and Vandenberghe - First pass - canonical official text for convex functions, subgradients, and global first-order reasoning. Checked 2026-04-24.
- MIT 6.253: Convex Analysis and Optimization - Second pass - good for a more mathematical convex-analysis treatment. Checked 2026-04-24.
- Stanford EE364m: The Mathematics of Convexity - Second pass - current official course for a deeper mathematical treatment of convex sets and functions. Checked 2026-04-24.
- Adaptive Subgradient Methods for Online Learning and Stochastic Optimization - Paper bridge - a strong path from convex analysis into large-scale learning algorithms. Checked 2026-04-24.
Sources checked online on 2026-04-24:
- Stanford EE364a course page
- Stanford SEE EE364A page
- Boyd and Vandenberghe convex optimization book page
- MIT 6.253 course page
- Stanford EE364m page
- JMLR AdaGrad paper page