Optimization for Machine Learning

A bridge page showing how loss minimization becomes an optimization problem, why gradients matter, and where regularization and stochastic methods enter ML training.
Modified: April 26, 2026

Keywords: optimization, gradient descent, stochastic gradient descent, regularization, objective

1 Application Snapshot

Once a learning problem is written as an empirical risk, the next question is:

how do we actually minimize the objective?

That is where optimization enters machine learning.

This page is the shortest bridge from loss-based ML language to gradient methods, stochastic training, and regularization.

2 Problem Setting

Suppose a model is parameterized by \(\theta\), and we choose an objective

\[ J(\theta) = \hat{R}_n(\theta) + \lambda \Omega(\theta), \]

where:

  • \(\hat{R}_n(\theta)\) is the empirical risk
  • \(\Omega(\theta)\) is a regularization term
  • \(\lambda\) controls the strength of regularization

Training means solving

\[ \min_\theta J(\theta). \]

In simple linear models, this may have a closed-form solution or a very structured numerical method. In larger models, we usually rely on iterative optimization.
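
To make the pieces concrete, here is a minimal numpy sketch of this two-term objective for a linear model with squared loss and an L2 regularizer. The function names and the specific loss are illustrative choices, not part of the general setup.

```python
import numpy as np

def empirical_risk(theta, X, y):
    """Empirical risk R_hat_n(theta): here, mean squared error of a linear model."""
    residual = X @ theta - y
    return np.mean(residual ** 2)

def omega(theta):
    """Regularizer Omega(theta): here, the squared L2 norm ||theta||_2^2."""
    return float(theta @ theta)

def objective(theta, X, y, lam):
    """J(theta) = R_hat_n(theta) + lambda * Omega(theta)."""
    return empirical_risk(theta, X, y) + lam * omega(theta)
```

Any other loss or regularizer slots into the same two-term shape; only the gradients change.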

3 Why This Math Appears

The optimization layer matters because most ML models are not defined only by what they can represent. They are defined by:

  • an objective function
  • a parameterization
  • an algorithm used to reduce the objective

This framing reuses math from several areas:

  • Linear Algebra: gradients, Jacobians, and linear approximations live in vector spaces
  • Statistics: the objective is built from sample-based quantities
  • Probability: stochastic gradients and random minibatches introduce noise and sampling effects
  • Optimization: descent directions, step sizes, convexity, and constraints control whether training behaves well

So “training a model” is often shorthand for “approximately solving an optimization problem induced by data.”

4 Math Objects In Use

  • parameter vector \(\theta\)
  • objective function \(J(\theta)\)
  • gradient \(\nabla J(\theta)\)
  • step size or learning rate \(\eta\)
  • regularizer \(\Omega(\theta)\)

5 A Small Worked Walkthrough

Take linear regression with squared loss:

\[ J(\beta) = \frac{1}{n}\sum_{i=1}^n (x_i^\top \beta - y_i)^2. \]

This is the same empirical-risk viewpoint from Supervised Learning, Losses, and Empirical Risk, now seen as an optimization problem.

The gradient is

\[ \nabla J(\beta) = \frac{2}{n} X^\top (X\beta - y). \]

So one gradient descent step is

\[ \beta_{t+1} = \beta_t - \eta \nabla J(\beta_t). \]

This update has a direct geometric interpretation:

  • the residual \(X\beta - y\) measures current mismatch
  • multiplication by \(X^\top\) maps that mismatch back into parameter space
  • the update moves parameters in a direction that lowers the objective locally
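
Putting the gradient and the update together, here is a minimal numpy sketch of the loop, assuming X is an n-by-d array and y a length-n vector; the learning rate and step count are placeholder choices, not tuned values.

```python
import numpy as np

def grad_J(beta, X, y):
    """Gradient of mean squared error: (2/n) X^T (X beta - y)."""
    n = X.shape[0]
    return (2.0 / n) * X.T @ (X @ beta - y)

def gradient_descent(X, y, eta=0.1, steps=500):
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        beta = beta - eta * grad_J(beta, X, y)   # beta_{t+1} = beta_t - eta * grad J(beta_t)
    return beta

# Illustrative usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
print(gradient_descent(X, y))   # should land near [1.0, -2.0, 0.5]
```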

If we add ridge-style regularization,

\[ J_\lambda(\beta) = \frac{1}{n}\sum_{i=1}^n (x_i^\top \beta - y_i)^2 + \lambda \|\beta\|_2^2, \]

then the gradient changes to

\[ \nabla J_\lambda(\beta) = \frac{2}{n} X^\top (X\beta - y) + 2\lambda \beta. \]

That one extra term already shows why optimization and regularization are tightly linked in ML.
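
One way to sanity-check that extra term: setting \(\nabla J_\lambda(\beta) = 0\) gives \((X^\top X + n\lambda I)\beta = X^\top y\), so this ridge objective also has a closed-form minimizer, and gradient descent with the modified gradient should approach it. A self-contained sketch, where the data and \(\lambda\) are arbitrary illustrative choices:

```python
import numpy as np

def grad_J_ridge(beta, X, y, lam):
    """Ridge gradient: (2/n) X^T (X beta - y) + 2 * lam * beta."""
    n = X.shape[0]
    return (2.0 / n) * X.T @ (X @ beta - y) + 2.0 * lam * beta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
n, d, lam = X.shape[0], X.shape[1], 0.1

# Closed form from setting the gradient to zero: (X^T X + n*lam*I) beta = X^T y.
beta_closed = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

beta = np.zeros(d)
for _ in range(2000):
    beta -= 0.1 * grad_J_ridge(beta, X, y, lam)

print(np.max(np.abs(beta - beta_closed)))   # should be close to zero
```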

6 Implementation or Computation Note

In practice, large models rarely compute the full gradient on all data at every step.

Instead they use stochastic or minibatch gradients, formed by averaging per-example gradients over a small random subset of the data:

\[ g_t \approx \nabla J(\theta_t). \]

Then the update becomes

\[ \theta_{t+1} = \theta_t - \eta g_t. \]

This is cheaper per step and often the only feasible approach at scale, but it introduces noise and tuning questions:

  • batch size
  • learning rate schedule
  • momentum or adaptive updates
  • stopping criteria

So the optimization problem and the optimization algorithm are related, but not identical.
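
As a sketch of the algorithmic side, here is a bare-bones minibatch SGD loop for the same least-squares objective. The batch size, learning rate, and sampling scheme are illustrative defaults; real training loops layer schedules, momentum, and stopping rules on top.

```python
import numpy as np

def minibatch_sgd(X, y, eta=0.05, batch_size=16, steps=2000, seed=0):
    """SGD on mean squared error: each g_t is the gradient on a random minibatch."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(steps):
        idx = rng.integers(0, n, size=batch_size)            # sample a minibatch
        Xb, yb = X[idx], y[idx]
        g = (2.0 / batch_size) * Xb.T @ (Xb @ theta - yb)    # noisy estimate of grad J
        theta -= eta * g                                     # theta_{t+1} = theta_t - eta * g_t
    return theta
```

Because each \(g_t\) is only an estimate, the iterates jitter around the minimizer rather than settling exactly; shrinking the learning rate over time is the standard remedy.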

7 Failure Modes

  • the objective may be badly scaled, making gradient steps unstable
  • the learning rate may be too large or too small (see the sketch after this list)
  • low training loss can still coexist with poor test performance
  • a regularizer may stabilize fitting but bias the learned solution
  • nonconvex objectives can introduce saddle points, plateaus, and multiple local minima
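
To see the scaling and step-size failures concretely, the sketch below runs gradient descent on a least-squares problem with one badly scaled feature. For this quadratic objective the steps converge only if \(\eta < 2/L\) with \(L = \frac{2}{n}\lambda_{\max}(X^\top X)\); the data and step sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 0] *= 10.0                        # badly scaled feature inflates L
y = X @ np.array([1.0, -2.0, 0.5])

n = X.shape[0]
L = (2.0 / n) * np.linalg.eigvalsh(X.T @ X).max()
print("stability threshold 2/L =", 2.0 / L)

for eta in (0.5 / L, 3.0 / L):         # one stable step size, one unstable
    beta = np.zeros(3)
    for _ in range(100):
        beta -= eta * (2.0 / n) * X.T @ (X @ beta - y)
    loss = np.mean((X @ beta - y) ** 2)
    print(f"eta = {eta:.5f}: loss after 100 steps = {loss:.3e}")
```

The stable step size shrinks the loss; the unstable one blows it up by dozens of orders of magnitude. Rescaling features shrinks \(L\) and widens the usable range of \(\eta\).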

8 Paper Bridge

9 Sources and Further Reading

  • EE364a: Convex Optimization I - First pass - official optimization course explaining the mathematical layer behind many ML objectives. Checked 2026-04-24.
  • CS229: Machine Learning - First pass - official ML course hub where objective-based training and optimization are central. Checked 2026-04-24.
  • CS 189 Syllabus - Second pass - official Berkeley course framing optimization as part of the standard ML pipeline. Checked 2026-04-24.
  • Mathematics for Machine Learning - Second pass - useful bridge for readers translating gradients and objectives into ML notation. Checked 2026-04-24.