Bayesian Optimization and Surrogate Modeling

A bridge page showing how Bayesian optimization uses a probabilistic surrogate and an acquisition function to search expensive black-box objectives with fewer evaluations.
Modified: April 26, 2026

Keywords

bayesian optimization, surrogate model, acquisition function, gaussian process, hyperparameter optimization

1 Application Snapshot

Bayesian optimization is designed for objectives that are:

  • expensive to evaluate
  • noisy or black-box
  • available only through a limited budget of trials

Instead of evaluating the true objective everywhere, it repeats a small loop:

  1. fit a surrogate model to the observations so far
  2. use that surrogate to score where it is worth sampling next
  3. evaluate the real objective there
  4. update and repeat
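Under illustrative assumptions, this loop can be sketched in a few lines. The `nn_surrogate` helper below is a deliberately toy stand-in for a real surrogate such as a Gaussian process, and `bo_loop` is a hypothetical name, not a library function:

```python
import numpy as np

def nn_surrogate(X, Y, candidates):
    """Toy stand-in for a real surrogate (e.g. a Gaussian process):
    predict the nearest observed value, with uncertainty growing
    with distance to the nearest observation."""
    d = np.abs(candidates[:, None] - X[None, :])
    mu = Y[d.argmin(axis=1)]
    sigma = d.min(axis=1)
    return mu, sigma

def bo_loop(objective, candidates, n_init=3, budget=30, seed=0):
    rng = np.random.default_rng(seed)
    X = list(rng.choice(candidates, size=n_init, replace=False))  # initial design
    Y = [objective(x) for x in X]
    for _ in range(budget - n_init):
        # 1. fit the surrogate to the observations so far
        mu, sigma = nn_surrogate(np.asarray(X), np.asarray(Y), candidates)
        # 2. score candidates with a UCB-style acquisition, mu + sigma
        x_next = candidates[int(np.argmax(mu + sigma))]
        # 3. evaluate the real objective there
        Y.append(objective(x_next))
        # 4. update and repeat
        X.append(x_next)
    i = int(np.argmax(Y))
    return X[i], Y[i]

# Maximize a toy objective whose optimum is at x = 0.3
grid = np.linspace(0.0, 1.0, 101)
x_best, y_best = bo_loop(lambda x: -(x - 0.3) ** 2, grid)
```

Even with this crude surrogate, the loop spends early evaluations spreading out (large uncertainty dominates) and later ones refining around the best observed region.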

So the central ML idea is:

use a model of the objective to decide how to spend the next experiment

2 Problem Setting

Suppose we want to maximize an expensive objective

\[ f(x) \]

over a search space of configurations \(x\).

In ML, \(x\) might be:

  • hyperparameters
  • architecture settings
  • prompting or retrieval settings
  • simulator or experimental parameters

and \(f(x)\) might be:

  • validation accuracy
  • reward
  • sample efficiency
  • scientific yield from a costly experiment

We observe data

\[ \mathcal{D}_t = \{(x_i, y_i)\}_{i=1}^t \]

with \(y_i\) equal to a noisy evaluation of \(f(x_i)\).

Bayesian optimization fits a surrogate posterior over the objective, often summarized by a predictive mean \(\mu_t(x)\) and uncertainty \(\sigma_t(x)\), and then chooses the next point by optimizing an acquisition function

\[ \alpha_t(x). \]
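As a concrete (if simplified) illustration of where \(\mu_t(x)\) and \(\sigma_t(x)\) come from, here is a minimal zero-mean Gaussian-process posterior with a squared-exponential kernel. The length scale and noise level are illustrative choices, not fitted values:

```python
import numpy as np

def rbf(a, b, ls=0.1, var=1.0):
    """Squared-exponential kernel k(a, b) for 1-D inputs."""
    return var * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

def gp_posterior(X, y, Xs, noise=1e-4):
    """Posterior mean mu_t(x) and std sigma_t(x) of a zero-mean GP surrogate."""
    K = rbf(X, X) + noise * np.eye(len(X))   # kernel on observed inputs
    Ks = rbf(X, Xs)                          # observed vs. query inputs
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = Ks.T @ alpha                        # predictive mean
    v = np.linalg.solve(L, Ks)
    var = np.diag(rbf(Xs, Xs)) - np.sum(v * v, axis=0)
    return mu, np.sqrt(np.maximum(var, 0.0))

X_obs = np.array([0.2, 0.5, 0.8])
y_obs = np.array([0.10, 0.40, 0.20])
X_query = np.array([0.2, 2.0])               # one observed point, one far-away point
mu, sigma = gp_posterior(X_obs, y_obs, X_query)
```

Near an observed point the posterior pins down the value (small \(\sigma\)); far from all data it reverts to the prior (mean near zero, \(\sigma\) near one), which is exactly the uncertainty the acquisition function exploits.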

3 Why This Math Appears

This page ties together earlier threads on probability, optimization, and sequential decision making: Bayesian optimization is one of the cleanest examples of ML using probability and optimization together in a sequential decision loop.

4 Math Objects In Use

  • black-box objective \(f(x)\)
  • observation history \(\mathcal{D}_t\)
  • surrogate posterior mean \(\mu_t(x)\)
  • surrogate uncertainty \(\sigma_t(x)\)
  • acquisition function \(\alpha_t(x)\)
  • incumbent best value or best observed point

Common acquisition patterns include:

  • upper confidence bound (UCB)
  • expected improvement (EI)
  • probability of improvement (PI)

5 A Small Worked Walkthrough

Suppose we are tuning one hyperparameter \(x\) and want to maximize validation accuracy.

After a few evaluations, the surrogate gives:

  Candidate \(x\)   Predictive mean \(\mu(x)\)   Predictive std. dev. \(\sigma(x)\)
  0.01              0.82                         0.01
  0.05              0.80                         0.04
  0.20              0.76                         0.10

If we use an upper-confidence rule

\[ \alpha(x) = \mu(x) + \beta \sigma(x) \]

with \(\beta = 1\), then

\[ \alpha(0.01)=0.83,\qquad \alpha(0.05)=0.84,\qquad \alpha(0.20)=0.86. \]

So Bayesian optimization would pick \(x=0.20\) next, even though it does not have the highest current mean.
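The scores above can be checked in a couple of lines:

```python
import numpy as np

# mu and sigma from the table above; beta = 1 as in the text
candidates = np.array([0.01, 0.05, 0.20])
mu = np.array([0.82, 0.80, 0.76])
sigma = np.array([0.01, 0.04, 0.10])
alpha = mu + 1.0 * sigma                     # UCB scores
x_next = candidates[int(np.argmax(alpha))]   # next point to evaluate
```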

Why?

  • \(x=0.01\) looks good, but we already know it fairly well
  • \(x=0.20\) looks worse on the current mean, but its uncertainty is large
  • the acquisition function values learning opportunity, not only current best guess

Now suppose the real evaluation at \(x=0.20\) comes back as \(0.88\). The surrogate updates, and the next acquisition step will usually focus around that region with a different exploration-exploitation balance.

That is the core BO loop:

  • fit beliefs about the objective
  • spend the next evaluation where the surrogate says it is most valuable
  • update and repeat

6 Implementation or Computation Note

Bayesian optimization tends to work best when:

  • each evaluation is genuinely expensive
  • the budget of evaluations is small to moderate
  • uncertainty matters
  • the search space is not too high-dimensional without extra structure

In practice, a BO pipeline usually includes:

  1. an initial design, often random or Sobol points
  2. surrogate fitting after each batch
  3. acquisition optimization to choose the next candidate
  4. optional handling of noise, constraints, or batched evaluations
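As a sketch of step 1, here is an initial design drawn with scipy's quasi-Monte Carlo Sobol sampler over a hypothetical two-dimensional search space (a learning rate on a log scale plus a dropout rate; the ranges are illustrative):

```python
import numpy as np
from scipy.stats import qmc

# Hypothetical search space: learning rate in [1e-4, 1e-1] (log10 scale)
# and dropout in [0.0, 0.5].
sampler = qmc.Sobol(d=2, scramble=True, seed=0)
unit = sampler.random_base2(m=3)             # 2**3 = 8 points in the unit square
log10_lr = -4.0 + 3.0 * unit[:, 0]           # map to log10(lr) in [-4, -1]
dropout = 0.5 * unit[:, 1]                   # map to [0, 0.5]
init_design = np.column_stack([10.0 ** log10_lr, dropout])
```

Sobol points fill the (transformed) space more evenly than uniform random draws, which gives the first surrogate fit reasonable coverage before any acquisition step runs.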

A common first surrogate is a Gaussian process, but modern systems also use:

  • multi-task surrogates
  • multi-fidelity surrogates
  • trust-region variants for higher dimensions
  • discrete or mixed-space adaptations

So the phrase Bayesian optimization names a family of sequential design methods, not just one fixed algorithm.

7 Failure Modes

  • using BO when the objective is cheap enough that random search or direct optimization is simpler
  • treating surrogate predictions as truth rather than as uncertain summaries
  • ignoring the difficulty of optimizing the acquisition function itself
  • pushing classical GP BO into very high-dimensional spaces without structure
  • forgetting that noisy objectives may need repeated trials or careful variance modeling
  • optimizing only the surrogate mean and calling it Bayesian optimization

One practical sanity check is:

if each trial is cheap, Bayesian optimization is often solving the wrong problem elegantly

8 Paper Bridge

9 Sources and Further Reading
