Value Iteration, Policy Iteration, and Approximate Dynamic Programming

How Bellman fixed points become practical algorithms through value iteration, policy iteration, and approximation.
Modified: April 26, 2026

Keywords

value iteration, policy iteration, approximate dynamic programming, Bellman operator, reinforcement learning

1 Role

This is the fourth page of the Stochastic Control and Dynamic Programming module.

Its job is to answer the next natural question after Bellman fixed points:

  • if the optimal value is defined by a fixed-point equation, how do we compute it?

This is where the module turns the Bellman equation into actual algorithms.

2 First-Pass Promise

Read this page after Infinite-Horizon Value Functions, Bellman Equations, and Contractions.

If you stop here, you should still understand:

  • what value iteration does
  • what policy iteration does
  • why both algorithms are rooted in Bellman structure
  • why approximation becomes unavoidable in large state spaces

3 Why It Matters

The Bellman equation is elegant, but by itself it is only a characterization.

It tells us what the optimal value function must satisfy.

It does not yet tell us how to find that object at scale.

This page introduces the two classic algorithmic answers:

  • value iteration: repeatedly apply the Bellman operator
  • policy iteration: alternate between evaluating a policy and improving it

Then it makes one more step that matters for modern practice:

  • when the state space is large or continuous, exact dynamic programming becomes impossible
  • so we approximate value functions, policies, or Bellman updates

That is the entry point to approximate dynamic programming and many RL methods.

4 Prerequisite Recall

  • the Bellman optimality operator T maps one candidate value function to another
  • under discounted infinite-horizon assumptions, T is a contraction
  • the optimal value function V^* is the unique fixed point of T
  • a greedy action with respect to V^* is optimal

5 Intuition

5.1 Value Iteration Repeats The Bellman Update

If the Bellman operator has a unique fixed point and shrinks errors, then a simple idea becomes natural:

  • start with any guess V_0
  • apply T
  • apply T again
  • keep going

The contraction theorem says these repeated updates move toward V^*.
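The loop above fits in a few lines. In this sketch, the two-state, two-action MDP (its costs, transition probabilities, and \gamma = 0.9) is an assumption for illustration, not the example from this module:

```python
import numpy as np

# A minimal sketch of value iteration for a discounted cost-minimization
# MDP. The tiny MDP below is illustrative: costs[a][s] is the immediate
# cost of action a in state s, P[a][s, s'] the transition probability.
gamma = 0.9
costs = np.array([[0.0, 1.0],
                  [2.0, 0.5]])
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])

def bellman(V):
    """Apply the Bellman optimality operator T once."""
    return (costs + gamma * P @ V).min(axis=0)

V = np.zeros(2)                # start with any guess V_0
for _ in range(1000):          # apply T, apply T again, keep going
    V_next = bellman(V)
    if np.max(np.abs(V_next - V)) < 1e-10:   # sup-norm change is tiny
        break
    V = V_next
```

Stopping when the sup-norm change between iterates is small is justified by the contraction property: a small change between successive iterates bounds the remaining distance to the fixed point.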

5.2 Policy Iteration Solves Easier Problems Along The Way

Instead of updating the value function directly, policy iteration keeps a current policy \pi and alternates:

  • evaluate how good that policy is
  • improve it by acting greedily with respect to its value

So the algorithm climbs through policies rather than only through value guesses.

5.3 Value Iteration Is Cheap Per Step, Policy Iteration Is Stronger Per Round

At first pass, the tradeoff is:

  • value iteration: simpler update, often many iterations
  • policy iteration: heavier iteration, often more aggressive improvement

The point is not that one always dominates.

The point is that both are natural consequences of Bellman structure.

5.4 Approximation Starts When Exact Storage Breaks

Exact value tables make sense for small finite state spaces.

But large or continuous systems force a different question:

  • what if we cannot store one exact number for every state?

Then we approximate:

  • value functions with features or function classes
  • policies with parameterized rules
  • Bellman updates by sampling or projection

That is approximate dynamic programming at first pass.

6 Formal Core

For a discounted infinite-horizon problem, let T be the Bellman optimality operator.

Definition 1 (Definition: Value Iteration) Given an initial guess V_0, define

\[ V_{k+1}=T V_k. \]

This is the simplest Bellman-based algorithm.

Theorem 1 (Theorem Idea: Value Iteration Converges) Because T is a \gamma-contraction in the sup norm,

\[ \|V_k - V^*\|_\infty \le \gamma^k \|V_0 - V^*\|_\infty \to 0 \]

as k \to \infty, so the error shrinks geometrically.

At first pass, read this as:

  • each Bellman update shrinks the worst-case error by at least a factor of \gamma
  • discounting prevents errors from being amplified

Definition 2 (Definition Idea: Policy Evaluation) For a fixed policy \pi, its value function V^\pi satisfies a policy-specific Bellman equation:

\[ V^\pi = T^\pi V^\pi, \]

where T^\pi means: follow policy \pi instead of minimizing over actions.
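Because the policy is fixed, this policy-specific Bellman equation is linear, so for a finite state space it can be solved in one step: V^\pi = (I - \gamma P^\pi)^{-1} g^\pi. A minimal numpy sketch, where the transition matrix, one-step costs, and \gamma = 0.9 are assumed numbers for illustration:

```python
import numpy as np

# Policy evaluation as a linear solve. The fixed-policy Bellman equation
# V^pi = T^pi V^pi is linear: V^pi = g^pi + gamma * P^pi V^pi, hence
# V^pi = (I - gamma * P^pi)^{-1} g^pi. Numbers below are illustrative.
gamma = 0.9
P_pi = np.array([[0.8, 0.2],     # transitions under the fixed policy pi
                 [0.3, 0.7]])
g_pi = np.array([0.0, 1.0])      # expected one-step costs under pi

V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, g_pi)
# V_pi now satisfies V_pi = g_pi + gamma * P_pi @ V_pi up to round-off.
```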

Definition 3 (Definition Idea: Policy Improvement) Given a value function estimate V, define a new policy that, in each state s, chooses an action minimizing

\[ \mathbb{E}[g(s,a,S')+\gamma V(S')]. \]

So improvement means: use the current value estimate to choose better actions.

Theorem 2 (Theorem Idea: Policy Iteration) Policy iteration alternates:

  1. evaluate the current policy
  2. improve it greedily

Under the standard discounted finite-state setup, this process converges to an optimal policy.
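The two alternating steps can be sketched directly. The small MDP here (costs, transitions, \gamma = 0.9) is assumed for illustration:

```python
import numpy as np

# A sketch of policy iteration on an assumed two-state, two-action
# cost-minimization MDP: costs[a][s] and P[a][s, s'] are illustrative.
gamma = 0.9
costs = np.array([[0.0, 1.0],
                  [2.0, 0.5]])
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])
n = 2

pi = np.zeros(n, dtype=int)              # start from any policy
for _ in range(50):                      # terminates in finitely many steps
    # 1. evaluate the current policy exactly (a linear solve)
    P_pi = P[pi, np.arange(n)]           # row s is P[pi[s], s, :]
    g_pi = costs[pi, np.arange(n)]
    V = np.linalg.solve(np.eye(n) - gamma * P_pi, g_pi)
    # 2. improve it greedily: minimize the one-step lookahead cost
    pi_new = (costs + gamma * P @ V).argmin(axis=0)
    if np.array_equal(pi_new, pi):       # greedy policy unchanged: optimal
        break
    pi = pi_new
```

Each evaluation here is exact; in larger problems the evaluation step itself is often truncated or sampled, which is one road into approximation.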

Theorem 3 (Theorem Idea: Approximate Dynamic Programming) When exact value functions or exact Bellman updates are too expensive, we replace them by approximations:

  • truncated computation
  • sampled expectations
  • restricted function classes
  • projected or parameterized value updates

At first pass, the key lesson is simple:

approximate dynamic programming begins where exact Bellman computation stops fitting the problem size
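One concrete flavor combines three of the ideas above (sampled states, a restricted function class, and a projected value update): fitted value iteration. Everything specific in this sketch, from the dynamics and costs to the quadratic features, is an assumption for illustration:

```python
import numpy as np

# Sketch of fitted value iteration on a continuous state x in [0, 1]:
# sampled states, one-step Bellman targets, and a least-squares fit that
# projects the update onto a small linear feature class. The dynamics,
# costs, and features are all assumed, not taken from a specific system.
rng = np.random.default_rng(0)
gamma = 0.9

def step(x, a):                      # assumed deterministic dynamics
    return np.clip(x + (0.1 if a == 0 else -0.1), 0.0, 1.0)

def cost(x, a):                      # assumed one-step cost
    return x + 0.05 * a

def phi(x):                          # features: the restricted class
    return np.stack([np.ones_like(x), x, x ** 2], axis=-1)

w = np.zeros(3)                      # V_hat(x) = phi(x) @ w
for _ in range(100):
    xs = rng.uniform(0.0, 1.0, 200)          # sampled states
    targets = np.minimum(                     # sampled Bellman targets
        cost(xs, 0) + gamma * phi(step(xs, 0)) @ w,
        cost(xs, 1) + gamma * phi(step(xs, 1)) @ w,
    )
    w, *_ = np.linalg.lstsq(phi(xs), targets, rcond=None)
```

The least-squares fit plays the role of the projection: the Bellman target is computed only at sampled states and then forced back into the function class.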

7 Worked Example

Reuse the discounted two-state maintenance example with states G and B.

Suppose we start value iteration with

\[ V_0(G)=0,\qquad V_0(B)=0. \]

Then the first Bellman update gives:

  • for G, the value stays at 0 because the immediate cost is 0 and V_0 \equiv 0
  • for B, the update compares:
    • repair: cost 4 + \gamma V_0(G)=4
    • wait: cost 1 + \gamma V_0(B)=1

So after one step, the greedy choice in B is wait, and V_1(B)=1.

After more iterations, the future cost of remaining in B starts to accumulate through the updated V_k(B), and eventually repair may become more attractive.

This is the algorithmic point:

  • early iterations see only short-horizon consequences
  • later iterations fold in longer-horizon consequences

Policy iteration on the same example would instead:

  1. start from a candidate policy, say “always wait in B”
  2. evaluate that policy exactly or by solving its policy equation
  3. improve the policy by checking whether repair becomes better anywhere

So value iteration updates values directly, while policy iteration moves through complete policies.
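Running value iteration on this example makes the flip visible. The page fixes the costs (0 in G, repair 4 then restart in G, wait 1 then remain in B); the discount \gamma = 0.9 and the assumed probability 0.2 that G degrades to B are illustrative choices:

```python
# Value iteration on the two-state maintenance example. Costs follow the
# text; gamma = 0.9 and the G -> B degradation probability 0.2 are
# assumptions made here for illustration.
gamma = 0.9
p_degrade = 0.2                      # assumed P(G -> B) per step

V = {"G": 0.0, "B": 0.0}             # V_0(G) = V_0(B) = 0
choice_history = []
for _ in range(200):
    repair = 4.0 + gamma * V["G"]    # cost 4, then start from G
    wait = 1.0 + gamma * V["B"]      # cost 1, then remain in B
    choice_history.append("wait" if wait < repair else "repair")
    V = {
        "G": gamma * ((1 - p_degrade) * V["G"] + p_degrade * V["B"]),
        "B": min(repair, wait),
    }
print(choice_history[0])   # "wait": early iterations see short horizons
print(choice_history[-1])  # "repair": accumulated cost flips the choice
```

With these assumed numbers, repair is optimal in B at the fixed point: waiting forever costs 1/(1-\gamma) = 10, while repairing costs about 9.49.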

8 Computation Lens

When you meet a dynamic-programming algorithm, ask:

  1. what object is being updated: values, policies, or both?
  2. is the Bellman update exact or approximate?
  3. what is expensive here: expectation, maximization or minimization, or storage?
  4. is the state space small enough for tables, or does it force approximation?
  5. what theorem explains convergence: contraction, monotonic improvement, or only an approximate guarantee?

Those questions usually reveal whether you are looking at classical DP, approximate DP, or an RL-style variant.

9 Application Lens

9.1 Planning In Small Finite Models

Value iteration and policy iteration are the canonical exact algorithms for finite discounted MDPs.

9.2 Control And Operations

Queueing, inventory, maintenance, and stochastic scheduling often reduce to repeated Bellman computations of exactly this kind.

9.3 RL And Approximation

Modern RL often reuses the same Bellman logic but replaces exact expectations, exact models, or exact value representations with sampling and approximation.

10 Stop Here For First Pass

If you stop here, retain these five ideas:

  • value iteration means repeated Bellman updates toward the optimal fixed point
  • policy iteration alternates policy evaluation and policy improvement
  • both algorithms are direct consequences of Bellman structure
  • value iteration is usually simpler per update, policy iteration usually stronger per round
  • approximate dynamic programming begins when exact tables or exact Bellman updates become too expensive
