Value Iteration, Policy Iteration, and Approximate Dynamic Programming
value iteration, policy iteration, approximate dynamic programming, Bellman operator, reinforcement learning
1 Role
This is the fourth page of the Stochastic Control and Dynamic Programming module.
Its job is to answer the next natural question after Bellman fixed points:
- if the optimal value is defined by a fixed-point equation, how do we compute it?
This is where the module turns the Bellman equation into actual algorithms.
2 First-Pass Promise
Read this page after Infinite-Horizon Value Functions, Bellman Equations, and Contractions.
If you stop here, you should still understand:
- what value iteration does
- what policy iteration does
- why both algorithms are rooted in Bellman structure
- why approximation becomes unavoidable in large state spaces
3 Why It Matters
The Bellman equation is elegant, but by itself it is only a characterization.
It tells us what the optimal value function must satisfy.
It does not yet tell us how to find that object at scale.
This page introduces the two classic algorithmic answers:
- value iteration: repeatedly apply the Bellman operator
- policy iteration: alternate between evaluating a policy and improving it
Then it makes one more step that matters for modern practice:
- when the state space is large or continuous, exact dynamic programming becomes impossible
- so we approximate value functions, policies, or Bellman updates
That is the entry point to approximate dynamic programming and many RL methods.
4 Prerequisite Recall
- the Bellman optimality operator T maps one candidate value function to another
- under discounted infinite-horizon assumptions, T is a contraction
- the optimal value function V^* is the unique fixed point of T
- a greedy action with respect to V^* is optimal
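Written in the cost notation used in the formal core below, the recalled operator acts on a candidate value function V as
\[ (T V)(s) = \min_{a}\, \mathbb{E}\big[\, g(s,a,S') + \gamma V(S') \,\big], \]
where S' denotes the random next state reached from s under action a.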
5 Intuition
5.1 Value Iteration Repeats The Bellman Update
If the Bellman operator has a unique fixed point and shrinks errors, then a simple idea becomes natural:
- start with any guess V_0
- apply T
- apply T again
- keep going
The contraction theorem says these repeated updates move toward V^*.
5.2 Policy Iteration Solves Easier Problems Along The Way
Instead of updating the value function directly, policy iteration keeps a current policy \pi and alternates:
- evaluate how good that policy is
- improve it by acting greedily with respect to its value
So the algorithm climbs through policies rather than only through value guesses.
5.3 Value Iteration Is Cheap Per Step, Policy Iteration Is Stronger Per Round
At first pass, the tradeoff is:
- value iteration: simpler update per step, but often many iterations
- policy iteration: heavier work per iteration, but often a larger improvement per round
The point is not that one always dominates.
The point is that both are natural consequences of Bellman structure.
5.4 Approximation Starts When Exact Storage Breaks
Exact value tables make sense for small finite state spaces.
But large or continuous systems force a different question:
- what if we cannot store one exact number for every state?
Then we approximate:
- value functions with features or function classes
- policies with parameterized rules
- Bellman updates by sampling or projection
That is approximate dynamic programming at first pass.
6 Formal Core
For a discounted infinite-horizon problem, let T be the Bellman optimality operator.
Definition 1 (Definition: Value Iteration) Given an initial guess V_0, define
\[ V_{k+1}=T V_k. \]
This is the simplest Bellman-based algorithm.
Theorem 1 (Theorem Idea: Value Iteration Converges) Because T is a contraction,
\[ \|V_k - V^*\|_\infty \to 0 \]
as k \to \infty.
At first pass, read this as:
- every Bellman update improves the global value guess
- discounting prevents error from being amplified
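As a concrete sketch, here is a minimal tabular version of this iteration, assuming the model is supplied as hypothetical arrays g[s, a] of expected immediate costs and P[a, s, s'] of transition probabilities (names chosen for this sketch only). The contraction bound \|V_k - V^*\|_\infty \le \gamma^k \|V_0 - V^*\|_\infty is what justifies stopping once the sup-norm change between iterates is small.

```python
import numpy as np

def value_iteration(g, P, gamma, tol=1e-8, max_iters=10_000):
    """Tabular value iteration: V_{k+1} = T V_k for a discounted cost model.

    g[s, a]    : expected immediate cost of taking action a in state s
    P[a, s, :] : transition probabilities over next states under action a
    """
    n_states, _ = g.shape
    V = np.zeros(n_states)                      # arbitrary initial guess V_0
    for _ in range(max_iters):
        # Q[s, a] = g(s, a) + gamma * E[ V(S') | s, a ]
        Q = g + gamma * np.einsum("asn,n->sa", P, V)
        V_new = Q.min(axis=1)                   # minimize over actions (costs)
        if np.max(np.abs(V_new - V)) < tol:     # sup-norm stopping rule
            return V_new
        V = V_new
    return V
```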
Definition 2 (Definition Idea: Policy Evaluation) For a fixed policy \pi, its value function V^\pi satisfies a policy-specific Bellman equation:
\[ V^\pi = T^\pi V^\pi, \]
where T^\pi means: follow policy \pi instead of minimizing over actions.
Definition 3 (Definition Idea: Policy Improvement) Given a value function estimate V, define an improved policy by choosing, in each state s, a greedy action, that is, an action a that minimizes
\[ \mathbb{E}[g(s,a,S')+\gamma V(S')]. \]
So improvement means: use the current value estimate to choose better actions.
Theorem 2 (Theorem Idea: Policy Iteration) Policy iteration alternates:
- evaluate the current policy
- improve it greedily
Under the standard discounted finite-state setup, this process converges to an optimal policy.
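Under the same hypothetical g[s, a] and P[a, s, s'] arrays as in the value-iteration sketch above, the alternation can be sketched for a finite discounted model as follows; evaluation is done exactly by solving the linear system behind V^\pi = T^\pi V^\pi.

```python
import numpy as np

def policy_iteration(g, P, gamma):
    """Alternate exact policy evaluation and greedy policy improvement.

    Evaluation solves the policy-specific Bellman equation
    V = g_pi + gamma * P_pi V as a linear system; improvement acts
    greedily (minimum expected cost) with respect to the evaluated V.
    """
    n_states, _ = g.shape
    policy = np.zeros(n_states, dtype=int)            # arbitrary initial policy
    while True:
        # policy evaluation: solve (I - gamma * P_pi) V = g_pi exactly
        P_pi = P[policy, np.arange(n_states), :]       # row s is P(. | s, pi(s))
        g_pi = g[np.arange(n_states), policy]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, g_pi)
        # policy improvement: pick the action minimizing g(s,a) + gamma E[V(S')]
        Q = g + gamma * np.einsum("asn,n->sa", P, V)
        new_policy = Q.argmin(axis=1)
        if np.array_equal(new_policy, policy):         # no change: policy is optimal
            return policy, V
        policy = new_policy
```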
Theorem 3 (Theorem Idea: Approximate Dynamic Programming) When exact value functions or exact Bellman updates are too expensive, we replace them by approximations:
- truncated computation
- sampled expectations
- restricted function classes
- projected or parameterized value updates
At first pass, the key lesson is simple:
approximate dynamic programming begins where exact Bellman computation stops fitting the problem size
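One common way these replacements combine is fitted value iteration: sample states, form sampled Bellman targets, and project the targets onto a restricted function class. The sketch below assumes a linear feature class and treats sample_states, sample_next, cost, and phi as hypothetical helpers for illustration; none of them are defined on this page.

```python
import numpy as np

def fitted_value_iteration(sample_states, sample_next, cost, phi, actions,
                           gamma, n_rounds=50, n_samples=500, n_next=20):
    """Approximate value iteration: sampled Bellman targets + linear projection.

    The value function is represented as V(s) ~ phi(s) @ w.  Each round
    samples states, forms targets min_a mean_j[ cost(s,a,s'_j) + gamma*V(s'_j) ],
    and refits w by least squares (projection onto the feature class).
    sample_states, sample_next, cost, and phi are assumed helper callables.
    """
    w = np.zeros(len(phi(sample_states(1)[0])))         # initial weight vector
    for _ in range(n_rounds):
        states = sample_states(n_samples)
        features, targets = [], []
        for s in states:
            q_values = []
            for a in actions:
                successors = sample_next(s, a, n_next)   # sampled expectation
                q = np.mean([cost(s, a, sn) + gamma * (phi(sn) @ w)
                             for sn in successors])
                q_values.append(q)
            features.append(phi(s))
            targets.append(min(q_values))                # greedy (min-cost) backup
        # projected Bellman update: least-squares fit of the sampled targets
        w, *_ = np.linalg.lstsq(np.array(features), np.array(targets), rcond=None)
    return w
```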
7 Worked Example
Reuse the discounted two-state maintenance example with states G and B.
Suppose we start value iteration with
\[ V_0(G)=0,\qquad V_0(B)=0. \]
Then the first Bellman update gives:
- for G, the value stays near 0 because the immediate cost is 0
- for B, the update compares:
  - repair: cost 4 + \gamma V_0(G) = 4
  - wait: cost 1 + \gamma V_0(B) = 1
So after one step the greedy choice at B is wait, and V_1(B) = 1.
After more iterations, the future cost of remaining in B starts to accumulate through the updated V_k(B), and eventually repair may become more attractive.
This is the algorithmic point:
- early iterations see only short-horizon consequences
- later iterations fold in longer-horizon consequences
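To make this concrete, the short script below runs the same update numerically. The page fixes only the costs 0, 1, and 4; the discount factor gamma = 0.9 and the transition structure (G stays in G at cost 0, wait keeps the system in B, repair returns it to G) are illustrative assumptions.

```python
# States: G (good) and B (bad).  Assumed dynamics for illustration:
# G stays in G at cost 0; in B, "wait" stays in B at cost 1 and
# "repair" returns to G at cost 4.  Discount factor gamma = 0.9 is assumed.
gamma = 0.9
V_G, V_B = 0.0, 0.0                          # V_0(G) = V_0(B) = 0

for k in range(1, 51):
    wait   = 1.0 + gamma * V_B               # cost 1 + gamma * V_{k-1}(B)
    repair = 4.0 + gamma * V_G               # cost 4 + gamma * V_{k-1}(G)
    V_G, V_B = 0.0 + gamma * V_G, min(wait, repair)
    if k in (1, 2, 5, 10, 50):
        choice = "wait" if wait < repair else "repair"
        print(f"k={k:2d}  V_k(B)={V_B:5.2f}  greedy action in B: {choice}")

# Early iterations prefer wait (V_1(B) = 1); as V_k(B) accumulates the
# discounted cost of staying in B, repair becomes the cheaper choice
# (here from the fifth update on, where V_k(B) settles at 4).
```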
Policy iteration on the same example would instead:
- start from a candidate policy, say “always wait in B”
- evaluate that policy exactly by solving its policy equation
- improve the policy by checking whether repair becomes better anywhere
So value iteration updates values directly, while policy iteration moves through complete policies.
8 Computation Lens
When you meet a dynamic-programming algorithm, ask:
- what object is being updated: values, policies, or both?
- is the Bellman update exact or approximate?
- what is expensive here: expectation, maximization or minimization, or storage?
- is the state space small enough for tables, or does it force approximation?
- what theorem explains convergence: contraction, monotonic improvement, or only an approximate guarantee?
Those questions usually reveal whether you are looking at classical DP, approximate DP, or an RL-style variant.
9 Application Lens
9.1 Planning In Small Finite Models
Value iteration and policy iteration are the canonical exact algorithms for finite discounted MDPs.
9.2 Control And Operations
Queueing, inventory, maintenance, and stochastic scheduling often reduce to repeated Bellman computations of exactly this kind.
9.3 RL And Approximation
Modern RL often reuses the same Bellman logic but replaces exact expectations, exact models, or exact value representations with sampling and approximation.
10 Stop Here For First Pass
If you stop here, retain these five ideas:
- value iteration means repeated Bellman updates toward the optimal fixed point
- policy iteration alternates policy evaluation and policy improvement
- both algorithms are direct consequences of Bellman structure
- value iteration is usually simpler per update, policy iteration usually stronger per round
- approximate dynamic programming begins when exact tables or exact Bellman updates become too expensive
11 Optional Deeper Reading After First Pass
- MIT 6.231: Dynamic Programming and Stochastic Control - official lecture-slide index covering value iteration, policy iteration, and approximate dynamic programming. Checked 2026-04-25.
- MIT 6.231 lecture 4 - official notes page for value iteration and related infinite-horizon algorithms. Checked 2026-04-25.
- MIT 6.231 lecture 5 - official notes page for policy iteration and approximate dynamic programming directions. Checked 2026-04-25.
- Stanford MS&E 235A lecture 5 - official current notes for computational dynamic-programming algorithms in MDPs. Checked 2026-04-25.
- Stanford MS&E 235A lecture 6 - official current notes continuing algorithmic MDP methods. Checked 2026-04-25.
- Stanford AA228 / CS238 - official current course page connecting exact DP ideas to planning and RL. Checked 2026-04-25.
12 Sources and Further Reading
- MIT 6.231: Dynamic Programming and Stochastic Control - First pass - official slide index for the stochastic-control and dynamic-programming arc. Checked 2026-04-25.
- MIT 6.231 lecture 4 - First pass - official notes page for value iteration and discounted infinite-horizon computation. Checked 2026-04-25.
- MIT 6.231 lecture 5 - First pass - official notes page for policy iteration and approximate-dynamic-programming directions. Checked 2026-04-25.
- Stanford MS&E 235A / EE 283: Markov Decision Processes - First pass - official current course page for stochastic decision-making and dynamic programming. Checked 2026-04-25.
- Stanford MS&E 235A lecture 5 - First pass - official current notes for computational MDP algorithms. Checked 2026-04-25.
- Stanford MS&E 235A lecture 6 - First pass - official current notes continuing the algorithmic MDP thread. Checked 2026-04-25.
- Stanford AA228 / CS238 - Second pass - official current course page with planning and RL-facing framing. Checked 2026-04-25.