Finite-Horizon Dynamic Programming and Backward Induction
dynamic programming, Bellman recursion, backward induction, finite horizon, value function
1 Role
This is the second page of the Stochastic Control and Dynamic Programming module.
Its job is to explain the first core algorithmic idea of the field:
- break a sequential decision problem into stages
- define a value or cost-to-go function at each remaining time
- solve from the end backward
This is the point where the MDP model becomes a computable optimization procedure.
2 First-Pass Promise
Read this page after Controlled Markov Models, Policies, and Cost Functionals.
If you stop here, you should still understand:
- what the principle of optimality says
- why finite-horizon problems are solved backward in time
- what a Bellman recursion is
- how an optimal policy is extracted from value functions
3 Why It Matters
A sequential decision problem may look impossible at first:
- many time steps
- many possible actions
- future randomness
- future decisions depending on future states
Brute force would try to optimize over all full policies at once.
Dynamic programming says there is a better way.
At a first pass:
- define the best future performance starting from a given state and time
- treat that future performance as a number attached to the state
- optimize the current action using that number
- recurse backward until the first decision
This is the idea behind Bellman recursions, backward induction, and eventually value iteration and many RL algorithms.
4 Prerequisite Recall
- an MDP specifies states, actions, a transition law, and a reward or cost objective
- a policy induces a stochastic trajectory
- a finite horizon means there are only T stages before a terminal time
- expectation is needed because one action leads to a distribution over next states rather than one deterministic next state
5 Intuition
5.1 The Last Decision Is The Easiest
Suppose there is only one decision left before the terminal time.
Then the problem is simple:
- evaluate each action
- include the immediate cost and terminal effect
- pick the best one
So the end of the horizon is easy.
5.2 Use The Future Solution To Solve The Present
Now suppose you know the optimal cost-to-go for every state at time t+1.
Then at time t, you can evaluate each action by combining:
- current stage cost
- expected optimal future cost after the transition
That turns the current step into a one-step optimization problem.
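As a sketch, the one-step combination looks like the function below. Here `stage_cost`, `P` (tabular transition probabilities), and `J_next` are hypothetical container names chosen for illustration, not notation from the text:

```python
# One-step Bellman backup at a single state s, given J_{t+1} as J_next.
# `stage_cost`, `P`, and `J_next` are assumed tabular containers.

def backup(s, actions, stage_cost, P, J_next):
    """Optimal cost-to-go at state s: minimize, over actions,
    the stage cost plus the expected optimal future cost."""
    best = float("inf")
    for a in actions:
        q = stage_cost[s][a] + sum(p * J_next[s2]
                                   for s2, p in P[s][a].items())
        best = min(best, q)
    return best

# tiny usage: one state, two actions, two possible next states
stage_cost = {0: {"a": 1.0, "b": 2.0}}
P = {0: {"a": {0: 0.5, 1: 0.5}, "b": {0: 1.0}}}
J_next = {0: 0.0, 1: 10.0}
best = backup(0, ["a", "b"], stage_cost, P, J_next)   # action "b" wins: 2.0
```

Note that `backup` never looks past time t+1: everything about the remaining horizon is already folded into `J_next`.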
5.3 This Is The Principle Of Optimality
The point is not that the future disappears.
It is that the future can be summarized by the next value function.
So the optimal decision now depends on the next state only through its optimal future value.
5.4 Backward Induction Is Dynamic Programming Over Time
Because the terminal time is easiest, we start there and move backward:
- terminal condition first
- then time T-1
- then time T-2
- and so on until time 0
That is why finite-horizon dynamic programming is naturally a backward algorithm.
6 Formal Core
For a first pass, use a finite-horizon cost-minimization problem with horizon T.
Definition 1 (Definition: Cost-to-Go Function) Let J_t^*(s) denote the optimal expected future cost from time t onward if the current state is s.
This function summarizes the whole remaining problem from the point (t,s).
Theorem 1 (Theorem Idea: Principle of Optimality) If a policy is optimal from time t onward, then after the first decision is made and the next state is realized, the remaining decisions must form an optimal policy for the resulting subproblem.
At a first pass, read this as:
an optimal strategy must remain optimal on every tail subproblem it creates
Theorem 2 (Theorem Idea: Finite-Horizon Bellman Recursion) With terminal cost g_T, the optimal cost-to-go satisfies
\[ J_T^*(s)=g_T(s), \]
and for t=T-1,T-2,\dots,0,
\[ J_t^*(s)=\min_{a\in A(s)} \mathbb{E}\!\left[g_t(s,a,S_{t+1}) + J_{t+1}^*(S_{t+1}) \mid S_t=s, A_t=a\right]. \]
This is the Bellman recursion for the finite-horizon problem.
If the problem is reward-maximizing, the \min becomes \max.
Theorem 3 (Theorem Idea: Policy Extraction) An optimal action at time t and state s is any action that attains the minimum in the Bellman recursion.
So once the value functions are known, the optimal policy is recovered one stage at a time.
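For a small tabular problem (states, actions, and transition probabilities stored explicitly), the recursion with policy extraction can be sketched as follows; the function name and data-structure choices are assumptions made for illustration:

```python
# A minimal tabular backward-induction sketch. The symbols P, g, g_T, T
# mirror the notation above; the data structures are assumed.

def backward_induction(states, actions, P, g, g_T, T):
    """Return value functions J[t][s] and an optimal policy pi[t][s].

    P[s][a]        : dict mapping next state s2 -> probability
    g(t, s, a, s2) : stage cost; g_T(s) : terminal cost
    """
    J = [dict() for _ in range(T + 1)]
    pi = [dict() for _ in range(T)]
    for s in states:
        J[T][s] = g_T(s)                          # terminal condition
    for t in range(T - 1, -1, -1):                # t = T-1, ..., 0
        for s in states:
            best_a, best_q = None, float("inf")
            for a in actions(s):
                # stage cost plus expected optimal future cost
                q = sum(p * (g(t, s, a, s2) + J[t + 1][s2])
                        for s2, p in P[s][a].items())
                if q < best_q:
                    best_a, best_q = a, q
            J[t][s] = best_q
            pi[t][s] = best_a                     # attains the minimum
    return J, pi
```

Policy extraction here is nothing more than recording the minimizing action at each (t, s), exactly as the policy-extraction theorem states.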
7 Worked Example
Consider a two-stage queue-control problem.
State: S_t = current queue length
Action: A_t ∈ {slow, fast}
Costs:
- immediate congestion cost proportional to queue length
- extra control cost if fast is used
- terminal cost equal to the final queue length
Backward induction works like this:
- start at time T with terminal cost J_T^*(s) = s
- compute J_{T-1}^*(s) by comparing:
  - slow service: lower control cost, worse expected next queue
  - fast service: higher control cost, better expected next queue
- once J_{T-1}^* is known, compute J_{T-2}^* the same way using the already-computed future values
What matters at first pass is the pattern:
- the terminal rule is explicit
- the penultimate step uses the terminal rule
- the earlier step uses the penultimate rule
So the full sequential problem is solved by a chain of one-step optimizations.
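The chain of one-step optimizations can be made concrete with a small numerical version of this queue example. The transition probabilities and the fast-service cost below are illustrative assumptions; only the cost structure (congestion plus control plus terminal queue length) comes from the example:

```python
# Two-stage queue control by backward induction. Transition probabilities
# and FAST_COST are illustrative assumptions, not given in the text.

T, MAX_Q, FAST_COST = 2, 5, 1.0

def step_dist(s, a):
    """Next-queue distribution: slow tends to grow, fast tends to shrink."""
    if a == "fast":
        moves = [(max(s - 1, 0), 0.8), (s, 0.2)]
    else:  # slow
        moves = [(min(s + 1, MAX_Q), 0.5), (s, 0.5)]
    dist = {}
    for s2, p in moves:
        dist[s2] = dist.get(s2, 0.0) + p          # merge duplicate states
    return dist

J = {T: {s: float(s) for s in range(MAX_Q + 1)}}  # terminal cost J_T(s) = s
policy = {}
for t in range(T - 1, -1, -1):
    J[t], policy[t] = {}, {}
    for s in range(MAX_Q + 1):
        costs = {}
        for a in ("slow", "fast"):
            control = FAST_COST if a == "fast" else 0.0
            costs[a] = s + control + sum(p * J[t + 1][s2]
                                         for s2, p in step_dist(s, a).items())
        policy[t][s] = min(costs, key=costs.get)
        J[t][s] = costs[policy[t][s]]
```

With these particular numbers, the extracted policy uses slow service at an empty queue and fast service at small nonempty queues, and J[0] summarizes the whole two-stage problem from each starting queue length.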
8 Computation Lens
When you see a finite-horizon Bellman problem, ask:
- what is the state?
- what is the terminal cost or reward?
- what quantity is being stored in the value function: total cost, discounted reward, or something else?
- what is the expectation taken over?
- once J_{t+1}^* is known, what one-step optimization remains at time t?
Those questions usually make the recursion readable immediately.
9 Application Lens
9.1 Planning With A Deadline
Many operations problems have natural horizons: inventory over a season, queue control over a shift, mission planning over a finite window.
9.2 Control Over A Finite Task
Trajectory optimization, finite-horizon LQR, and MPC all reuse this same backward-cost-to-go logic, though sometimes in more structured or constrained forms.
9.3 RL Foundations
Even when modern RL uses approximation or sampling, the underlying Bellman idea often begins with the finite-horizon dynamic-programming picture.
10 Stop Here For First Pass
If you stop here, retain these five ideas:
- finite-horizon sequential problems are naturally solved from the end backward
- the value function summarizes the optimal future from a given state and time
- the principle of optimality justifies solving tail subproblems recursively
- the Bellman recursion is a one-step optimization plus expected future value
- an optimal policy is extracted from the action that attains the Bellman minimum or maximum
That is enough to read most first-pass finite-horizon dynamic-programming statements without getting lost.
11 Go Deeper
The next natural step in this module is:
The strongest adjacent live pages are:
12 Optional Deeper Reading After First Pass
If you want a stronger second pass on the same ideas, use:
- MIT 6.231: Dynamic Programming and Stochastic Control for the full official lecture-slide arc from finite-horizon DP into infinite-horizon and approximate methods. Checked 2026-04-25.
- MIT 6.231 lecture 2 for the principle of optimality, the basic problem, and the general DP algorithm. Checked 2026-04-25.
- Stanford MS&E 235A notes index for the current official sequence of lecture notes in an MDP course. Checked 2026-04-25.
- Stanford MS&E 235A / EE 283: Markov Decision Processes for a current official course page that connects modeling and algorithms. Checked 2026-04-25.
- Stanford AA228 / CS238 for a current computational route into dynamic programming and planning under uncertainty. Checked 2026-04-25.
13 Sources and Further Reading
- MIT 6.231: Dynamic Programming and Stochastic Control - First pass - official slide index for the whole DP and stochastic-control arc. Checked 2026-04-25.
- MIT 6.231 lecture 2 - First pass - official resource for the principle of optimality and the basic dynamic-programming algorithm. Checked 2026-04-25.
- Stanford MS&E 235A notes index - First pass - official current note index for a full MDP course. Checked 2026-04-25.
- Stanford MS&E 235A / EE 283: Markov Decision Processes - First pass - official current course page for sequential decision under uncertainty. Checked 2026-04-25.
- Stanford AA228 / CS238 - Second pass - official current course page for computational planning and decision making under uncertainty. Checked 2026-04-25.