Controlled Markov Models, Policies, and Cost Functionals
Markov decision process, policy, transition law, reward, cost functional
1 Role
This is the first page of the Stochastic Control and Dynamic Programming module.
Its job is to introduce the basic object that the rest of the module studies:
- a state that evolves over time
- actions that influence that evolution
- randomness in the transition
- a policy that chooses actions
- a cumulative objective that scores the resulting trajectory
This is the point where control, planning, and reinforcement learning all start talking about the same mathematical object.
2 First-Pass Promise
Read this page first in the module.
If you stop here, you should still understand:
- what an MDP or controlled Markov model is
- what a policy is
- what it means for the state process to be Markov
- how sequential objectives are written as reward or cost functionals
3 Why It Matters
In static optimization, you choose one variable and optimize one objective.
In sequential decision-making, the problem is harder:
- decisions happen over time
- the future depends on both your actions and randomness
- the current choice affects what options you will face later
That is why the right object is no longer a single optimization variable.
It is a controlled stochastic process.
At a first pass:
- the state summarizes the information needed to make the next decision
- the action changes the distribution of the next state
- the policy tells us what action to choose
- the objective scores the entire path, not only one step
This viewpoint is the load-bearing abstraction behind stochastic control, dynamic programming, and a large part of RL.
4 Prerequisite Recall
- from Control and Dynamics, a state captures the memory the future still needs from the past
- from Probability, conditional distributions describe what randomness remains after conditioning on current information
- from optimization, a decision rule becomes meaningful only after an objective is fixed
5 Intuition
5.1 The Markov State Is A Sufficient Memory
The Markov idea is not that the past does not matter.
It is that the state has been chosen so that the relevant effect of the past is already encoded.
So once the current state and action are known, the distribution of the next state no longer needs the whole earlier history.
5.2 Actions Shape Distributions, Not Just Deterministic Updates
In deterministic control, an action may map one state to one next state.
In stochastic control, an action usually changes a transition law.
So choosing an action means choosing among probability distributions over what might happen next.
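As a toy numerical illustration (the numbers here are invented, not taken from any particular model): suppose a state s has two possible next states. Then the two actions might give
\[ P(\,\cdot \mid s, \text{slow}) = (0.7,\ 0.3), \qquad P(\,\cdot \mid s, \text{fast}) = (0.2,\ 0.8), \]
and choosing an action is literally choosing which of these distributions the next state will be drawn from.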
5.3 A Policy Is A Rule, Not A One-Time Choice
A policy is not a single action chosen once.
It is a rule for choosing actions across time, possibly as a function of the current state and time index.
That is why sequential decision-making is inherently more structured than one-shot optimization.
5.4 The Objective Lives On Whole Trajectories
We rarely care about just one reward at a single time step.
Instead we care about things like:
- total reward over a horizon
- discounted sum of future rewards
- average long-run cost
- probability of failure before termination
So the object to optimize is a trajectory-level functional induced by the policy.
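With rewards r or costs c, a discount factor \gamma, and a horizon T as in the formal core below, the first three of these objectives take, for example, the standard forms
\[ \mathbb{E}\left[\sum_{t=0}^{T-1} r_t(S_t,A_t)\right], \qquad \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(S_t,A_t)\right], \qquad \lim_{T\to\infty}\frac{1}{T}\,\mathbb{E}\left[\sum_{t=0}^{T-1} c(S_t,A_t)\right], \]
i.e. total reward over a horizon, discounted reward, and long-run average cost.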
6 Formal Core
For a first pass, think of a discrete-time controlled Markov model.
Definition 1 (Definition Idea: Controlled Markov Model / MDP) A Markov decision process is typically specified by:
- a state space S
- an action space A
- a transition law P(s'|s,a)
- a reward or cost function
- sometimes an initial-state distribution and a horizon or discount factor
The central Markov property is:
once the current state and action are known, the next-state distribution depends on nothing else from the past
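A minimal sketch of this specification as plain data, not tied to any particular library; the two-state, two-action numbers below are invented purely for illustration:

```python
# Minimal sketch of an MDP specification as plain data (numbers invented for
# illustration): states, actions, a transition law P(s' | s, a), and a reward.
states = ["low", "high"]          # S
actions = ["slow", "fast"]        # A

# transition[(s, a)] is a dict {s': probability}, i.e. one distribution per (s, a)
transition = {
    ("low",  "slow"): {"low": 0.7, "high": 0.3},
    ("low",  "fast"): {"low": 0.9, "high": 0.1},
    ("high", "slow"): {"low": 0.2, "high": 0.8},
    ("high", "fast"): {"low": 0.6, "high": 0.4},
}

# reward[(s, a)] is the one-step reward r(s, a); fast service costs more to run
reward = {
    ("low",  "slow"):  1.0,
    ("low",  "fast"):  0.5,
    ("high", "slow"): -1.0,
    ("high", "fast"): -0.5,
}

# The Markov property is built into the data structure: the next-state
# distribution is looked up from (s, a) alone, never from the earlier history.
def next_state_distribution(s, a):
    return transition[(s, a)]
```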
Definition 2 (Definition: Policy) A policy is a rule that maps currently available information to an action.
At a first pass, the most common object is a state-feedback policy:
\[ a_t = \pi_t(s_t) \]
for finite horizon, or
\[ a_t = \pi(s_t) \]
for a stationary policy.
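In code, a policy is nothing more than a callable rule; a hedged sketch, with the threshold behavior invented here for illustration:

```python
# Sketch: a policy maps currently available information to an action.

def stationary_policy(s):
    # a_t = pi(s_t): the same rule at every time step
    return "fast" if s == "high" else "slow"

def finite_horizon_policy(t, s, T=10):
    # a_t = pi_t(s_t): the rule may also depend on the time index,
    # e.g. stop paying for fast service near the end of the horizon
    if t >= T - 2:
        return "slow"
    return "fast" if s == "high" else "slow"
```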
Definition 3 (Definition Idea: Cost Functional) A common finite-horizon objective is
\[ J^\pi = \mathbb{E}\left[\sum_{t=0}^{T-1} c_t(S_t,A_t) + c_T(S_T)\right]. \]
A common reward version replaces c_t with rewards r_t and asks for maximization instead.
This is the basic object that later Bellman recursions will optimize.
Theorem 1 (Theorem Idea: A Policy Induces A Controlled Stochastic Process) Once a policy \pi is fixed, the controlled model induces a probability law on trajectories
\[ (S_0,A_0,S_1,A_1,\dots). \]
So the policy is what converts the abstract model into an actual stochastic system with a measurable objective value.
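A hedged sketch of this statement in code: once the policy is fixed, trajectories can literally be sampled, and the cost functional from Definition 3 becomes an expectation we can estimate by Monte Carlo. The toy transition and cost numbers are invented, and the terminal cost is taken to be zero:

```python
import random

# Fixing a policy turns the MDP data into a trajectory simulator, and
# J^pi can then be estimated by averaging simulated trajectory costs.
transition = {
    ("low",  "slow"): {"low": 0.7, "high": 0.3},
    ("low",  "fast"): {"low": 0.9, "high": 0.1},
    ("high", "slow"): {"low": 0.2, "high": 0.8},
    ("high", "fast"): {"low": 0.6, "high": 0.4},
}
cost = {("low", "slow"): 0.0, ("low", "fast"): 0.5,
        ("high", "slow"): 2.0, ("high", "fast"): 1.5}

def policy(s):
    # a stationary state-feedback policy pi(s)
    return "fast" if s == "high" else "slow"

def sample_trajectory(s0, T):
    # Simulate (S_0, A_0, S_1, A_1, ..., S_T) under the fixed policy
    # and accumulate the running cost (terminal cost taken to be zero).
    s, total = s0, 0.0
    for _ in range(T):
        a = policy(s)
        total += cost[(s, a)]
        dist = transition[(s, a)]
        s = random.choices(list(dist.keys()), weights=list(dist.values()))[0]
    return total

def estimate_J(s0="low", T=20, n_runs=5000):
    # Monte Carlo estimate of J^pi = E[ sum_{t<T} c(S_t, A_t) ]
    return sum(sample_trajectory(s0, T) for _ in range(n_runs)) / n_runs

print(estimate_J())
```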
7 Worked Example
Consider a queue with current length S_t.
At each time:
- a new customer arrives with some probability
- the controller chooses A_t in {slow, fast}
- the fast mode serves more aggressively but may incur a larger operating cost
This is a controlled Markov model because:
- the state is the current queue length
- the action is the chosen service mode
- the next-state distribution depends on arrival randomness and service success
- a cost might combine queue length and service effort
A natural policy question is:
when should we switch from slow service to fast service?
That already has the shape of a real stochastic control problem:
- current congestion matters
- actions change the transition probabilities
- one-step choices trade future congestion against current cost
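A minimal simulation sketch of this queue as a controlled Markov model; the arrival probability, service probabilities, cost weights, and threshold are invented here purely for illustration:

```python
import random

# Queue example: state is the queue length, action is the service mode,
# the next state depends on random arrivals and mode-dependent service,
# and the cost combines congestion with operating effort.
P_ARRIVAL = 0.4
P_SERVE = {"slow": 0.3, "fast": 0.7}   # chance of completing a service this step
EFFORT_COST = {"slow": 0.0, "fast": 1.0}

def step(queue_length, mode):
    # One transition of the queue: random arrival plus mode-dependent service.
    arrivals = 1 if random.random() < P_ARRIVAL else 0
    served = 1 if queue_length > 0 and random.random() < P_SERVE[mode] else 0
    next_length = queue_length + arrivals - served
    cost = queue_length + EFFORT_COST[mode]   # congestion plus operating effort
    return next_length, cost

def threshold_policy(queue_length, threshold=3):
    # The "when should we switch to fast service?" question as a
    # one-parameter family of policies.
    return "fast" if queue_length >= threshold else "slow"

# Simulate one trajectory of the induced process under the threshold policy.
s, total_cost = 0, 0.0
for t in range(50):
    a = threshold_policy(s)
    s, c = step(s, a)
    total_cost += c
print(total_cost)
```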
8 Computation Lens
When you see a sequential decision problem, ask:
- what is the state, and does it really capture the needed memory?
- what choices are actions versus what enters as random disturbance?
- what class of policies is being optimized over?
- is the objective finite-horizon, discounted, average-cost, or stopping-based?
- if the policy were fixed, what stochastic process would it induce?
Those questions usually reveal the real mathematical skeleton of the problem.
9 Application Lens
9.1 Control Under Uncertainty
This is the stochastic extension of state-space control: actions steer a system whose next state is random.
9.2 Planning And Operations
Inventory, queues, pricing, resource allocation, and maintenance all fit naturally into the same template.
9.3 RL
RL often starts from the same MDP object, but then shifts attention toward learning good policies from data, samples, or interaction rather than assuming the full model is known.
10 Stop Here For First Pass
If you stop here, retain these five ideas:
- an MDP is a controlled stochastic state process
- the Markov state is a sufficient summary of the past for predicting the next step under the current action
- a policy is a decision rule over time, not a one-shot choice
- the action changes the transition law, not only the immediate reward
- the objective is a trajectory-level reward or cost functional
That is enough to read later Bellman and dynamic-programming pages without getting lost in notation.
11 Go Deeper
The next natural step in this module is the Bellman-recursion and dynamic-programming material referenced above; adjacent material is listed in the reading sections below.
12 Optional Deeper Reading After First Pass
If you want a stronger second pass on the same ideas, use:
- MIT 6.231: Dynamic Programming and Stochastic Control for the broad official lecture-slide arc from modeling to algorithms. Checked 2026-04-25.
- Stanford MS&E 235A / EE 283: Markov Decision Processes for a current official course page centered exactly on MDP formulation and solution. Checked 2026-04-25.
- Stanford MS&E 235A lecture 1 for a focused current note on MDP specification and transition probabilities. Checked 2026-04-25.
- Stanford MS&E 235A lecture 3 for a focused current note on reward functions and decision objectives. Checked 2026-04-25.
- Stanford AA228 / CS238 for a current computational decision-making-under-uncertainty route with MDP, POMDP, and RL applications. Checked 2026-04-25.
13 Sources and Further Reading
- MIT 6.231: Dynamic Programming and Stochastic Control - First pass - official lecture-slide index for the modeling and dynamic-programming backbone of the field. Checked 2026-04-25.
- Stanford MS&E 235A / EE 283: Markov Decision Processes - First pass - official current course page for the MDP viewpoint and its role in sequential decision-making. Checked 2026-04-25.
- Stanford MS&E 235A lecture 1 - First pass - official current note for transition-law specification and the MDP tuple. Checked 2026-04-25.
- Stanford MS&E 235A lecture 3 - First pass - official current note for rewards, utility, and objective specification. Checked 2026-04-25.
- Stanford AA228 / CS238 - Second pass - official current course page connecting MDPs to broader decision-making under uncertainty and RL applications. Checked 2026-04-25.