Controlled Markov Models, Policies, and Cost Functionals

How Markov decision processes model sequential decision-making under uncertainty through states, actions, transition laws, policies, and objectives.
Modified: April 26, 2026

Keywords

Markov decision process, policy, transition law, reward, cost functional

1 Role

This is the first page of the Stochastic Control and Dynamic Programming module.

Its job is to introduce the basic object that the rest of the module studies:

  • a state that evolves over time
  • actions that influence that evolution
  • randomness in the transition
  • a policy that chooses actions
  • a cumulative objective that scores the resulting trajectory

This is the point where control, planning, and reinforcement learning all start talking about the same mathematical object.

2 First-Pass Promise

Read this page first in the module.

If you stop here, you should still understand:

  • what an MDP or controlled Markov model is
  • what a policy is
  • what it means for the state process to be Markov
  • how sequential objectives are written as reward or cost functionals

3 Why It Matters

In static optimization, you choose one variable and optimize one objective.

In sequential decision-making, the problem is harder:

  • decisions happen over time
  • the future depends on both your actions and randomness
  • the current choice affects what options you will face later

That is why the right object is no longer a single optimization variable.

It is a controlled stochastic process.

At a first pass:

  • the state summarizes the information needed to make the next decision
  • the action changes the distribution of the next state
  • the policy tells us what action to choose
  • the objective scores the entire path, not only one step

This viewpoint is the load-bearing abstraction behind stochastic control, dynamic programming, and a large part of RL.

4 Prerequisite Recall

  • from Control and Dynamics, a state captures the memory the future still needs from the past
  • from Probability, conditional distributions describe what randomness remains after conditioning on current information
  • from Optimization, a decision rule becomes meaningful only after an objective is fixed

5 Intuition

5.1 The Markov State Is A Sufficient Summary Of The Past

The Markov idea is not that the past does not matter.

It is that the state has been chosen so that the relevant effect of the past is already encoded.

So once the current state and action are known, the distribution of the next state no longer needs the whole earlier history.

5.2 Actions Shape Distributions, Not Just Deterministic Updates

In deterministic control, an action may map one state to one next state.

In stochastic control, an action usually changes a transition law.

So choosing an action means choosing among probability distributions over what might happen next.

5.3 A Policy Is A Rule, Not A One-Time Choice

A policy is not a single action chosen once.

It is a rule for choosing actions across time, possibly as a function of the current state and time index.

That is why sequential decision-making is inherently more structured than one-shot optimization.

5.4 The Objective Lives On Whole Trajectories

We do not usually care only about one reward at one time.

Instead we care about things like:

  • total reward over a horizon
  • discounted sum of future rewards
  • average long-run cost
  • probability of failure before termination

So the object to optimize is a trajectory-level functional induced by the policy.

6 Formal Core

For a first pass, think of a discrete-time controlled Markov model.

Definition 1 (Definition Idea: Controlled Markov Model / MDP) A Markov decision process is typically specified by:

  • a state space S
  • an action space A
  • a transition law P(s'|s,a)
  • a reward or cost function
  • sometimes an initial-state distribution and a horizon or discount factor

The central Markov property is:

once the current state and action are known, the next-state distribution depends on nothing else from the past
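As a concrete sketch, the five ingredients above can be written out for a hypothetical two-state, two-action model. All numbers here are illustrative placeholders, not from the text:

```python
import random

# Hypothetical MDP: states 0 ("low") and 1 ("high"), actions 0 ("stay") and 1 ("push").
S = [0, 1]
A = [0, 1]

# Transition law: P[s][a][s2] = probability of moving to s2 from s under action a.
# Each inner dict is a probability distribution over next states.
P = {
    0: {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}},
    1: {0: {0: 0.2, 1: 0.8}, 1: {0: 0.5, 1: 0.5}},
}

def cost(s, a):
    """One-step cost c(s, a): being in the high state and pushing both cost."""
    return s + 0.5 * a

def step(s, a, rng=random):
    """Sample the next state from P(. | s, a) by inverse-transform sampling."""
    u, acc = rng.random(), 0.0
    for s_next, p in P[s][a].items():
        acc += p
        if u < acc:
            return s_next
    return s_next  # guard against floating-point round-off
```

The Markov property is built into the data structure: `step` consults only the current state and action, never any earlier history.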

Definition 2 (Definition: Policy) A policy is a rule that maps currently available information to an action.

At a first pass, the most common object is a state-feedback policy:

\[ a_t = \pi_t(s_t) \]

for finite horizon, or

\[ a_t = \pi(s_t) \]

for a stationary policy.
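In code, the distinction between the two policy forms is just the presence of the time argument. The rules below are hypothetical examples for a two-state model:

```python
def pi_stationary(s):
    """A stationary state-feedback policy: the same rule s -> a at every time."""
    return 1 if s == 1 else 0  # hypothetical rule: "push" only in the high state

def pi_finite_horizon(t, s, T=5):
    """A time-dependent policy (t, s) -> a for a finite horizon T."""
    # hypothetical rule: stop pushing at the last decision epoch
    return 0 if t >= T - 1 else pi_stationary(s)
```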

Definition 3 (Definition Idea: Cost Functional) A common finite-horizon objective is

\[ J^\pi = \mathbb{E}\left[\sum_{t=0}^{T-1} c_t(S_t,A_t) + c_T(S_T)\right]. \]

A common reward version replaces c_t with rewards r_t and asks for maximization instead.

This is the basic object that later Bellman recursions will optimize.
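For a small finite model, the expectation defining J^\pi can be computed exactly by pushing the state distribution forward one step at a time and accumulating expected costs. This is plain policy evaluation, not yet a Bellman recursion; the model and policy below are hypothetical:

```python
# Hypothetical two-state model: transition law, running cost, terminal cost, policy.
P = {
    0: {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}},
    1: {0: {0: 0.2, 1: 0.8}, 1: {0: 0.5, 1: 0.5}},
}
def c(s, a):       # running cost c_t(s, a), taken time-invariant here
    return s + 0.5 * a
def c_T(s):        # terminal cost c_T(s)
    return s
def pi(s):         # stationary policy
    return 1 if s == 1 else 0

def J(T, init=None):
    """Exact finite-horizon cost: E[sum_{t<T} c(S_t, A_t) + c_T(S_T)]."""
    dist = dict(init or {0: 1.0, 1: 0.0})   # distribution of S_0
    total = 0.0
    for _ in range(T):
        # expected running cost at this step, then push the distribution forward
        total += sum(p * c(s, pi(s)) for s, p in dist.items())
        nxt = {0: 0.0, 1: 0.0}
        for s, p in dist.items():
            for s2, q in P[s][pi(s)].items():
                nxt[s2] += p * q
        dist = nxt
    total += sum(p * c_T(s) for s, p in dist.items())
    return total
```

With T = 0 this reduces to the expected terminal cost alone, as the formula says it should.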

Theorem 1 (Theorem Idea: A Policy Induces A Controlled Stochastic Process) Once a policy \pi is fixed, the controlled model induces a probability law on trajectories

\[ (S_0,A_0,S_1,A_1,\dots). \]

So the policy is what converts the abstract model into an actual stochastic system with a measurable objective value.
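Sampling from that induced law is mechanical once the policy is fixed: alternate "apply the policy" with "draw the next state". A minimal sketch, again on a hypothetical two-state model:

```python
import random

# Hypothetical model and policy, as placeholders.
P = {
    0: {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}},
    1: {0: {0: 0.2, 1: 0.8}, 1: {0: 0.5, 1: 0.5}},
}
def pi(s):
    return 1 if s == 1 else 0

def sample_next(s, a, rng):
    """Draw S_{t+1} from P(. | s, a)."""
    u, acc = rng.random(), 0.0
    for s2, p in P[s][a].items():
        acc += p
        if u < acc:
            return s2
    return s2  # guard against floating-point round-off

def rollout(s0, T, seed=0):
    """Sample one trajectory (S_0, A_0, S_1, A_1, ..., S_T) under the fixed policy."""
    rng = random.Random(seed)
    s, traj = s0, []
    for _ in range(T):
        a = pi(s)
        traj.append((s, a))
        s = sample_next(s, a, rng)
    traj.append((s, None))  # terminal state carries no action
    return traj
```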

7 Worked Example

Consider a queue with current length S_t.

At each time:

  • a new customer arrives with some probability
  • the controller chooses A_t in {slow, fast}
  • the fast mode serves more aggressively but may incur larger operating cost

This is a controlled Markov model because:

  • the state is the current queue length
  • the action is the chosen service mode
  • the next-state distribution depends on arrival randomness and service success
  • a cost might combine queue length and service effort

A natural policy question is:

when should we switch from slow service to fast service?

That already has the shape of a real stochastic control problem:

  • current congestion matters
  • actions change the transition probabilities
  • one-step choices trade future congestion against current cost
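The queue example above can be simulated in a few lines. The arrival probability, service probabilities, and operating costs below are illustrative numbers chosen for the sketch, and the threshold policy is one hypothetical answer to the switching question:

```python
import random

# Illustrative parameters (not from the text): Bernoulli arrivals, and a
# service attempt that succeeds with mode-dependent probability.
P_ARRIVE = 0.5
P_SERVE = {"slow": 0.4, "fast": 0.9}
OP_COST = {"slow": 0.0, "fast": 2.0}

def threshold_policy(s, k=3):
    """Switch to fast service once the queue length reaches k."""
    return "fast" if s >= k else "slow"

def simulate(policy, T=10_000, seed=0):
    """Average per-step cost (holding cost + operating cost) under the policy."""
    rng = random.Random(seed)
    s, total_cost = 0, 0.0
    for _ in range(T):
        a = policy(s)
        total_cost += s + OP_COST[a]              # queue length + service effort
        if s > 0 and rng.random() < P_SERVE[a]:
            s -= 1                                # one customer served
        if rng.random() < P_ARRIVE:
            s += 1                                # one customer arrives
    return total_cost / T
```

Varying the threshold k and comparing the resulting average costs is exactly the "when to switch" question posed above, answered by brute-force simulation rather than dynamic programming.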

8 Computation Lens

When you see a sequential decision problem, ask:

  1. what is the state, and does it really capture the needed memory?
  2. what choices are actions versus what enters as random disturbance?
  3. what class of policies is being optimized over?
  4. is the objective finite-horizon, discounted, average-cost, or stopping-based?
  5. if the policy were fixed, what stochastic process would it induce?

Those questions usually reveal the real mathematical skeleton of the problem.

9 Application Lens

9.1 Control Under Uncertainty

This is the stochastic extension of state-space control: actions steer a system whose next state is random.

9.2 Planning And Operations

Inventory, queues, pricing, resource allocation, and maintenance all fit naturally into the same template.

9.3 RL

RL often starts from the same MDP object, but then shifts attention toward learning good policies from data, samples, or interaction rather than assuming the full model is known.

10 Stop Here For First Pass

If you stop here, retain these five ideas:

  • an MDP is a controlled stochastic state process
  • the Markov state is a sufficient summary of the past for predicting the next step under the current action
  • a policy is a decision rule over time, not a one-shot choice
  • the action changes the transition law, not only the immediate reward
  • the objective is a trajectory-level reward or cost functional

That is enough to read later Bellman and dynamic-programming pages without getting lost in notation.
