Partial Observability, Belief States, and RL/Control Bridges
partial observability, belief state, POMDP, filtering, reinforcement learning
1 Role
This is the seventh page of the Stochastic Control and Dynamic Programming module.
Its job is to explain what changes when the controller does not directly observe the state.
That is the missing final step in the first-pass spine:
- first we planned with known states
- then we handled stochastic dynamics
- now we handle hidden state and noisy observations
This is where MDP reasoning becomes POMDP reasoning.
2 First-Pass Promise
Read this page after Stochastic Linear Systems, LQG, and the Separation Principle.
If you already read Continuous-Time Stochastic Control and Hamilton-Jacobi-Bellman Intuition, that helps, but it is not a hard prerequisite for the hidden-state story.
If you stop here, you should still understand:
- why partial observability changes the decision problem
- what a belief state is
- why belief states restore a Markov-style planning viewpoint
- where control and RL meet in partially observed systems
3 Why It Matters
Many decision systems do not reveal the true state directly.
We may only see:
- noisy sensor readings
- delayed outputs
- partial measurements
- proxies rather than the latent variable we care about
So choosing actions from the raw observation alone can be fundamentally insufficient.
The system may look ambiguous:
- the same observation can correspond to different hidden states
- different hidden states may require different actions
The right object is then not the hidden state itself, because we do not know it.
It is the belief about the hidden state.
That belief becomes the new information state for planning.
4 Prerequisite Recall
- in an MDP, the current state is enough to summarize all relevant past information
- Kalman filtering already showed one structured example where noisy observations are turned into a state estimate
- stochastic control introduced value functions and Bellman-style planning under uncertainty
- probability and statistics matter because beliefs are conditional distributions updated by evidence
5 Intuition
5.1 Observation Is Not State
In a partially observed problem, the controller receives an observation o_t, not the hidden state s_t.
So the immediate question is:
what should the agent remember from the past in order to act well now?
The answer is usually not the last observation by itself.
5.2 The Belief State Summarizes What We Know
A belief state is the conditional distribution of the hidden state given everything observed so far.
At first pass, read it as:
the current posterior over the hidden state
So instead of planning in the original state space, we plan in a space of beliefs.
5.3 Belief Dynamics Restore A Markov Story
This is the key structural idea.
Although the hidden state is not observed, the belief state evolves by a recursive update using:
- previous belief
- chosen action
- new observation
That means the belief state itself is Markov.
So the partially observed problem can be recast as a fully observed decision problem on belief space.
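To make the recursion concrete, here is a minimal sketch of a discrete Bayes filter in Python. The array layout (T[a, s, s'] for transitions, Z[a, s', o] for observation likelihoods) is an illustrative assumption, not a fixed standard:

```python
import numpy as np

def belief_update(b, a, o, T, Z):
    """One step of the recursive belief update over a finite state space.

    b : (S,)      current belief, b[s] = P(state = s | history)
    a : int       action just taken
    o : int       observation just received
    T : (A, S, S) transition law, T[a, s, s'] = P(s' | s, a)
    Z : (A, S, O) observation law, Z[a, s', o] = P(o | s', a)
    """
    predicted = T[a].T @ b            # predict: P(s' | history, a)
    unnorm = Z[a][:, o] * predicted   # correct: weight by likelihood of o
    return unnorm / unnorm.sum()      # normalize (assumes o has nonzero likelihood)
```

The function consumes only the previous belief, the chosen action, and the new observation, plus the fixed model. That signature is exactly the Markov property on belief space.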
5.4 This Is Where Control Meets RL Most Directly
In classical control, the bridge is:
- estimate hidden state
- act using that estimate
In RL and planning, the bridge is:
- maintain a belief or latent state
- optimize future behavior from that information state
Both are trying to solve the same problem:
act well when the true state is not directly available
6 Formal Core
Definition 1 (Definition: POMDP) A partially observed Markov decision process includes:
- a hidden state S_t
- an action A_t
- a state transition law
- an observation O_t
- an observation law conditioned on the hidden state
- a reward or cost rule
The hidden state drives the system, but the decision-maker only sees observations.
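For finite spaces, one convenient way to hold these ingredients in memory is a small container; the array layout below is an assumption made for the sketches on this page, not a standard interface:

```python
from typing import NamedTuple
import numpy as np

class POMDP(NamedTuple):
    """Finite POMDP as dense arrays (illustrative field names)."""
    T: np.ndarray   # (A, S, S) transition law,  T[a, s, s'] = P(s' | s, a)
    Z: np.ndarray   # (A, S, O) observation law, Z[a, s', o] = P(o | s', a)
    R: np.ndarray   # (S, A)    reward (or negated cost) for each state-action
    gamma: float    # discount factor
```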
Definition 2 (Definition: Belief State) The belief state b_t is the conditional distribution of S_t given the action-observation history up to time t.
At first pass, it is enough to remember:
belief = posterior over hidden state
Theorem 1 (Theorem Idea: Belief Update) Given the previous belief, the chosen action, and the new observation, the next belief is determined by a Bayesian filtering update.
So the belief evolves recursively, just like a state estimate.
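Concretely, for a finite state space the update has a predict-then-correct form:
\[ b_{t+1}(s') \;\propto\; P(o_{t+1} \mid s', a_t) \sum_{s} P(s' \mid s, a_t)\, b_t(s) \]
The hidden proportionality constant is P(o_{t+1} | b_t, a_t); dividing by it makes the next belief a proper distribution. The Kalman filter is the linear-Gaussian special case of this same recursion.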
Theorem 2 (Theorem Idea: Belief-State Reduction) The partially observed problem can be rewritten as a fully observed control problem on the belief state.
That is the structural reason dynamic programming still applies.
Theorem 3 (Theorem Idea: Bellman Equation On Belief Space) If we treat the belief b as the state, then the optimal value function satisfies a Bellman equation over beliefs:
\[ V(b)=\min_a \left\{ c(b,a) + \mathbb{E}[V(b') \mid b,a] \right\} \]
or the reward-maximizing analog.
At first pass, the point is not the exact formula.
It is that Bellman reasoning survives, but now on a harder state space: the space of distributions.
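As a first-pass illustration of that Bellman equation (written in its reward-maximizing form), here is a hedged sketch of value iteration for a two-hidden-state POMDP, using the container and belief update sketched above. The belief simplex is reduced to the single number p = b(state 0), gridded over [0, 1], and V at the updated belief is read off the nearest grid point. Serious POMDP solvers exploit the piecewise-linear convex structure of the value function instead; this grid approximation only shows the shape of the recursion:

```python
import numpy as np

def belief_value_iteration(pomdp, n_grid=101, n_iters=200):
    """Grid-based value iteration on the belief space of a 2-state POMDP."""
    T, Z, R, gamma = pomdp
    A, S, O = Z.shape
    grid = np.linspace(0.0, 1.0, n_grid)   # p = b(state 0)
    V = np.zeros(n_grid)

    for _ in range(n_iters):
        V_new = np.empty(n_grid)
        for i, p in enumerate(grid):
            b = np.array([p, 1.0 - p])
            q_best = -np.inf
            for a in range(A):
                q = b @ R[:, a]                    # expected immediate reward
                predicted = T[a].T @ b             # P(s' | b, a)
                for o in range(O):
                    p_o = Z[a][:, o] @ predicted   # P(o | b, a)
                    if p_o < 1e-12:
                        continue                   # observation impossible here
                    b_next = Z[a][:, o] * predicted / p_o      # Bayes update
                    j = np.argmin(np.abs(grid - b_next[0]))    # nearest grid point
                    q += gamma * p_o * V[j]
                q_best = max(q_best, q)
            V_new[i] = q_best
        V = V_new
    return grid, V
```

The inner loop is the Bellman equation verbatim: expected immediate reward, plus a discounted expectation of V at the updated belief, weighted by how likely each observation is.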
7 Worked Example
Imagine a robot moving in a corridor with two hidden locations:
- Left
- Right
It can:
- move
- observe a noisy beacon
Suppose the beacon reading is imperfect:
- the same reading may occur in both locations
- but with different probabilities
If the robot only uses the latest reading, it may oscillate or make inconsistent choices.
If instead it keeps a belief
\[ b_t(\text{Left}),\quad b_t(\text{Right}), \]
then each action and observation updates that belief.
Planning can now use:
- immediate action cost
- expected future cost under the updated belief
So the robot is not planning over the true hidden location directly.
It is planning over its current uncertainty about location.
That is the first-pass heart of POMDP reasoning.
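A hedged simulation of this story, reusing the belief_update sketch from 5.3; every number below (beacon accuracies, move reliability, the agent's fixed "stay and listen" policy) is invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative corridor model: states 0 = Left, 1 = Right.
T = np.array([[[0.10, 0.90], [0.90, 0.10]],    # action 0: move (flips w.p. 0.9)
              [[0.95, 0.05], [0.05, 0.95]]])   # action 1: stay
Z_loc = np.array([[0.8, 0.2],                  # Left:  P(blip), P(quiet)
                  [0.3, 0.7]])                 # Right: P(blip), P(quiet)
Z = np.stack([Z_loc, Z_loc])                   # beacon ignores the action

b = np.array([0.5, 0.5])   # agent starts fully uncertain
s = 0                      # true location: Left (hidden from the agent)
for t in range(5):
    a = 1                                   # stay and listen
    s = rng.choice(2, p=T[a, s])            # hidden state evolves
    o = rng.choice(2, p=Z[a, s])            # noisy beacon reading
    b = belief_update(b, a, o, T, Z)        # agent tracks its posterior
    print(f"t={t}  obs={'blip' if o == 0 else 'quiet'}  b(Left)={b[0]:.3f}")
```

Run a few steps and b(Left) drifts toward 1 as blips accumulate, even though no single reading is conclusive. That drift is the belief doing the remembering that the raw observation cannot.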
8 Computation Lens
When you meet a partially observed decision problem, ask:
- what is hidden?
- what is actually observed?
- what belief or filtered estimate summarizes the useful past?
- is the method doing exact belief updates, approximate filtering, or learned latent-state tracking?
- is the control or RL method planning in the original state space, the belief space, or a learned surrogate?
Those questions usually reveal whether a paper is doing classical POMDP reasoning, structured filtering and control, or modern approximate RL.
9 Application Lens
9.1 RL And Sequential Inference
Many RL problems are effectively partially observed, so memory, filtering, latent-state estimation, and belief tracking become central rather than optional.
10 Stop Here For First Pass
If you stop here, retain these five ideas:
- partial observability means the controller does not directly know the true state
- the right replacement object is the belief state
- the belief state is a posterior distribution updated from actions and observations
- belief states restore a Markov planning viewpoint
- this is a core meeting point of filtering, control, planning, and RL
11 Go Deeper
The strongest adjacent live pages are:
- Stochastic Control and Dynamic Programming
- Stochastic Linear Systems, LQG, and the Separation Principle
- Estimation, Kalman Filtering, and the Separation Principle
- Learning-Based Control, System Identification, and RL Bridges
- Probability
12 Optional Deeper Reading After First Pass
- Stanford AA228 / CS238 - official current course page for decision making under uncertainty with MDP, POMDP, and RL-facing framing. Checked 2026-04-25.
- AA228/CS238 solutions: State Uncertainty - official course material with direct explanations of POMDPs, belief states, and belief updates. Checked 2026-04-25.
- Decision Making Under Uncertainty text - official Stanford-hosted text with explicit sections on POMDPs, belief-state MDPs, and belief updating. Checked 2026-04-25.
- POMDP slides - official Stanford-hosted slides focused directly on POMDPs and belief states. Checked 2026-04-25.
- Stanford MS&E 235A / EE 283: Markov Decision Processes - official current course page for the broader MDP and dynamic-programming arc. Checked 2026-04-25.
13 Sources and Further Reading
- Stanford AA228 / CS238 - First pass - official current decision-making-under-uncertainty course page with POMDP and RL-facing framing. Checked 2026-04-25.
- AA228/CS238 solutions: State Uncertainty - First pass - official course material with direct belief-state and belief-update explanations. Checked 2026-04-25.
- Decision Making Under Uncertainty text - First pass - official Stanford-hosted text with explicit POMDP and belief-state sections. Checked 2026-04-25.
- POMDP slides - First pass - official Stanford-hosted slides focused directly on partial observability and belief states. Checked 2026-04-25.
- Stanford MS&E 235A / EE 283: Markov Decision Processes - Second pass - official current course page for the broader MDP and dynamic-programming context. Checked 2026-04-25.