Partial Observability, Belief States, and RL/Control Bridges
partial observability, belief state, POMDP, filtering, reinforcement learning
1 Role
This is the seventh page of the Stochastic Control and Dynamic Programming module.
Its job is to explain what changes when the controller does not directly observe the state.
That is the missing final step in the first-pass spine:
- first we planned with known states
- then we handled stochastic dynamics
- now we handle hidden state and noisy observations
This is where MDP reasoning becomes POMDP reasoning.
2 First-Pass Promise
Read this page after Stochastic Linear Systems, LQG, and the Separation Principle.
If you already read Continuous-Time Stochastic Control and Hamilton-Jacobi-Bellman Intuition, that helps, but it is not a hard prerequisite for the hidden-state story.
If you stop here, you should still understand:
- why partial observability changes the decision problem
- what a belief state is
- why belief states restore a Markov-style planning viewpoint
- where control and RL meet in partially observed systems
3 Why It Matters
Many decision systems do not reveal the true state directly.
We may only see:
- noisy sensor readings
- delayed outputs
- partial measurements
- proxies rather than the latent variable we care about
So choosing actions from the raw observation alone can be fundamentally insufficient.
The system may look ambiguous:
- the same observation can correspond to different hidden states
- different hidden states may require different actions
The right object is then not the hidden state itself, because we do not know it.
It is the belief about the hidden state.
That belief becomes the new information state for planning.
4 Prerequisite Recall
- in an MDP, the current state is enough to summarize all relevant past information
- Kalman filtering already showed one structured example where noisy observations are turned into a state estimate
- stochastic control introduced value functions and Bellman-style planning under uncertainty
- probability and statistics matter because beliefs are conditional distributions updated by evidence
5 Intuition
5.1 Observation Is Not State
In a partially observed problem, the controller receives an observation o_t, not the hidden state s_t.
So the immediate question is:
what should the agent remember from the past in order to act well now?
The answer is usually not the last observation by itself.
5.2 The Belief State Summarizes What We Know
A belief state is the conditional distribution of the hidden state given everything observed so far.
At first pass, read it as:
the current posterior over the hidden state
So instead of planning in the original state space, we plan in a space of beliefs.
5.3 Belief Dynamics Restore A Markov Story
This is the key structural idea.
Although the hidden state is not observed, the belief state evolves by a recursive update using:
- previous belief
- chosen action
- new observation
That means the belief state itself is Markov.
So the partially observed problem can be recast as a fully observed decision problem on belief space.
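To make the recursion concrete, here is a minimal sketch of a discrete Bayes filter in Python. The array layout (T[a, s, s'] for transitions, Z[a, s', o] for observation likelihoods) is an illustrative assumption, not a fixed standard:

```python
import numpy as np

def belief_update(b, a, o, T, Z):
    """One step of the recursive belief update over a finite state space.

    b : (S,)      current belief, b[s] = P(state = s | history)
    a : int       action just taken
    o : int       observation just received
    T : (A, S, S) transition law, T[a, s, s'] = P(s' | s, a)
    Z : (A, S, O) observation law, Z[a, s', o] = P(o | s', a)
    """
    predicted = T[a].T @ b            # predict: P(s' | history, a)
    unnorm = Z[a][:, o] * predicted   # correct: weight by likelihood of o
    return unnorm / unnorm.sum()      # normalize (assumes o has nonzero likelihood)
```

The function consumes only the previous belief, the chosen action, and the new observation, plus the fixed model. That signature is exactly the Markov property on belief space.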
5.4 This Is Where Control Meets RL Most Directly
In classical control, the bridge is:
- estimate hidden state
- act using that estimate
In RL and planning, the bridge is:
- maintain a belief or latent state
- optimize future behavior from that information state
Both are trying to solve the same problem:
act well when the true state is not directly available
6 Formal Core
Definition 1 (Definition: POMDP) A partially observed Markov decision process includes:
- a hidden state S_t
- an action A_t
- a state transition law
- an observation O_t
- an observation law conditioned on the hidden state
- a reward or cost rule
The hidden state drives the system, but the decision-maker only sees observations.
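For finite spaces, one convenient way to hold these ingredients in memory is a small container; the array layout below is an assumption made for the sketches on this page, not a standard interface:

```python
from typing import NamedTuple
import numpy as np

class POMDP(NamedTuple):
    """Finite POMDP as dense arrays (illustrative field names)."""
    T: np.ndarray   # (A, S, S) transition law,  T[a, s, s'] = P(s' | s, a)
    Z: np.ndarray   # (A, S, O) observation law, Z[a, s', o] = P(o | s', a)
    R: np.ndarray   # (S, A)    reward (or negated cost) for each state-action
    gamma: float    # discount factor
```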
Definition 2 (Definition: Belief State) The belief state b_t is the conditional distribution of S_t given the action-observation history up to time t.
At first pass, it is enough to remember:
belief = posterior over hidden state
Theorem 1 (Theorem Idea: Belief Update) Given the previous belief, the chosen action, and the new observation, the next belief is determined by a Bayesian filtering update.
So the belief evolves recursively, just like a state estimate.
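Concretely, for a finite state space the update has a predict-then-correct form:
\[ b_{t+1}(s') \;\propto\; P(o_{t+1} \mid s', a_t) \sum_{s} P(s' \mid s, a_t)\, b_t(s) \]
The hidden proportionality constant is P(o_{t+1} | b_t, a_t); dividing by it makes the next belief a proper distribution. The Kalman filter is the linear-Gaussian special case of this same recursion.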
Theorem 2 (Theorem Idea: Belief-State Reduction) The partially observed problem can be rewritten as a fully observed control problem on the belief state.
That is the structural reason dynamic programming still applies.
Theorem 3 (Theorem Idea: Bellman Equation On Belief Space) If we treat the belief b as the state, then the optimal value function satisfies a Bellman equation over beliefs:
\[ V(b)=\min_a \left\{ c(b,a) + \mathbb{E}[V(b') \mid b,a] \right\} \]
or the reward-maximizing analog.
At first pass, the point is not the exact formula.
It is that Bellman reasoning survives, but now on a harder state space: the space of distributions.
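As a first-pass illustration of that Bellman equation (written in its reward-maximizing form), here is a hedged sketch of value iteration for a two-hidden-state POMDP, using the container and belief update sketched above. The belief simplex is reduced to the single number p = b(state 0), gridded over [0, 1], and V at the updated belief is read off the nearest grid point. Serious POMDP solvers exploit the piecewise-linear convex structure of the value function instead; this grid approximation only shows the shape of the recursion:

```python
import numpy as np

def belief_value_iteration(pomdp, n_grid=101, n_iters=200):
    """Grid-based value iteration on the belief space of a 2-state POMDP."""
    T, Z, R, gamma = pomdp
    A, S, O = Z.shape
    grid = np.linspace(0.0, 1.0, n_grid)   # p = b(state 0)
    V = np.zeros(n_grid)

    for _ in range(n_iters):
        V_new = np.empty(n_grid)
        for i, p in enumerate(grid):
            b = np.array([p, 1.0 - p])
            q_best = -np.inf
            for a in range(A):
                q = b @ R[:, a]                    # expected immediate reward
                predicted = T[a].T @ b             # P(s' | b, a)
                for o in range(O):
                    p_o = Z[a][:, o] @ predicted   # P(o | b, a)
                    if p_o < 1e-12:
                        continue                   # observation impossible here
                    b_next = Z[a][:, o] * predicted / p_o      # Bayes update
                    j = np.argmin(np.abs(grid - b_next[0]))    # nearest grid point
                    q += gamma * p_o * V[j]
                q_best = max(q_best, q)
            V_new[i] = q_best
        V = V_new
    return grid, V
```

The inner loop is the Bellman equation verbatim: expected immediate reward, plus a discounted expectation of V at the updated belief, weighted by how likely each observation is.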
7 Worked Example
Imagine a robot moving in a corridor with two hidden locations:
- Left
- Right
It can:
- move
- observe a noisy beacon
Suppose the beacon reading is imperfect:
- the same reading may occur in both locations
- but with different probabilities
If the robot only uses the latest reading, it may oscillate or make inconsistent choices.
If instead it keeps a belief
\[ b_t(\text{Left}),\quad b_t(\text{Right}), \]
then each action and observation updates that belief.
Planning can now use:
- immediate action cost
- expected future cost under the updated belief
So the robot is not planning over the true hidden location directly.
It is planning over its current uncertainty about location.
That is the first-pass heart of POMDP reasoning.
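A hedged simulation of this story, reusing the belief_update sketch from 5.3; every number below (beacon accuracies, move reliability, the agent's fixed "stay and listen" policy) is invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative corridor model: states 0 = Left, 1 = Right.
T = np.array([[[0.10, 0.90], [0.90, 0.10]],    # action 0: move (flips w.p. 0.9)
              [[0.95, 0.05], [0.05, 0.95]]])   # action 1: stay
Z_loc = np.array([[0.8, 0.2],                  # Left:  P(blip), P(quiet)
                  [0.3, 0.7]])                 # Right: P(blip), P(quiet)
Z = np.stack([Z_loc, Z_loc])                   # beacon ignores the action

b = np.array([0.5, 0.5])   # agent starts fully uncertain
s = 0                      # true location: Left (hidden from the agent)
for t in range(5):
    a = 1                                   # stay and listen
    s = rng.choice(2, p=T[a, s])            # hidden state evolves
    o = rng.choice(2, p=Z[a, s])            # noisy beacon reading
    b = belief_update(b, a, o, T, Z)        # agent tracks its posterior
    print(f"t={t}  obs={'blip' if o == 0 else 'quiet'}  b(Left)={b[0]:.3f}")
```

Run a few steps and b(Left) drifts toward 1 as blips accumulate, even though no single reading is conclusive. That drift is the belief doing the remembering that the raw observation cannot.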
8 Computation Lens
When you meet a partially observed decision problem, ask:
- what is hidden?
- what is actually observed?
- what belief or filtered estimate summarizes the useful past?
- is the method doing exact belief updates, approximate filtering, or learned latent-state tracking?
- is the control or RL method planning in the original state space, the belief space, or a learned surrogate?
Those questions usually reveal whether a paper is doing classical POMDP reasoning, structured filtering and control, or modern approximate RL.
9 Application Lens
9.1 RL And Sequential Inference
Many RL problems are effectively partially observed, so memory, filtering, latent-state estimation, and belief tracking become central rather than optional.
10 Stop Here For First Pass
If you stop here, retain these five ideas:
- partial observability means the controller does not directly know the true state
- the right replacement object is the belief state
- the belief state is a posterior distribution updated from actions and observations
- belief states restore a Markov planning viewpoint
- this is a core meeting point of filtering, control, planning, and RL
11 Go Deeper
The strongest adjacent live pages are:
- Stochastic Control and Dynamic Programming
- Stochastic Linear Systems, LQG, and the Separation Principle
- Estimation, Kalman Filtering, and the Separation Principle
- Learning-Based Control, System Identification, and RL Bridges
- Probability
12 Optional Deeper Reading After First Pass
- Stanford AA228 / CS238 - official current course page for decision making under uncertainty with MDP, POMDP, and RL-facing framing. Checked 2026-04-25.
- AA228/CS238 solutions: State Uncertainty - official course material with direct explanations of POMDPs, belief states, and belief updates. Checked 2026-04-25.
- Decision Making Under Uncertainty text - official Stanford-hosted text with explicit sections on POMDPs, belief-state MDPs, and belief updating. Checked 2026-04-25.
- POMDP slides - official Stanford-hosted slides focused directly on POMDPs and belief states. Checked 2026-04-25.
- Stanford MS&E 235A / EE 283: Markov Decision Processes - official current course page for the broader MDP and dynamic-programming arc. Checked 2026-04-25.
13 Sources and Further Reading
- Stanford AA228 / CS238 - First pass - official current decision-making-under-uncertainty course page with POMDP and RL-facing framing. Checked 2026-04-25.
- AA228/CS238 solutions: State Uncertainty - First pass - official course material with direct belief-state and belief-update explanations. Checked 2026-04-25.
- Decision Making Under Uncertainty text - First pass - official Stanford-hosted text with explicit POMDP and belief-state sections. Checked 2026-04-25.
- POMDP slides - First pass - official Stanford-hosted slides focused directly on partial observability and belief states. Checked 2026-04-25.
- Stanford MS&E 235A / EE 283: Markov Decision Processes - Second pass - official current course page for the broader MDP and dynamic-programming context. Checked 2026-04-25.