Markov Decision Processes and Policy Evaluation
Markov Decision Process is the threotical foundation of reincement learning. This installment introduces the foundamental settings of Reinforcement Learning.
The Complete Decision Process
- We'll first formulate the complete decision process of RL problems.
Markov Decision Process
RL essentially seek to answer the question that how an agent learns to interact with the environment given certain rewards. Concretely, we'll use the Markov Decision Process model
Where is the state measurable space, is the state space, is a sigma algebra on . is the action measurable space, is the action space, is a sigma algebra on . is the state transition kernel, also named dynamics model, for it characterizing the dynamics of states. For any fixed , is a probability measure on , which represents the distribution of the next state when the environment is at state and the agent takse action . And for fixed , the function is -measurable. We further require that
be Markovian, that is for any history and , we have is the reward function on , also called reward model. For any fixed , represents the reward that the agent gains when taking action at state . Sometimes the reward function also depends on the next state , in this case will be on . is a discount factor used in the computation of returns, which we'll define later.
Policy
It could be seen that MDP defines the environment and interactions. We further define the policy of the agent
- The policy
is a probability measure on , which represents the distribution of the action that the agent would take at state .
In this way, we can define the complete decision process that: First an initial state
is given, then the agent takes action and gains reward , and the environment evolves to state , then repeats. The process can be shown as below Note that we actually impose Markov assumptions in the dynamics, action and reward models, though some are implicite. We require Markov property, without which the problem would be to too hard to handle. In addition, in some problems there might be different action spaces for different states, but we generally use a unified action space setting. - The policy
Optimization of RL
- Someone says RL has different optimization aspect from ordinary
supervised learning where typically a scalar function is minimized. In
RL we instead seek to find an optimal policy, which might be more
complicated. However I still think there should be no foundamental
difference, because the modeling will be written as a rigorous
optimization problem anyway. Concretely, there are
main questions- How to define the optimality of a policy.
- How to solve for the optimal policy.
Evaluation of policies
- Evaluation in RL particularly means to evaluate how good a given policy is. We'll characterize this using Markov Reward Process.
Markov Reward Process
MRP is basically an MDP with a given policy, which is defined as
Where is the state measurable space. is the state transition kernel. For any , is a probability measure on , representing the distribution of next state given the current state . Compared with the dynamics model in MDP, is basically marginalized using policy , written as is the reward function on . Compared with the reward function in MDP, is basically marginalized using policy , written as is the dicount factor, same as MDP.
MRPs differ from MDPs in that they don't involve actions
. And the transition kernel and reward function are marginalized using policy . This may sound as though some information is lost when taking expectations, but in fact we'll soon see that only the expectations are needed for the analysis. The state measurable space and the state transition kernel together form a Markov chain .
Return, Value function and Q-function
- Most of the time we evaluate a policy using MRP, except for
Q-function which depends on actions. Concretely, MRP is basically a
Markov chain with reward functions. Here we set an initial state
distribution
, and let . In the perspective of stochastic process, MRP gives a infinite trajectory of random variables Where and . Sometimes we are interested in the problems with finite episode with horizon 1, like a fixed horizon or certain stopping time , written as Such a episode has exactly transitions and rewards.
The return
is the discounted sum of future rewards starting from a given time . The return of an MRP is defined as For a finite-horizon episode, the return is defined as Note that itself is a random variable. People may ask why is there a discount factor . Naturally, we hope to evaluate an episode by the sum over rewards. But when horizons are infinite, returns may easily diverge. Instead if we impose a discount factor , and assume the reward function is uniformly bounded, returns would have riogous convergence guarantee.The value function
is the the expected return conditional on the current state. For an infinite-horizon time-homogeneous MRP, is defined as In this case is not time-dependent. Note that in our setting MDP and MRP are always time-homogenous, though time-inhomogeneous MDPs are also allowed in more general settings.The Q-function2
is the expected return conditional on the current state and action. For an infinite-horizon time-homogeneous MDP, is defined as The Q-function differs from the value function in that it additionally conditions on the action.
- 标题: Markov Decision Processes and Policy Evaluation
- 作者: RPChe_
- 创建于 : 2026-07-03 00:00:00
- 更新于 : 2026-07-05 02:39:21
- 链接: https://rpche-6626.github.io/2026/07/03/RL/MDP/
- 版权声明: 本文章采用 CC BY-NC-SA 4.0 进行许可。