Markov Decision Process is the threotical foundation of reincement learning. This installment introduces the foundamental settings of Reinforcement Learning.

The Complete Decision Process

We'll first formulate the complete decision process of RL problems.

Markov Decision Process

RL essentially seek to answer the question that how an agent learns to interact with the environment given certain rewards. Concretely, we'll use the Markov Decision Process model Where
- is the state measurable space, is the state space, is a sigma algebra on .
- is the action measurable space, is the action space, is a sigma algebra on .
- is the state transition kernel, also named dynamics model, for it characterizing the dynamics of states. For any fixed , is a probability measure on , which represents the distribution of the next state when the environment is at state and the agent takse action . And for fixed , the function is -measurable.
  
  We further require that be Markovian, that is for any history and , we have
- is the reward function on , also called reward model. For any fixed , represents the reward that the agent gains when taking action at state . Sometimes the reward function also depends on the next state , in this case will be on .
- is a discount factor used in the computation of returns, which we'll define later.

Policy

It could be seen that MDP defines the environment and interactions. We further define the policy of the agent
- The policy is a probability measure on , which represents the distribution of the action that the agent would take at state .
In this way, we can define the complete decision process that: First an initial state is given, then the agent takes action and gains reward , and the environment evolves to state , then repeats. The process can be shown as below Note that we actually impose Markov assumptions in the dynamics, action and reward models, though some are implicite. We require Markov property, without which the problem would be to too hard to handle. In addition, in some problems there might be different action spaces for different states, but we generally use a unified action space setting.

Optimization of RL

Someone says RL has different optimization aspect from ordinary supervised learning where typically a scalar function is minimized. In RL we instead seek to find an optimal policy, which might be more complicated. However I still think there should be no foundamental difference, because the modeling will be written as a rigorous optimization problem anyway. Concretely, there are main questions
1. How to define the optimality of a policy.
2. How to solve for the optimal policy.
In this installment we'll answer the first question. It is worth mentioning that in RL we also refer to "optimization" as "control". Because if we consider the optimization as a step-wise problem, we have to control each movement of the agent. I think this perspective may originate from robotics.

Evaluation of policies

Evaluation in RL particularly means to evaluate how good a given policy is. We'll characterize this using Markov Reward Process.

Markov Reward Process

MRP is basically an MDP with a given policy, which is defined as Where
- is the state measurable space.
- is the state transition kernel. For any , is a probability measure on , representing the distribution of next state given the current state . Compared with the dynamics model in MDP, is basically marginalized using policy , written as
- is the reward function on . Compared with the reward function in MDP, is basically marginalized using policy , written as
- is the dicount factor, same as MDP.
MRPs differ from MDPs in that they don't involve actions . And the transition kernel and reward function are marginalized using policy . This may sound as though some information is lost when taking expectations, but in fact we'll soon see that only the expectations are needed for the analysis. The state measurable space and the state transition kernel together form a Markov chain .

Return, Value function and Q-function

Most of the time we evaluate a policy using MRP, except for Q-function which depends on actions. Concretely, MRP is basically a Markov chain with reward functions. Here we set an initial state distribution , and let . In the perspective of stochastic process, MRP gives a infinite trajectory of random variables Where and . Sometimes we are interested in the problems with finite episode with horizon ¹, like a fixed horizon or certain stopping time , written as Such a episode has exactly transitions and rewards.

The return is the discounted sum of future rewards starting from a given time . The return of an MRP is defined as For a finite-horizon episode, the return is defined as Note that itself is a random variable. People may ask why is there a discount factor . Naturally, we hope to evaluate an episode by the sum over rewards. But when horizons are infinite, returns may easily diverge. Instead if we impose a discount factor , and assume the reward function is uniformly bounded, returns would have riogous convergence guarantee.
The value function is the the expected return conditional on the current state. For an infinite-horizon time-homogeneous MRP, is defined as In this case is not time-dependent. Note that in our setting MDP and MRP are always time-homogenous, though time-inhomogeneous MDPs are also allowed in more general settings.
The Q-function² is the expected return conditional on the current state and action. For an infinite-horizon time-homogeneous MDP, is defined as The Q-function differs from the value function in that it additionally conditions on the action.

Episode and Horizon are usual terms in stochastic process. An episode is one realization of the process from an initial state until termination. The horizon is the number of transition steps, as well as the length of the episode.↩︎
Q may stand for "Quality".↩︎

Markov Decision Processes and Policy Evaluation

The Complete Decision Process

Markov Decision Process

Policy

Optimization of RL

Evaluation of policies

Markov Reward Process

Return, Value function and Q-function