Reinforcement Learning/Markov Decision Process

A Markov Decision Process (MDP) is a Markov chain + a reward function + actions.

A Markov Decision Process reduces to a Markov reward process by choosing a "policy" $$\pi(s)$$ that specifies the action taken in each state.

Definition
A Markov decision process is a 4-tuple $$(S, A, P_a, R_a)$$, where


 * $$S$$ is a finite set of states,
 * $$A$$ is a finite set of actions (alternatively, $$A_s$$ is the finite set of actions available from state $$s$$),
 * $$P_a(s,s') = \text{Pr}(s_{t+1} = s' | s_t = s, a_t = a)$$ is the probability that action $$a$$ in state $$s$$ at time $$t$$ will lead to state $$s'$$ at time $$t+1$$,
 * $$R_a(s,s')$$ is the immediate reward (or expected immediate reward) received after transitioning from state $$s$$ to state $$s'$$, due to action $$a$$

(Note: The theory of Markov decision processes does not state that $$S$$ or $$A$$ are finite, but the basic algorithms below assume that they are finite.)
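The 4-tuple above can be sketched directly as data. Below is a minimal sketch for a hypothetical two-state, two-action MDP (the state names, transition probabilities, and rewards are made up for illustration):

```python
import random

S = ["s0", "s1"]       # finite set of states
A = ["stay", "go"]     # finite set of actions

# P[a][s][s'] = Pr(s_{t+1} = s' | s_t = s, a_t = a)
P = {
    "stay": {"s0": {"s0": 0.9, "s1": 0.1}, "s1": {"s0": 0.1, "s1": 0.9}},
    "go":   {"s0": {"s0": 0.2, "s1": 0.8}, "s1": {"s0": 0.8, "s1": 0.2}},
}

# R[a][s][s'] = immediate reward after moving from s to s' under action a
R = {
    "stay": {"s0": {"s0": 0.0, "s1": 0.0}, "s1": {"s0": 0.0, "s1": 1.0}},
    "go":   {"s0": {"s0": 0.0, "s1": 1.0}, "s1": {"s0": 0.0, "s1": 0.0}},
}

def step(s, a, rng=random):
    """Sample s' ~ P_a(s, .) and return (s', immediate reward)."""
    next_states = list(P[a][s].keys())
    probs = list(P[a][s].values())
    s_next = rng.choices(next_states, weights=probs, k=1)[0]
    return s_next, R[a][s][s_next]
```

Each row of `P[a]` sums to 1, so `P[a]` is a stochastic matrix for every action, as the definition requires.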

Policy Specification
A policy is a function $$\pi$$ that specifies the action $$a = \pi(s)$$ that the decision maker will choose when it is in state $$s$$.

Once a Markov decision process is combined with a policy, the action for each state is fixed and the resulting combination behaves like a Markov chain, since the transition probability reduces to $$\Pr(s_{t+1}=s' \mid s_t = s) = \Pr(s_{t+1}=s' \mid s_t = s, a_t=\pi(s)) = P_{\pi(s)}(s,s')$$
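This collapse of the MDP into a Markov chain is mechanical: substituting $$\pi(s)$$ for the action selects one transition distribution per state. A sketch, reusing a hypothetical two-state transition kernel:

```python
# P[a][s][s'] for a hypothetical two-state MDP (made-up numbers).
P = {
    "stay": {"s0": {"s0": 0.9, "s1": 0.1}, "s1": {"s0": 0.1, "s1": 0.9}},
    "go":   {"s0": {"s0": 0.2, "s1": 0.8}, "s1": {"s0": 0.8, "s1": 0.2}},
}

pi = {"s0": "go", "s1": "stay"}  # an arbitrary deterministic policy

# Markov-chain transition probabilities: P_pi(s, s') = P_{pi(s)}(s, s')
P_pi = {s: P[pi[s]][s] for s in pi}
# P_pi["s0"] == {"s0": 0.2, "s1": 0.8}  (the "go" row for s0)
```

`P_pi` no longer mentions actions at all: it is an ordinary Markov transition matrix.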

The goal is to choose a policy $$\pi$$ that maximizes some cumulative function of the stochastic rewards.

Typically, the objective is the expected cumulative reward, a discounted sum over a potentially infinite horizon:


 * $$\mathbb{E}\left[\sum^{\infty}_{t=0} \gamma^t R_{a_t}(s_t, s_{t+1})\right]$$ (where we choose $$a_t = \pi(s_t)$$, i.e. actions given by the policy, and the expectation is taken over $$s_{t+1} \sim P_{a_t}(s_t, s_{t+1})$$)

where $$\gamma$$ is the discount factor satisfying $$0 \le \gamma \le 1$$, which is usually close to 1. (For example, $$\gamma = 1/(1+r)$$ for some discount rate $$r$$.)
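For a fixed policy, the expected discounted return can be computed numerically by repeatedly applying the one-step expectation $$V(s) \leftarrow \sum_{s'} P_{\pi(s)}(s,s')\left[R_{\pi(s)}(s,s') + \gamma V(s')\right]$$ (iterative policy evaluation, a standard method not spelled out above). A sketch on a made-up two-state MDP:

```python
# Hypothetical two-state MDP (same made-up shape as the definition):
# P[a][s][s'] are transition probabilities, R[a][s][s'] immediate rewards.
P = {
    "stay": {"s0": {"s0": 0.9, "s1": 0.1}, "s1": {"s0": 0.1, "s1": 0.9}},
    "go":   {"s0": {"s0": 0.2, "s1": 0.8}, "s1": {"s0": 0.8, "s1": 0.2}},
}
R = {
    "stay": {"s0": {"s0": 0.0, "s1": 0.0}, "s1": {"s0": 0.0, "s1": 1.0}},
    "go":   {"s0": {"s0": 0.0, "s1": 1.0}, "s1": {"s0": 0.0, "s1": 0.0}},
}
pi = {"s0": "go", "s1": "stay"}
gamma = 0.9  # discount factor close to 1

# Iterate the one-step expectation until V approximates the discounted return.
V = {s: 0.0 for s in pi}
for _ in range(1000):
    V = {
        s: sum(p * (R[pi[s]][s][s2] + gamma * V[s2])
               for s2, p in P[pi[s]][s].items())
        for s in V
    }
```

Since $$\gamma < 1$$, each iteration is a contraction, so `V` converges to the unique fixed point regardless of its initialization.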

Because of the Markov property, the optimal policy for this particular problem can indeed be written as a function of $$s$$ only, as assumed above.
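Because the optimal policy is a deterministic function of the state alone, a tiny finite MDP admits a brute-force search: enumerate all $$|A|^{|S|}$$ deterministic policies, evaluate each, and keep the best. This is only a sketch on a made-up two-state example (enumeration is hopeless beyond toy sizes; practical algorithms like value iteration are not shown here):

```python
from itertools import product

S = ["s0", "s1"]
A = ["stay", "go"]
# Hypothetical transition probabilities and rewards (made-up numbers).
P = {
    "stay": {"s0": {"s0": 0.9, "s1": 0.1}, "s1": {"s0": 0.1, "s1": 0.9}},
    "go":   {"s0": {"s0": 0.2, "s1": 0.8}, "s1": {"s0": 0.8, "s1": 0.2}},
}
R = {
    "stay": {"s0": {"s0": 0.0, "s1": 0.0}, "s1": {"s0": 0.0, "s1": 1.0}},
    "go":   {"s0": {"s0": 0.0, "s1": 1.0}, "s1": {"s0": 0.0, "s1": 0.0}},
}
gamma = 0.9

def evaluate(pi, iters=1000):
    """Expected discounted return of policy pi, one value per state."""
    V = {s: 0.0 for s in S}
    for _ in range(iters):
        V = {s: sum(p * (R[pi[s]][s][s2] + gamma * V[s2])
                    for s2, p in P[pi[s]][s].items()) for s in S}
    return V

# Enumerate every deterministic policy and keep the one with the best values.
best = max((dict(zip(S, acts)) for acts in product(A, repeat=len(S))),
           key=lambda pi: sum(evaluate(pi).values()))
```

For these particular numbers, "go" from `s0` and "stay" in `s1` collects the rewards, so the search returns that policy.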

The discount factor motivates the decision maker to favor taking actions early rather than postponing them indefinitely.