Reinforcement Learning/Policy iteration

Policy Iteration (PI) is an algorithm for finding the optimal policy of a Markov decision process (MDP control).

Policy iteration is a model-based algorithm: it requires knowledge of the reward function $$R(s,a)$$ and the transition model $$P(s'\mid s,a)$$.

Each iteration costs $$|A| \times |S|$$ for the policy improvement step (plus the cost of evaluating $$V^{\pi_i}$$), so the total complexity is on the order of $$|A| \times |S| \times k$$, where $$k$$ is the number of iterations needed for convergence. Since a deterministic policy assigns one of $$|A|$$ actions to each of $$|S|$$ states, there are at most $$|A|^{|S|}$$ distinct policies, so in theory $$k \le |A|^{|S|}$$ (e.g., at most $$4^{10} \approx 10^6$$ policies for an MDP with 10 states and 4 actions).

Because each iteration produces a policy at least as good as the previous one (see the proof below) and there are only finitely many deterministic policies, the algorithm converges to the global optimum.

State-action value Q
The state-action value of a policy $$\pi$$ is obtained by taking the specified action $$a$$ immediately, then following the policy $$\pi$$ thereafter:

$$Q^\pi(s,a) = R(s,a) + \gamma \sum_{s' \in S} P(s'\mid s,a) V^\pi(s')$$

Here, $$R(s,a)$$ is the reward function of the MDP, $$P(s'\mid s,a)$$ is the transition model, and $$\gamma \in [0,1)$$ is the discount factor.
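As a minimal sketch of this formula (the array names and shapes below are assumptions, not part of the original): with a reward table `R` of shape $$(|S|, |A|)$$, a transition tensor `P` of shape $$(|S|, |A|, |S|)$$, and a value vector `V` of shape $$(|S|,)$$, the Q values for all state-action pairs can be computed at once:

```python
import numpy as np

def state_action_values(R, P, V, gamma=0.9):
    """Q[s, a] = R(s, a) + gamma * sum_{s'} P(s'|s, a) * V(s').

    R: (S, A) reward table, P: (S, A, S) transition model,
    V: (S,) state values of the policy being evaluated.
    """
    # P @ V contracts the trailing s' axis, yielding the expected
    # next-state value for every (s, a) pair at once.
    return R + gamma * P @ V
```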

Algorithm

 * Set $$i=0$$
 * Initialize $$\pi_0(s)$$ randomly for all states $$s$$
 * While $$i = 0$$ or $$\|\pi_i - \pi_{i-1}\|_1 > 0$$ (the L1-norm is nonzero iff the policy changed for at least one state):
   * Policy evaluation: compute $$V^{\pi_i}$$, e.g., by solving the linear system $$V^{\pi_i}(s) = R(s,\pi_i(s)) + \gamma \sum_{s' \in S} P(s'\mid s,\pi_i(s)) V^{\pi_i}(s') \quad \forall s \in S$$
   * Policy improvement: compute the state-action value of policy $$\pi_i$$ for all $$s \in S$$ and all $$a \in A$$: $$Q^{\pi_i}(s,a) = R(s,a) + \gamma \sum_{s' \in S} P(s'\mid s,a) V^{\pi_i}(s')$$
   * Compute the new policy $$\pi_{i+1}$$ by choosing, for each state, the action with the maximum state-action value, as in the sketch below: $$\pi_{i+1}(s) = \arg \max_a Q^{\pi_i}(s,a) \quad \forall s \in S$$
   * Set $$i = i + 1$$
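A runnable sketch of the full loop, under the same array-shape assumptions as above. Instead of iterating policy evaluation to convergence, this version evaluates each policy exactly by solving the linear Bellman system, a common choice for small tabular MDPs; the termination test "policy unchanged" is equivalent to the L1-norm test in the pseudocode.

```python
import numpy as np

def policy_iteration(R, P, gamma=0.9):
    """Tabular policy iteration.

    R: (S, A) reward table, P: (S, A, S) transition model.
    Returns the optimal deterministic policy and its value function.
    """
    n_states, n_actions = R.shape
    pi = np.zeros(n_states, dtype=int)  # pi_0: arbitrary initial policy
    idx = np.arange(n_states)
    while True:
        # Policy evaluation: solve V = R_pi + gamma * P_pi @ V exactly.
        R_pi = R[idx, pi]                                   # (S,)
        P_pi = P[idx, pi]                                   # (S, S)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to Q^{pi_i}.
        Q = R + gamma * P @ V                               # (S, A)
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):  # no state changed -> converged
            return pi, V
        pi = new_pi
```

Solving the linear system costs $$O(|S|^3)$$ per iteration but avoids having to pick an inner convergence threshold for iterative evaluation.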

Explanation
In each iteration, the greedy improvement can only increase the value, since by definition of the maximum

$$\max_a Q^{\pi_i}(s,a) \ge Q^{\pi_i}(s, \pi_i(s)) = V^{\pi_i}(s) \quad \forall s \in S$$

Proof
$$\begin{align} V^{\pi_i}(s) \le & ~\max_a Q^{\pi_i}(s,a) \\ = & ~ \max_a \Big[ R(s,a) + \gamma \sum_{s' \in S} P(s'\mid s,a) V^{\pi_i}(s') \Big] \\ = & ~ R(s, \pi_{i+1}(s)) + \gamma \sum_{s' \in S} P(s'\mid s,\pi_{i+1}(s)) V^{\pi_i}(s') \leftarrow \text{by definition } \pi_{i+1}(s) \text{ is the action with the maximum Q value} \\ \le & ~ R(s, \pi_{i+1}(s)) + \gamma \sum_{s' \in S} P(s'\mid s,\pi_{i+1}(s)) \Big[ \max_{a'} Q^{\pi_i}(s', a') \Big] \\ = & ~ R(s, \pi_{i+1}(s)) + \gamma \sum_{s' \in S} P(s'\mid s,\pi_{i+1}(s)) \Bigg[ R(s', \pi_{i+1}(s')) + \gamma \sum_{s'' \in S} P(s''\mid s', \pi_{i+1}(s')) V^{\pi_i}(s'') \Bigg] \\ \vdots & \\ \le & ~ V^{\pi_{i+1}}(s) \end{align} $$

Unrolling this expansion over all future time steps replaces $$V^{\pi_i}$$ everywhere by the returns of following $$\pi_{i+1}$$, which is exactly $$V^{\pi_{i+1}}(s)$$; hence the policy improves monotonically in every state.
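The monotonic-improvement claim is also easy to check numerically. A small sketch on a random MDP (all names and shapes here are illustrative assumptions): evaluate an arbitrary policy exactly, improve it greedily once, and verify $$V^{\pi_{i+1}}(s) \ge V^{\pi_i}(s)$$ in every state.

```python
import numpy as np

def evaluate(pi, R, P, gamma=0.9):
    """Exact evaluation of a deterministic policy via the linear Bellman system."""
    n = R.shape[0]
    R_pi = R[np.arange(n), pi]
    P_pi = P[np.arange(n), pi]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

rng = np.random.default_rng(0)
R = rng.standard_normal((5, 3))               # random rewards: 5 states, 3 actions
P = rng.random((5, 3, 5))
P /= P.sum(axis=-1, keepdims=True)            # normalize rows into distributions

pi = rng.integers(0, 3, size=5)               # arbitrary policy pi_i
V = evaluate(pi, R, P)
pi_next = (R + 0.9 * P @ V).argmax(axis=1)    # one greedy improvement step: pi_{i+1}
assert np.all(evaluate(pi_next, R, P) >= V - 1e-12)  # V^{pi_{i+1}} >= V^{pi_i} everywhere
```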