Reinforcement Learning/Value Iteration

= Policy iteration vs Value iteration =

Policy iteration and value iteration both converge to the same optimal policy.
 * Policy iteration alternates between computing the value of the current policy and improving the policy, yielding the optimal value and the optimal policy together.
 * Value iteration:
  * Maintains the optimal value of starting in a state $$s$$ given a finite number of steps $$k$$ left in the episode.
  * Iterates to consider longer and longer episodes.
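The finite-horizon view above can be written as a recursion, assuming zero value at the horizon, $$V_0(s) = 0$$ for all states:

$$V_{k+1}(s) = \max_a \left[ R(s,a) + \gamma \sum_{s' \in S} P(s'|s,a) V_k(s') \right]$$

Here $$V_k(s)$$ is the optimal value of starting in $$s$$ with $$k$$ steps remaining; letting $$k \to \infty$$ recovers the infinite-horizon optimal value.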

= Algorithm =

The value function of a policy $$\pi$$ is the solution to the Bellman equation:

$$V^\pi (s) = R^\pi(s) + \gamma \sum_{s' \in S} P^\pi (s'|s) V^\pi(s')$$

The Bellman backup operator $$\mathcal{B}$$ is applied to a value function and returns a new value function. The Bellman backup improves the value if improvement is possible:

$$\mathcal{B}V (s) = \max_a \left[ R(s,a) + \gamma \sum_{s' \in S} P (s'|s,a) V(s') \right]$$

$$\mathcal{B}V$$ yields a new value function over all states $$s$$.
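A minimal sketch of the Bellman backup and the value-iteration loop built from it. The toy MDP here (two states, two actions, the transition tensor `P`, reward matrix `R`, and `gamma = 0.9`) is an illustrative assumption, not from the text:

```python
import numpy as np

# Toy MDP (assumed for illustration): P[a][s][s'] is the probability of
# landing in s' after taking action a in state s; R[s][a] is the reward.
P = np.array([
    [[0.9, 0.1],    # action 0
     [0.2, 0.8]],
    [[0.5, 0.5],    # action 1
     [0.0, 1.0]],
])
R = np.array([
    [1.0, 0.0],     # rewards in state 0 for actions 0, 1
    [0.0, 2.0],     # rewards in state 1 for actions 0, 1
])
gamma = 0.9

def bellman_backup(V):
    # B V(s) = max_a [ R(s,a) + gamma * sum_{s'} P(s'|s,a) V(s') ]
    Q = R + gamma * np.einsum('ast,t->sa', P, V)  # Q[s, a]
    return Q.max(axis=1)

# Value iteration: apply the backup until the value function stops changing.
V = np.zeros(2)
for _ in range(1000):
    V_new = bellman_backup(V)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
```

Because each backup considers one more step of lookahead, the k-th iterate is exactly the optimal value with k steps left, matching the finite-horizon view above; the loop stops once longer horizons no longer change the values.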