Reinforcement Learning/Monte Carlo Policy Evaluation

The goal is to estimate $$V^\pi(s)$$ by generating many episodes under policy $$\pi$$.


 * An episode is a sequence of states, actions, and rewards ($$s_1, a_1, r_1, s_2, a_2, r_2, \cdots$$) generated from a Markov Decision Process (MDP) under policy $$\pi$$.


 * In this method, we simply simulate many trajectories under $$\pi$$ and average the returns observed from each state.


 * The error of the estimated value decreases as $$1/\sqrt{N}$$, where $$N$$ is the number of sampled trajectories (see the short numerical sketch after this list).
 * This method can be used only for episodic decision processes, meaning that the trajectories are finite and terminate after a finite number of steps.
 * The evaluation does NOT require a model of the dynamics or of the rewards.
 * This method does NOT assume the states are Markov.
 * It is generally a high-variance estimator. Reducing the variance can require a lot of data, so in cases where data is expensive to acquire or the stakes are high, MC may be impractical.
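
To make the $$1/\sqrt{N}$$ scaling concrete, here is a minimal numerical sketch (assuming NumPy is available); the "returns" are drawn from a toy distribution, but the same scaling applies to averaged episode returns:

```python
import numpy as np

# The standard error of a Monte Carlo average shrinks like 1/sqrt(N).
rng = np.random.default_rng(0)
true_value = 1.0

for n in [10, 100, 1000, 10000]:
    # Repeat the experiment 2000 times to measure the spread of the estimate.
    estimates = rng.normal(loc=true_value, scale=1.0, size=(2000, n)).mean(axis=1)
    print(f"N={n:6d}  std of estimate ~ {estimates.std():.4f}  1/sqrt(N) = {1 / np.sqrt(n):.4f}")
```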

There are different types of Monte Carlo policy evaluation:


 * 1) First-visit Monte Carlo
 * 2) Every-visit Monte Carlo
 * 3) Incremental Monte Carlo

= First-visit Monte Carlo =

Algorithm:
Initialize $$N(s) = 0, G(s) = 0 \forall s \in S$$

Loop:


 * Sample episode $$i = s_{i,1}, a_{i,1}, r_{i,1},s_{i,2}, a_{i,2}, r_{i,2}, \cdots ,s_{i,T_i}$$
 * Define $$G_{i,t} = r_{i,t} + \gamma r_{i,t+1} + \gamma^2 r_{i,t+2} + \cdots + \gamma^{T_i - t} r_{i,T_i}$$ as the return from time step $$t$$ onwards in the $$i$$th episode
 * For each state $$s$$ visited in episode $$i$$
 * For the first time $$t$$ that state $$s$$ is visited in episode $$i$$
 * Increment counter of total first visits: $$N(s) = N(s) + 1$$
 * Increment total return $$G(s) = G(s) + G_{i,t}$$
 * Update estimate $$V^\pi(s) = G(s)/N(s)$$
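
A minimal sketch of this algorithm in Python is shown below; `sample_episode(policy)` is an assumed helper (not defined in these notes) that returns one episode as a list of (state, action, reward) tuples generated under the policy:

```python
from collections import defaultdict

def first_visit_mc_evaluation(sample_episode, policy, num_episodes, gamma=1.0):
    """First-visit Monte Carlo policy evaluation (sketch)."""
    N = defaultdict(int)          # N(s): number of first visits to s
    G_total = defaultdict(float)  # G(s): sum of returns following first visits to s
    V = defaultdict(float)        # estimate of V^pi(s)

    for _ in range(num_episodes):
        episode = sample_episode(policy)  # list of (state, action, reward)

        # Compute the return G_{i,t} for every time step by scanning backwards.
        returns = [0.0] * len(episode)
        G = 0.0
        for t in reversed(range(len(episode))):
            _, _, r = episode[t]
            G = r + gamma * G
            returns[t] = G

        # Update statistics only at the first visit of each state in the episode.
        seen = set()
        for t, (s, _, _) in enumerate(episode):
            if s in seen:
                continue
            seen.add(s)
            N[s] += 1
            G_total[s] += returns[t]
            V[s] = G_total[s] / N[s]

    return V
```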

Properties:

 * The first-visit MC estimator of $$V^\pi$$ is an unbiased estimator of the true $$\mathbb{E}_\pi [G_t \mid s_t = s]$$. (Read more about Bias of an estimator.)
 * By the law of large numbers, as $$N(s) \rightarrow \infty$$, $$V^\pi(s) \rightarrow \mathbb{E}_\pi[G_t \mid s_t = s]$$.

= Every-visit Monte Carlo =

Algorithm:
Initialize $$N(s) = 0, G(s) = 0 \forall s \in S$$

Loop:


 * Sample episode $$i = s_{i,1}, a_{i,1}, r_{i,1},s_{i,2}, a_{i,2}, r_{i,2}, \cdots ,s_{i,T_i}$$
 * Define $$G_{i,t} = r_{i,t} + \gamma r_{i,t+1} + \gamma^2 r_{i,t+2} + \cdots + \gamma^{T_i - t} r_{i,T_i}$$ as the return from time step $$t$$ onwards in the $$i$$th episode
 * For each state $$s$$ visited in episode $$i$$
 * For every time $$t$$ that state $$s$$ is visited in episode $$i$$
 * Increment counter of total visits: $$N(s) = N(s) + 1$$
 * Increment total return $$G(s) = G(s) + G_{i,t}$$
 * Update estimate $$V^\pi(s) = G(s)/N(s)$$
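
Relative to the first-visit sketch above, only the per-episode update changes: statistics are updated at every occurrence of a state rather than only the first one. A minimal sketch of that step, reusing the names from the earlier function (those names are assumptions of this example):

```python
def every_visit_update(episode, returns, N, G_total, V):
    """Per-episode update for every-visit MC: replaces the `seen`-set loop
    in the first-visit sketch; every occurrence of a state contributes."""
    for t, (s, _, _) in enumerate(episode):
        N[s] += 1
        G_total[s] += returns[t]
        V[s] = G_total[s] / N[s]
```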

Properties:

 * The every-visit MC estimator of $$V^\pi$$ is a biased estimator of the true $$V^\pi(s) = \mathbb{E}_\pi [G_t \mid s_t = s]$$. (Read more about Bias of an estimator.)
 * The every-visit MC estimator often has lower MSE (variance + bias²) than the first-visit estimator, because it collects more data per episode by counting every visit.
 * The every-visit estimator is a consistent estimator: as the number of simulated episodes grows, the estimate converges to the true value, and its bias asymptotically goes to zero.

= Incremental Monte Carlo =

Incremental MC policy evaluation is a more general form of policy evaluation that can be applied to both the first-visit and every-visit policy evaluation algorithms.

The benefit of the incremental MC algorithm is that it can be applied to cases where the system is non-stationary. It does this by giving higher weight to newer data.

In both the first-visit and every-visit MC algorithms, the value function is updated by the following equation: $$V^\pi(s) = V^\pi(s) \frac{N(s)-1}{N(s)} + \frac{G_{i,t}}{N(s)} = V^\pi(s) + \frac{1}{N(s)} \big(G_{i,t} - V^\pi(s) \big)$$ This equation is easily derived by tracking the values of $$V^\pi(s)$$, $$G(s)$$, and $$N(s)$$ each time the value function is updated.
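
As a short check, write $$V_N$$ for the estimate after the $$N$$th recorded return $$G_N$$ for state $$s$$, so $$V_N = \tfrac{1}{N}\sum_{k=1}^N G_k$$ (this notation is introduced here only for the derivation). Then $$V_N = \frac{1}{N}\Big(G_N + \sum_{k=1}^{N-1} G_k\Big) = \frac{1}{N}\big(G_N + (N-1) V_{N-1}\big) = V_{N-1}\frac{N-1}{N} + \frac{G_N}{N} = V_{N-1} + \frac{1}{N}\big(G_N - V_{N-1}\big),$$ which is exactly the update above with $$G_N = G_{i,t}$$ and $$N = N(s)$$.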

If we change the update equation to the following, we arrive at the incremental MC algorithm, which has both first-visit and every-visit variations: $$V^\pi(s) = V^\pi(s) + \alpha \big(G_{i,t} - V^\pi(s) \big)$$ If we set $$\alpha = 1/N(s)$$, we recover the original first-visit or every-visit MC algorithm, but if we set $$\alpha > 1/N(s)$$ (for example, a constant step size), we obtain an algorithm that gives more weight to newer data and is better suited to non-stationary domains.
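
A minimal sketch of incremental (constant step-size) MC evaluation, again assuming the hypothetical `sample_episode(policy)` helper from the first-visit sketch; with a fixed `alpha`, recent episodes receive exponentially more weight, while using $$\alpha = 1/N(s)$$ would recover the ordinary estimators:

```python
def incremental_mc_evaluation(sample_episode, policy, num_episodes, alpha, gamma=1.0):
    """Incremental (constant step-size) Monte Carlo policy evaluation (sketch)."""
    V = {}  # estimate of V^pi(s), initialized lazily to 0
    for _ in range(num_episodes):
        episode = sample_episode(policy)  # list of (state, action, reward)

        # Backward pass: G accumulates the return from each time step onwards.
        G = 0.0
        for s, _, r in reversed(episode):
            G = r + gamma * G
            v = V.get(s, 0.0)
            # Every-visit style incremental update toward the observed return.
            V[s] = v + alpha * (G - v)
    return V
```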