Reinforcement Learning/Temporal Difference Learning

Temporal difference (TD) learning is a central and novel idea in reinforcement learning.


 * It is a combination of Monte Carlo and dynamic programming methods
 * It is a model-free learning algorithm
 * It both bootstraps (builds on top of the previous best estimate) and samples
 * It can be used in both episodic and infinite-horizon (non-episodic) domains
 * It immediately updates the estimate of $$V$$ after each $$(s, a, r, s')$$ tuple
 * It requires the system to be Markovian
 * It is a biased estimator of the value function, but often has much lower variance than the Monte Carlo estimator
 * It converges to the true value in the finite-state (tabular) case, but does not always converge with an infinite number of states (i.e., under function approximation)

Algorithm Temporal Difference Learning TD(0)
TD learning can be applied as a spectrum between pure Monte Carlo and dynamic programming, but the simplest form, TD(0), is as follows:


 * Input: $$\alpha$$
 * Initialize $$V^\pi(s) = 0, \forall s \in S$$
 * Loop
 * Sample tuple $$(s_t, a_t, r_t, s_{t+1})$$
 * Update $$V^\pi(s_t)$$: $$V^\pi(s_t) = V^\pi(s_t) + \alpha \big(\underbrace{[r_t + \gamma V^\pi(s_{t+1})]}_{\text{TD target}} - V^\pi(s_t) \big)$$

The temporal difference error is defined as $$\delta_t = r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)$$
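The loop above can be sketched in Python. The random-walk environment below (states 0–4, terminals at both ends, reward 1 only on reaching the right end) is a hypothetical stand-in, chosen only to make the sketch self-contained:

```python
import random

# TD(0) policy evaluation on a small random-walk chain.
# The environment is hypothetical: states 0..4, terminals at both
# ends, reward 1.0 only on reaching the right end.
N_STATES = 5
GAMMA = 1.0   # discount factor
ALPHA = 0.1   # step size (the algorithm's input alpha)

def sample_episode():
    """Sample (s_t, r_t, s_{t+1}) tuples under a fixed uniform-random policy."""
    s = N_STATES // 2  # start in the middle state
    transitions = []
    while 0 < s < N_STATES - 1:
        s_next = s + random.choice([-1, 1])
        r = 1.0 if s_next == N_STATES - 1 else 0.0
        transitions.append((s, r, s_next))
        s = s_next
    return transitions

# Initialize V(s) = 0 for all s
V = [0.0] * N_STATES

random.seed(0)
for _ in range(5000):
    for s, r, s_next in sample_episode():
        td_target = r + GAMMA * V[s_next]  # TD target
        delta = td_target - V[s]           # TD error delta_t
        V[s] += ALPHA * delta              # update immediately after each tuple
```

With enough episodes, the interior estimates approach the true values of this chain (0.25, 0.5, 0.75); the terminal states keep $$V = 0$$, which serves as the bootstrap base case.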

n-step Return
$$n=0:$$ $$G^{(0)}={R_{t} + \gamma V (s_{t+1})}$$ is TD(0)

$$n=1:$$ $$G^{(1)}={R_{t} + \gamma R_{t+1} + \gamma^2 V (s_{t+2})}$$

and so on up to infinity

$$n=\infty:$$ $$G^{(\infty)}={R_{t} + \gamma R_{t+1} + ... + \gamma^{T-1} R_{t+T-1}}$$ is MC

The n-step TD update, TD(n), is defined as

$$V^\pi(s_t) = V^\pi(s_t) + \alpha \big(G^{(n)} - V^\pi(s_t)\big)$$
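The n-step return and its update can be sketched as follows; the trajectory numbers are hypothetical, chosen only to make the arithmetic concrete:

```python
GAMMA = 0.9
ALPHA = 0.5

def n_step_return(rewards, next_values, n, gamma):
    """Compute G^(n) = R_t + gamma R_{t+1} + ... + gamma^n R_{t+n}
                       + gamma^(n+1) V(s_{t+n+1}),
    using the indexing above, where n=0 recovers the TD(0) target.
    rewards[k] holds R_{t+k}; next_values[k] holds V(s_{t+k+1})."""
    g = sum(gamma ** k * rewards[k] for k in range(n + 1))
    g += gamma ** (n + 1) * next_values[n]  # bootstrap term
    return g

# Hypothetical trajectory fragment: rewards and current value estimates.
rewards = [1.0, 0.0, 2.0]      # R_t, R_{t+1}, R_{t+2}
next_values = [0.5, 0.3, 0.0]  # V(s_{t+1}), V(s_{t+2}), V(s_{t+3})

g0 = n_step_return(rewards, next_values, 0, GAMMA)  # TD(0) target: 1 + 0.9*0.5

# TD(n) update for the visited state s_t:
v_st = 0.2                   # hypothetical current estimate V(s_t)
v_st += ALPHA * (g0 - v_st)  # V(s_t) <- V(s_t) + alpha (G^(n) - V(s_t))
```

Setting $$n=0$$ reproduces the TD(0) target from earlier, while larger $$n$$ uses more sampled rewards and bootstraps further into the future.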