Reinforcement Learning

Reinforcement learning (RL) is the process of optimizing rewards in a sequential decision making process under uncertainty.

Challenges in RL:
Designing a reinforcement learning algorithm is particularly challenging for the following reasons:


 * Planning: decisions involve reasoning not only about the immediate benefit of an action but also about its longer-term consequences
 * Temporal credit assignment is hard: it is often unclear which past actions led to an observed reward
 * Exploration: the agent only learns from what it has tried in the past, so it must explore new actions to learn more.

Imitation learning reduces RL to supervised learning by learning from the experience of others (i.e., learning by watching someone else perform the task).

Sequential decision making under uncertainty
RL applies to sequential decision making under uncertainty with the objective of optimizing reward.

Here we assume the system is Markov:

The future of the system (world/agent) depends only on its current state and the action taken, not on the whole history.


 * From another point of view, the state has enough aggregated information from the history of the system to predict the future. Any system can become Markovian if we consider the whole history as the state.

$$p(s_{t+1} | s_{t}, a_t) = p(s_{t+1} | h_{t}, a_t) $$

Types of Sequential Decision Processes:


 * Bandits: actions have no influence on the next observation. Rewards are immediate.
 * Markov Decision Process (MDP) and Partially Observable Markov Decision Process (POMDP): actions influence future observations. RL algorithms for these systems may require credit assignment and strategic (non-myopic) actions.

Components of RL algorithm

 * Model: a representation of how the world changes in response to the agent's actions.
 * The transition/dynamics model gives the probability of the next state given the current state and the action taken: $$P(s_{t+1} = s' | s_t = s, a_t = a)$$
 * The dynamics model might be known and used (model-based) or unknown (model-free) in the RL algorithm.


 * Policy ($$\pi$$) determines how the agent chooses actions
 * Deterministic policy $$\pi(s) = a$$
 * Stochastic policy $$\pi(a|s) = P(a_t=a | s_t = s)$$
 * The basic problem of reinforcement learning is to find the policy that returns the maximum value.
 * Value function: the expected future rewards from being in a state (and/or taking an action) when following a particular policy.
 * Reward model: predicts the immediate reward given the state and the action taken: $$r(s_t = s, a_t = a) = \mathbb{E} [r_t | s_t = s, a_t = a ]$$
 * The reward itself can be deterministic or stochastic; therefore we use an expected value.
 * The value function is computed from the reward model and policy as the expected immediate and discounted future rewards: $$V^\pi(s_t = s) = \mathbb{E} [ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots | s_t = s ] $$
 * The discount factor $$\gamma$$ weighs immediate vs. future rewards.
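As a concrete illustration of these components, here is a tiny two-state, two-action MDP in Python. All numbers and array conventions (`P[a][s][s']`, `r[s][a]`) are hypothetical, chosen only for illustration:

```python
import numpy as np

# Transition model: P[a][s][s'] = probability of landing in s'
# after taking action a in state s (toy numbers, for illustration).
P = np.array([
    [[0.9, 0.1],    # action 0: rows are current states, columns next states
     [0.2, 0.8]],
    [[0.5, 0.5],    # action 1
     [0.1, 0.9]],
])

# Reward model: r[s][a] = expected immediate reward for action a in state s.
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])

# A deterministic policy maps each state to an action: pi(s) = a.
pi = np.array([0, 1])

# Discount factor weighing immediate vs. future rewards.
gamma = 0.9
```

Each row of `P[a]` sums to 1, since it is a probability distribution over next states.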

Policy Evaluation
Policy evaluation is the process of evaluating the expected reward for a state $$s$$, given the transition model, policy, and immediate reward function.

Note that policy evaluation is related to the problem of finding the optimal policy, but it does not itself involve an optimization step.

Policy evaluation builds on top of the following hierarchy of concepts:


 * 1) Markov Chain: no actions, no rewards
 * 2) Markov Reward Process (MRP): Markov Chain + rewards
 * 3) Markov Decision Process (MDP): MRP + actions (read more about MDP here)
 * 4) Policy Evaluation: MDP + policy

State evaluation
In an MRP we can calculate a value for each state, which is called state evaluation. There are several methods for state evaluation:


 * 1) Monte Carlo simulation
 * 2) Analytic solution
 * 3) Dynamic programming / Iterative solution

Furthermore, as explained here, an MDP is reduced to an MRP by choosing a specific policy. Therefore, we can think of the state evaluation process as a policy evaluation process which determines what is the expected value of rewards for each state in the MRP under a certain policy.
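Since fixing a policy reduces an MDP to an MRP, the analytic solution can be written directly: the state values satisfy the linear system $$V = R + \gamma P V$$, which can be solved in closed form as $$V = (I - \gamma P)^{-1} R$$. A minimal sketch, with toy numbers assumed for illustration:

```python
import numpy as np

def mrp_state_values(P, R, gamma):
    """Analytic state values of an MRP: solve V = R + gamma * P @ V,
    i.e., (I - gamma * P) V = R, as a linear system."""
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P, R)

# Hypothetical 2-state MRP (made-up numbers).
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])   # P[s][s'] = next-state probabilities
R = np.array([1.0, 0.0])      # R[s] = expected immediate reward in s

V = mrp_state_values(P, R, gamma=0.9)
# The solution satisfies the Bellman equation exactly:
assert np.allclose(V, R + 0.9 * P @ V)
```

This direct solve costs roughly $$O(|S|^3)$$, which is why iterative methods are preferred for large state spaces.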

Dynamic programming / iterative solution
Policy evaluation is possible using an iterative process, through the following algorithm:


 * Initialize $$V^\pi_0(s) = 0 $$ for all $$s$$


 * For $$k = 1$$ until convergence, for all $$s \in S$$: $$V_k^\pi (s) = r(s, \pi(s)) + \gamma \sum_{s' \in S} p(s'|s, \pi(s)) V_{k-1}^\pi(s')$$
 * $$V^\pi_k(s)$$ is the exact $$k$$-horizon value of state $$s$$ under policy $$\pi$$.
 * $$V^\pi_k(s)$$ is an estimate of the infinite-horizon value of state $$s$$ under policy $$\pi$$: $$V^\pi(s) = \mathbb{E}_\pi [G_t|s_t=s] \approx \mathbb{E}_\pi [r_t + \gamma V^\pi_{k-1}(s_{t+1})|s_t=s]$$

In dynamic programming, we bootstrap, or estimate the value of the next state using our current estimate, $$V^\pi_{k-1}(s')$$.
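The iterative algorithm above can be sketched in Python. The array conventions (`P[a, s, s']` transition tensor, `r[s, a]` reward table, deterministic `pi[s]`) are assumptions chosen for illustration:

```python
import numpy as np

def iterative_policy_evaluation(P, r, pi, gamma, tol=1e-8):
    """Iterative policy evaluation:
    V_k(s) = r(s, pi(s)) + gamma * sum_{s'} p(s'|s, pi(s)) * V_{k-1}(s').

    P[a, s, s'] : transition probabilities, r[s, a] : expected rewards,
    pi[s] : deterministic policy (toy conventions assumed)."""
    n_states = P.shape[1]
    V = np.zeros(n_states)                  # V_0(s) = 0 for all s
    while True:
        # One Bellman backup under the fixed policy pi.
        V_new = np.array([
            r[s, pi[s]] + gamma * P[pi[s], s] @ V
            for s in range(n_states)
        ])
        if np.max(np.abs(V_new - V)) < tol:  # converged (sup-norm)
            return V_new
        V = V_new
```

For $$\gamma < 1$$ each backup is a contraction, so the loop converges to the unique fixed point $$V^\pi$$.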

Monte Carlo policy evaluation

 * In this method, we simply simulate many trajectories (episodes) and average the observed returns.


 * The error of the estimate decreases as $$1/\sqrt{N}$$, where $$N$$ is the number of trajectories sampled.
 * This method can be used only for episodic decision processes, meaning that the trajectories are finite and terminate after a number of steps.
 * The evaluation does not require knowledge of the dynamics or reward models.
 * This method does NOT assume states to be Markov.
 * Generally a high-variance estimator. Reducing the variance can require a lot of data; therefore, in cases where data is expensive to acquire or the stakes are high, MC may be impractical.

There are different types of Monte Carlo policy evaluation:


 * 1) First-visit Monte Carlo
 * 2) Every-visit Monte Carlo
 * 3) Incremental Monte Carlo

Read more about different types of /Monte Carlo Policy Evaluation/.
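A first-visit Monte Carlo evaluator can be sketched as follows. The trajectory format (each episode a list of `(state, reward)` pairs, where the reward follows the visit to the state) is an assumption made for illustration:

```python
import numpy as np

def first_visit_mc(episodes, gamma, n_states):
    """First-visit Monte Carlo policy evaluation.

    episodes : list of trajectories, each a list of (state, reward) pairs.
    Averages the return G_t observed at the FIRST visit to each state."""
    returns_sum = np.zeros(n_states)
    returns_cnt = np.zeros(n_states)
    for episode in episodes:
        # Index of the first visit to each state in this episode.
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        # Compute returns backwards: G_t = r_t + gamma * G_{t+1}.
        G_at = [0.0] * len(episode)
        G = 0.0
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            G_at[t] = G
        for s, t in first_visit.items():
            returns_sum[s] += G_at[t]
            returns_cnt[s] += 1
    # Average return per state (states never visited stay at 0).
    return returns_sum / np.maximum(returns_cnt, 1)
```

Every-visit MC differs only in accumulating the return at every occurrence of a state, not just the first.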



Temporal Difference
$$V^\pi(s_t) \leftarrow V^\pi(s_t) + \alpha \big([r_t + \gamma V^\pi(s_{t+1})] - V^\pi(s_t) \big)$$
 * A combination of Monte Carlo and dynamic programming methods
 * Model-free
 * Bootstraps (builds on top of previous best estimate) and samples
 * Can be used for both episodic and infinite-horizon (non-episodic) domains
 * Biased estimator of value function
 * Immediately updates estimate of V after each $$(s, a, r, s')$$

Read more about /Temporal Difference Learning/.
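The TD(0) update above translates almost line for line into code. This sketch assumes a tabular value function stored in a list or array:

```python
def td0_update(V, s, reward, s_next, alpha, gamma):
    """One TD(0) update after observing the transition (s, r, s'):
    V(s) <- V(s) + alpha * ([r + gamma * V(s')] - V(s)).

    The bracketed term is the TD target; the difference from V(s)
    is the TD error. V is a mutable tabular value function."""
    td_target = reward + gamma * V[s_next]
    td_error = td_target - V[s]
    V[s] = V[s] + alpha * td_error
    return V
```

Because the target bootstraps from the current estimate `V[s_next]`, the update can be applied immediately after each transition, with no need to wait for the episode to end.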

On-policy versus Off-policy learning
On-policy learning:


 * Direct experience
 * Learn to estimate and evaluate a policy from experience obtained from following that policy

Off-policy learning:


 * Learn to estimate and evaluate a policy using experience gathered from following a different policy
 * In a sense, off-policy learning is like an extrapolation problem: we try to make an educated guess from the available data about situations that have not been experienced before.

Markov Decision Process Control
MDP control means computing the optimal policy, i.e., the policy with the maximum value: $$\pi^*(s) = \arg \max_\pi V^\pi(s)$$ It can be proven that the optimal value function exists and is unique; however, multiple policies may attain the same optimal value function.

The optimal policy for an MDP in an infinite-horizon problem is:


 * Deterministic
 * Stationary (does not depend on time step)
 * Unique? Not necessarily, may have state-actions with identical optimal values

Policy Search Algorithms

 * Brute force: generally very expensive; the number of deterministic policies is $$|A|^{|S|}$$
 * Policy Iteration: more efficient than brute force
 * Value Iteration:
 * Idea: maintain the optimal value of starting in state $$s$$ when there are a finite number of steps $$k$$ left in the episode
 * We iterate to consider longer and longer episodes and eventually converge to the result from policy iteration.
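Value iteration can be sketched as follows, with the transition tensor `P[a, s, s']` and reward table `r[s, a]` as assumed (hypothetical) conventions:

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-8):
    """Value iteration:
    V_k(s) = max_a [ r(s, a) + gamma * sum_{s'} p(s'|s, a) * V_{k-1}(s') ].

    Returns the (approximate) optimal value function and a greedy
    deterministic policy. P[a, s, s'], r[s, a] are toy conventions."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # State-action values under the current estimate.
        Q = np.array([[r[s, a] + gamma * P[a, s] @ V
                       for a in range(n_actions)]
                      for s in range(n_states)])
        V_new = Q.max(axis=1)                # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < tol:  # converged (sup-norm)
            return V_new, Q.argmax(axis=1)   # greedy policy
        V = V_new
```

Each iteration extends the planning horizon by one step; for $$\gamma < 1$$ the backup is a contraction, so the values converge to the unique optimal value function.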

= References =

 * The major source of the content presented on this page and its subpages is the Stanford Reinforcement Learning course