RL 强化学习笔记

Xiaopeng Xu included in Technology

Jun 18, 2021 3481 words 7 minutes

Contents

RL 基础

2 多臂赌博机 (K-arm bandit)

只有动作 (action) 和对应的收益 (rewards)。无状态 (states)。
动作价值函数
增量式实现

多臂赌博机的$$\varepsilon$$- 贪心算法 (Espilon greedy)

探索 (exploration) vs 开发 (eploitation)

乐观初始值 Optimistic Initial Values– 鼓励在开始的时候多做探索
基于置信度上界 (Upper-Confidence-Bound, UCB) 的动作选择 – 鼓励在探索时多选择低频的动作 (action)。
- At is action, Nt(a) is number of action a. t is time-step. c controls degree of exploration.
Gradient Bandit Algorithms– Add action preference
- Where H(a) is preferences to action a.

3 有限马尔可夫决策过程 (Finite Markov Decision Processes)

有状态 (states)、动作 (**actions)**和收益 (rewards)。相比多臂赌博机增加了状态。
定义：
- In a finite MDP, the sets of states, actions, and rewards (S, A, and R) all have a finite number of elements. In this case, the random variables Rt and St have well defined discrete probability distributions dependent only on the preceding state St-1and action At-1.

“智能体-环境”交互接口 The Agent–Environment Interface

目标 (Goals)、收益 (Rewards)、回报 (Returns) 和分幕 (Episodes)

目标 Goals 和收益 Rewards
- That all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).
回报 Returns、收益 Rewards、折扣率 (discount rate) 和分幕 (episodes)
- R is reward, G is return, r is discount rate.
- This approach above makes sense in applications in which there is a natural notion of final time step, that is, when the agent–environment interaction breaks naturally into subsequences, which we call episodes
  - such as plays of a game, trips through a maze, or any sort of repeated interaction.

策略 (Policies) 和价值函数 (Value Functions)

状态价值函数 (State-value function)：the value function of a state s under a policy pai：
动作价值函数 (Action-value function): the value function of taking action a in state s under a policy pai, ()
最优状态价值函数 (Optimal state-value function)
最优动作价值函数(Optimal action-value function)

贝尔曼方程 (Bellman function)

状态价值函数State-value function
最优状态价值函数Optimal state-value function
最优动作价值函数Optimal action-value function

4 动态规划 (Dynamic Programming)

DP algorithms are obtained by turning Bellman equations (such as optimal state-value function and optimal action-value function) to update rules for improving approximations of the desired value functions in RL.

4.1 策略评估 (Policy Evaluation) (或预测 (Prediction))

迭代策略评估公式
迭代策略评估算法

4.2 策略改进 (Policy Improvement)

迭代策略改进公式

4.3 策略迭代 (Policy Iteration)

Sequence of monotonically improving policies and value functions:
- where denotes a policy evaluation and denotes a policy improvement
策略迭代算法

4.4 价值迭代 (Value Iteration) 80

价值迭代公式
价值迭代算法

4.6 广义策略迭代 (Generalized Policy Iteration, GPI) 84

表格型方法 (Tabular methods)

5 蒙特卡洛方法 (Monte Carlo Methods)

能够使用历史经验或数据进行新策略的学习

5.1 蒙特卡洛预测 (Monte Carlo Prediction) 90

给定策略 pai，计算状态价值函数，即蒙特卡洛策略评估。

5.2 动作价值的蒙特卡洛估计 (Monte Carlo Estimation of Action Values) 的问题，某些状态-动作二元组 (s, a) 可能永远也不会访问到.

5.3 蒙特卡洛控制 Monte Carlo Control 95

探索初始值蒙特卡洛算法 ( Monte Carlo Exploring Starts)

5.4 没有试探性出发假设的蒙特卡洛控制 Monte Carlo Control without Exploring Starts 98

同轨策略 (on-policy) vs 离轨策略 (off-policy)

How can we avoid the unlikely assumption of exploring starts?
- The only general way to ensure that all actions are selected infinitely often is for the agent to continue to select them. There are two approaches to ensuring this, resulting in what we call on-policy methods and off-policy methods.
同轨策略 (On-policy methods): attempt to evaluate or improve the policy that is used to make decisions
离轨策略 (off-policy methods): evaluate or improve a policy different from that used to generate the data.

首次访问 MC 控制算法（同轨策略）

5.5 基于重要度采样的离轨策略 (Off-policy Prediction via Importance Sampling) 101

How can they learn about the optimal policy while behaving according to an exploratory policy?
- A more straightforward approach is to use two policies, one that is learned about and that becomes the optimal policy, and one that is more exploratory and is used to generate behavior.
- Target policy: the policy being learned about.
- Behavior policy: the policy used to generate behavior.
In this case we say that learning is from data “off” the target policy, and the overall process is termed off-policy learning.
The relative probability of the trajectory under the target and behavior policiess (the importance-sampling ratio) is
加权重要度采样 (Weighted importance sampling)

5.6 增量式实现 (Incremental Implementation) 107

加权重要度采样公式：

MC 预测或策略评估（离轨策略）

5.7 离轨策略蒙特卡洛控制 (Off-policy Monte Carlo Control) 108

6 时序差分学习 (Temporal difference learning, TD)

相比蒙特卡洛方法，每步都更新而不是等到分幕结束。计算速度更快，准确性通常也更高。

6.1 时序差分预测 TD Prediction 117

A simple every-visit Monte Carlo method suitable for nonstationary environments
The simplest TD method
Comparison：
- Monte Carlo methodsmust wait until the end of the episode to determine the increment to V (St) (only then is Gt known), the target is Gt.
- TD methods need to wait only until the next time step. At time t + 1 they immediately form a target and make a useful update using the observed reward Rt+1 and the estimate V (St+1). whereas the target is Rt+1 + V (St+1).

表格 TD (0) 来估计 Vpai

MC 采用 6.3 来故居，TD 用 6.4 来故居。
TD error
MC error 可以写为 TD error 的和：

6.2 时序差分预测方法的优势 (Advantages of TD Prediction Methods) 122

相比 DP 方法，TD 不需要一个收益和概率分布的模型。
相比 MC 方法，
- TD 是一种在线 (on line) 算法，不需要等到幕 (episode) 结束再更新。
- TD 通常能够更快的收敛。
  - 随机游走 (Random Walk)

6.3 TD(0) 的最优性 (Optimality of TD(0)) 124

根据图所示的均方根误差度量，batch TD 方法能够比固定步长 MC 表现得更好

6.4 Sarsa：同轨策略下的时序差分控制 (TD Control) 127

情节 (episode) 由状态和状态-动作对的交替序列组成：
对应的动作价值函数公式：

Sarsa 算法

6.5 Q 学习 (Q-learning)：离轨策略下的时序差分控制 129

公式

悬崖徒步 (Cliff Walking)

6.6 期望 Sarsa (Expected Sarsa) 131

公式
悬崖徒步问题中的性能比较

6.7 最大化偏差 (Bias) 与双学习 (Double Learning) 133

双 Q 学习

Q 学习有最大化偏差问题，因此引入双 Q 学习
- 某个状态下的真实动作价值函数 q(s, a) 全为 0，但 Q 学习取最大值时对 q(s, a) 的估计会是有些大于 0，有些小于 0，从而产生最大化偏差。
公式

7 n 步自举法

TD 和 MC 通常不是最好的方法，走向了两个极端。n 步自举法是两种的折中，通常有更好的性能。

7.1 n 步时序差分 (TD) 预测 140

公式：

n 步 TD

7.2 n 步 Sarsa 144

公式
- 其中 t+n >= T。

n 步 Sarsa

7.3 n 步离轨策略学习 146

公式
重要度采样

离轨策略 n 步 Sarsa

7.5 不需要使用重要度采样的离轨策略学习方法 150

公式(和 n 步 SARSA 一样)

n步树回溯算法

8 基于表格型方法的规划和学习

增加了规划和学习，相比之前增加了从模型学习的反馈循环。

8.1 模型（Models）和规划 (Planning) 157

模型：指智能体可以用来预测环境对其动作的反馈的任何事物。
分布模型 (distribution model)：生成对所有可能结果的描述及其对应的概率分布的模型。
样本模型 (sample model)：从所有可能行中生成一个确定性的结果，这个结果通过概率分布采样得到。
规划：此处代表任何以环境模型 (model) 为输入，并生成或改进与其进行交互的策略的计算过程
状态空间规划算法的通用结构
- 1）所有的状态空间规划算法都会利用价值函数作为改善策略的关键中间步骤
- 2）通过仿真经验的回溯操作来计算价值函数。

单步 Q 规划

8.2 Dyna：集成在一起的规划、动作和学习 159

通过模型的学习，增加新的反馈循环过程
通用 Dyna 架构

表格型 Dyna-Q 算法

8.3 当模型错误的时候 164

Dyna-Q+: 为鼓励测试长期未采取的动作，为期增加额外收益

8.4 优先遍历 Prioritized Sweeping 166

优先级遍历的思想：
- 1）通过从目标状态反向搜索，是的遍历的范围更为集中；
- 2）为解决反响推演时，范围迅速扩大的问题，根据某种迫切性对更新进行优先级排序。
  - 例如：若某个“状态-动作”而远足在更新之后的价值变化是不可忽略的，则将其放入优先队列维护。这个队列按照价值改变的大小来进行优先级排序。