Deep Reinforcement Learning

I. PPO [2017]

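In recent years, several different approaches have been proposed for reinforcement learning with neural network function approximators. The leading contenders are deep Q-learning ("Human-level control through deep reinforcement learning"), vanilla policy gradient methods ("Asynchronous methods for deep reinforcement learning"), and trust region / natural policy gradient methods ("Trust region policy optimization"). However, there is still room for improvement in developing a method that is scalable (to large models and parallel implementations), data efficient, and robust (i.e., successful on a variety of problems without hyperparameter tuning):
- Q-learning (with function approximation) fails on many simple problems and is poorly understood.
- Vanilla policy gradient methods have poor data efficiency and robustness.
- Trust region policy optimization (TRPO) is relatively complicated, and it is not compatible with architectures that include noise (such as dropout) or parameter sharing (between the policy and value function, or with auxiliary tasks).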
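The paper "Proximal Policy Optimization Algorithms" seeks to improve this state of affairs by introducing an algorithm that attains the data efficiency and reliable performance of TRPO while using only first-order optimization. The paper proposes a new objective with a clipped probability ratio, which forms a pessimistic estimate (i.e., a lower bound) of the performance of the policy. To optimize the policy, the authors alternate between sampling data from the policy and performing several epochs of optimization on the sampled data.
The paper's experiments compare the performance of several versions of the surrogate objective and find that the version with the clipped probability ratio performs best. The paper also compares PPO with several algorithms from the earlier literature:
- On continuous control tasks, PPO performs better than the baseline algorithms.
- On Atari tasks, PPO performs significantly better than A2C (in terms of sample complexity) and similarly to ACER, while being much simpler.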

1.1 Background: Policy Optimization

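Policy gradient methods: policy gradient methods work by computing an estimator of the policy gradient and plugging it into a stochastic gradient ascent algorithm. The most commonly used gradient estimator has the form:
$$\hat{\mathbf{\vec g}} = \hat{\mathbb E}_t\left[\nabla_\theta \log \pi_\theta(a_t\mid s_t)\,\hat A_t\right] \tag{1}$$
where:
- $\pi_\theta$ is a stochastic policy.
- $\hat A_t$ is an estimator of the advantage function $Q_\pi(s_t,a_t) - V_\pi(s_t)$ at timestep $t$, where $Q_\pi(s,a)$ is the action-value function and $V_\pi(s)$ is the state-value function.
- $\hat{\mathbb E}_t[\cdot]$ is the empirical average over a finite batch of samples. The samples are collected by an algorithm that alternates between sampling and optimization.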
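Derivation:
Let $U_t$ denote the return from timestep $t$ onward. $U_t$ depends on the sequence $S_t,A_t,S_{t+1},A_{t+1},\cdots$ from time $t$ onward, so even after $s_t$ and $a_t$ have been observed at time $t$, $U_t$ is still a random variable.
The action-value function $Q_\pi(s_t,a_t) = \mathbb E[U_t\mid S_t=s_t,A_t=a_t]$ is the expected return given $(s_t,a_t)$, and the state-value function is $V_\pi(s_t) = \mathbb E_{A_t\sim \pi(\cdot\mid s_t;\theta)}[Q_\pi(s_t,A_t)]$.
For a good policy, the mean of $V_\pi(S)$ over the state $S$ should be large. We therefore define the objective function: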
$$J(\theta) = \mathbb E_S\left[V_\pi(S)\right] \tag{2}$$
Policy learning can then be described as the optimization problem:
$$\max_\theta J(\theta) \tag{3}$$
The policy gradient is:
$$\nabla_\theta J(\theta) \tag{4}$$
Since:
$$\begin{aligned} \frac{\partial V_\pi(s)}{\partial \theta} &= \frac{\partial}{\partial \theta}\sum_{a\in \mathcal A}\pi(a\mid s;\theta)\times Q_\pi(s,a) = \sum_{a\in \mathcal A}\frac{\partial\left(\pi(a\mid s;\theta)\times Q_\pi(s,a)\right)}{\partial \theta}\\ &= \sum_{a\in \mathcal A}\frac{\partial \pi(a\mid s;\theta)}{\partial \theta}\times Q_\pi(s,a) + \mathbb E_{A\sim \pi(\cdot\mid s;\theta)}\left[\frac{\partial Q_\pi(s,A)}{\partial \theta}\right]\\ &= \sum_{a\in \mathcal A}\pi(a\mid s;\theta)\times\frac{1}{\pi(a\mid s;\theta)}\times\frac{\partial \pi(a\mid s;\theta)}{\partial \theta}\times Q_\pi(s,a) + \mathbb E_{A\sim \pi(\cdot\mid s;\theta)}\left[\frac{\partial Q_\pi(s,A)}{\partial \theta}\right]\\ &= \sum_{a\in \mathcal A}\pi(a\mid s;\theta)\times\frac{\partial \log\pi(a\mid s;\theta)}{\partial \theta}\times Q_\pi(s,a) + \mathbb E_{A\sim \pi(\cdot\mid s;\theta)}\left[\frac{\partial Q_\pi(s,A)}{\partial \theta}\right]\\ &= \mathbb E_{A\sim \pi(\cdot\mid s;\theta)}\left[\frac{\partial \log\pi(A\mid s;\theta)}{\partial \theta}\times Q_\pi(s,A)\right] + \mathbb E_{A\sim \pi(\cdot\mid s;\theta)}\left[\frac{\partial Q_\pi(s,A)}{\partial \theta}\right]\\ &\simeq \mathbb E_{A\sim \pi(\cdot\mid s;\theta)}\left[\frac{\partial \log\pi(A\mid s;\theta)}{\partial \theta}\times Q_\pi(s,A)\right] \end{aligned} \tag{5}$$
The last step drops the term involving $\partial Q_\pi(s,A)/\partial\theta$, which is why the result is an approximation. Therefore:
$$\nabla_\theta J(\theta) = \nabla_\theta \mathbb E_S\left[V_\pi(S)\right] = \mathbb E_S\left[\nabla_\theta V_\pi(S)\right] \simeq \mathbb E_S \mathbb E_A\left[\frac{\partial \log\pi(A\mid S;\theta)}{\partial \theta}\times Q_\pi(S,A)\right] \tag{6}$$
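Replacing $Q_\pi(S,A)$ with the advantage function $\tilde A(S,A) = Q_\pi(S,A) - V_\pi(S)$ leaves the expectation unchanged, because the baseline $V_\pi(S)$ does not depend on the action. We therefore obtain:
$$\nabla_\theta J(\theta) \simeq \mathbb E_{S,A}\left[\nabla_\theta \log\pi(A\mid S;\theta)\times \tilde A(S,A)\right] \tag{7}$$
Estimating $\nabla_\theta J(\theta)$ with an empirical average over samples gives:
$$\hat{\mathbf{\vec g}} = \hat{\mathbb E}_t\left[\nabla_\theta \log \pi_\theta(a_t\mid s_t)\,\hat A_t\right] \tag{8}$$
where $\hat A_t$ is an empirical estimate of $\tilde A$.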
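The derivation above can be checked numerically on a toy case. The sketch below assumes a softmax policy over three actions in a single state, with $Q$ held fixed (which mirrors the step that drops the $\partial Q_\pi/\partial\theta$ term). The function names and the numbers are illustrative only; they are not taken from the paper.

```python
import math

def softmax(logits):
    """Softmax policy pi(a | s; theta) for one state; theta are the logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def value(logits, q):
    """V(s) = sum_a pi(a | s; theta) * Q(s, a), with Q treated as a constant."""
    return sum(p * qa for p, qa in zip(softmax(logits), q))

def grad_finite_difference(logits, q, h=1e-6):
    """Central finite differences of V(s) with respect to each logit."""
    grads = []
    for k in range(len(logits)):
        up, down = list(logits), list(logits)
        up[k] += h
        down[k] -= h
        grads.append((value(up, q) - value(down, q)) / (2 * h))
    return grads

def grad_score_function(logits, q):
    """E_{A ~ pi}[ d log pi(A) / d theta_k * Q(A) ] computed exactly.

    For a softmax policy, d log pi(a) / d theta_k = 1[a == k] - pi(k).
    """
    pi = softmax(logits)
    n = len(logits)
    return [
        sum(pi[a] * ((1.0 if a == k else 0.0) - pi[k]) * q[a] for a in range(n))
        for k in range(n)
    ]

logits = [0.2, -0.5, 1.0]   # illustrative policy parameters
q = [1.0, 3.0, -2.0]        # illustrative action values
v = value(logits, q)
advantages = [qa - v for qa in q]
print(grad_finite_difference(logits, q))
print(grad_score_function(logits, q))
print(grad_score_function(logits, advantages))  # same gradient with a baseline
```

All three printed vectors should agree up to finite-difference error, which illustrates both the log-derivative step and the fact that subtracting the baseline $V_\pi(s)$ does not change the expected gradient.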
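The policy gradient estimator $\hat{\mathbf{\vec g}}$ is obtained by differentiating the objective:
$$\mathcal L^{\text{PG}}(\theta) = \hat{\mathbb E}_t\left[\log \pi_\theta(a_t\mid s_t)\,\hat A_t\right] \tag{9}$$
It is appealing to perform multiple optimization steps on this loss $\mathcal L^\text{PG}$ using the same trajectory, but doing so is not well justified. Empirically, it often leads to destructively large policy updates (see the experiments section; the results are not shown, but they were similar to or worse than the "no clipping or penalty" setting).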
Trust region methods: in TRPO, an objective function (the surrogate objective) is maximized subject to a constraint on the size of the policy update. Specifically:
$$\begin{aligned} \max_\theta\quad & \hat{\mathbb E}_t\left[\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_\text{old}}(a_t\mid s_t)}\hat A_t\right]\\ \text{subject to}\quad & \hat{\mathbb E}_t\left[\text{KL}\left[\pi_{\theta_\text{old}}(\cdot\mid s_t),\pi_\theta(\cdot\mid s_t)\right]\right] \le \delta \end{aligned} \tag{10}$$
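where $\theta_\text{old}$ is the vector of policy parameters before the update, KL is the KL divergence, and $\delta$ is a positive hyperparameter.
Here $\theta$ is the variable being optimized and $\theta_\text{old}$ is held fixed. Note that the objective is $\max_\theta \hat{\mathbb E}_t\left[\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_\text{old}}(a_t\mid s_t)}\hat A_t\right]$ rather than the policy gradient objective $\max_\theta \hat{\mathbb E}_t\left[\log\pi_\theta(a_t\mid s_t) \hat A_t\right]$. The difference between the two can be seen from their gradients:
$$\nabla_\theta \log\pi_\theta = \frac{\nabla_\theta \pi_\theta}{\pi_\theta},\qquad \nabla_\theta \frac{\pi_\theta}{\pi_{\theta_\text{old}}} = \frac{\nabla_\theta \pi_\theta}{\pi_{\theta_\text{old}}} \tag{11}$$
The two gradients differ only in the denominator ($\pi_{\theta_\text{old}}$ in place of $\pi_\theta$), so they coincide at $\theta = \theta_\text{old}$.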
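After making a linear approximation to the objective and a quadratic approximation to the constraint, this problem can be solved approximately and efficiently with the conjugate gradient algorithm.
The theory justifying TRPO actually suggests using a penalty instead of a constraint, i.e., solving the unconstrained optimization problem:
$$\max_\theta \hat{\mathbb E}_t\left[\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_\text{old}}(a_t\mid s_t)}\hat A_t - \beta\times \text{KL}\left[\pi_{\theta_\text{old}}(\cdot\mid s_t),\pi_\theta(\cdot\mid s_t)\right]\right] \tag{12}$$
where $\beta$ is the coefficient of the KL penalty.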
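This follows from the fact that a certain surrogate objective, which uses the maximum KL divergence over states $\max[\text{KL}(\cdot,\cdot)]$ instead of the mean KL divergence $\mathbb E_t[\text{KL}(\cdot,\cdot)]$, forms a lower bound (i.e., a pessimistic bound) on the performance of the policy $\pi$.
TRPO uses a hard constraint rather than a penalty because it is hard to choose a single value of $\beta$ that performs well across different problems, or even within a single problem, since the characteristics of the problem change over the course of learning. Hence, to achieve our goal of a first-order algorithm that emulates the monotonic improvement of TRPO, it is not sufficient to simply choose a fixed penalty coefficient $\beta$ and optimize the objective of this unconstrained problem with SGD; additional modifications are required.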
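To make the penalized objective of equation (12) concrete, here is a minimal sketch for categorical (discrete-action) policies. The sample layout, a tuple of the old distribution, the new distribution, the sampled action and the advantage, is an assumption made for illustration and is not prescribed by the paper.

```python
import math

def kl_divergence(p, q):
    """KL[p, q] = sum_a p(a) * log(p(a) / q(a)) for categorical distributions."""
    return sum(pa * math.log(pa / qa) for pa, qa in zip(p, q) if pa > 0.0)

def penalized_surrogate(samples, beta):
    """Empirical mean of ratio * advantage - beta * KL[pi_old, pi_new].

    samples: list of (pi_old, pi_new, action, advantage), where pi_old and
    pi_new are the action distributions at the sampled state (hypothetical
    layout chosen for this sketch).
    """
    total = 0.0
    for pi_old, pi_new, action, advantage in samples:
        ratio = pi_new[action] / pi_old[action]
        total += ratio * advantage - beta * kl_divergence(pi_old, pi_new)
    return total / len(samples)

pi_old = [0.5, 0.5]
pi_new = [0.9, 0.1]
samples = [(pi_old, pi_new, 0, 2.0)]
print(penalized_surrogate(samples, beta=0.0))  # pure surrogate, no penalty
print(penalized_surrogate(samples, beta=1.0))  # lower: the policy moved far
```

A larger $\beta$ makes the same policy change less attractive, which is the sense in which the penalty plays the role of the trust region constraint.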

1.2 PPO

1.2.1 Clipped Surrogate Objective

Let $r_t(\theta)$ denote the probability ratio:
$$r_t(\theta) = \frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_\text{old}}(a_t\mid s_t)} \tag{13}$$
so that $r_t(\theta_\text{old}) = 1$.
TRPO therefore maximizes a surrogate objective:
$$\mathcal L^{\text{CPI}}(\theta) = \hat{\mathbb E}_t\left[\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_\text{old}}(a_t\mid s_t)}\hat A_t\right] = \hat{\mathbb E}_t\left[r_t(\theta)\hat A_t\right] \tag{14}$$
The superscript CPI refers to conservative policy iteration ("Approximately optimal approximate reinforcement learning").
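Without a constraint, maximizing $\mathcal L^\text{CPI}(\theta)$ would lead to an excessively large policy update. We therefore consider how to modify the objective so as to penalize policy updates that move $r_t (\theta)$ away from 1. The main objective we propose is:
$$\begin{aligned} \tilde r_t(\theta) &= \text{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right)\\ \mathcal L^{\text{CLIP}}(\theta) &= \hat{\mathbb E}_t\left[\min\left(r_t(\theta)\hat A_t,\ \tilde r_t(\theta)\hat A_t\right)\right] \end{aligned} \tag{15}$$
where $\epsilon$ is a hyperparameter, for example $\epsilon = 0.2$.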
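The motivation for this objective is as follows. The first term inside the min is $\mathcal L^\text{CPI}(\theta)$. The second term is $\text{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right)\hat A_t$, which modifies the surrogate objective by clipping the probability ratio; in $\mathcal L^\text{CLIP}(\theta)$ this removes the incentive for moving $r_t(\theta)$ outside the interval $[1-\epsilon, 1+\epsilon]$. Finally, we take the minimum of the clipped objective and the unclipped objective, so the final objective is a lower bound (i.e., a pessimistic bound) on the unclipped objective.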
Note:
- To first order around $\theta_\text{old}$ (i.e., where $r_t(\theta) = 1$), $\mathcal L^\text{CLIP}(\theta) = \mathcal L^\text{CPI}(\theta)$.
- As $\theta$ moves away from $\theta_\text{old}$, $\mathcal L^\text{CLIP}(\theta)$ and $\mathcal L^\text{CPI}(\theta)$ become different.
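Figure 1 plots a single term of $\mathcal L^\text{CLIP}(\theta)$ (i.e., a single timestep $t$) as a function of $r$. Note that the probability ratio $r$ is clipped at $1-\epsilon$ or $1+\epsilon$ depending on whether the advantage $A$ (here the advantage function, not the action) is positive or negative.
Figure 2 provides another source of intuition about the surrogate objective $\mathcal L^\text{CLIP}(\theta)$. It shows how several objectives vary as we interpolate along the policy update direction, which is obtained by proximal policy optimization on a continuous control problem. We can see that $\mathcal L^\text{CLIP}(\theta)$ is a lower bound on $\mathcal L^\text{CPI}(\theta)$, with a penalty for making the policy update too large.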
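The behaviour of a single term of equation (15) is easy to tabulate. The following is a minimal sketch in plain Python; the function names `clipped_term` and `l_clip` are chosen here for illustration.

```python
def clip(x, lo, hi):
    """Clamp x to the interval [lo, hi]."""
    return max(lo, min(x, hi))

def clipped_term(ratio, advantage, epsilon=0.2):
    """One term of L^CLIP: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    unclipped = ratio * advantage
    clipped = clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    return min(unclipped, clipped)

def l_clip(ratios, advantages, epsilon=0.2):
    """Empirical average of the clipped terms over a batch."""
    terms = [clipped_term(r, a, epsilon) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)

for advantage in (1.0, -1.0):
    for ratio in (0.5, 1.0, 1.5):
        print(advantage, ratio, clipped_term(ratio, advantage))
```

With a positive advantage the term stops growing once $r$ exceeds $1+\epsilon$; with a negative advantage it stops improving once $r$ falls below $1-\epsilon$. In the other direction the unclipped value is kept, so each term never exceeds $r\hat A$.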

1.2.2 Adaptive KL Penalty Coefficient

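Another approach, which can be used as an alternative to the clipped surrogate objective or in addition to it, is to use a penalty on the KL divergence and to adapt the penalty coefficient so that each policy update achieves some target value $d_\text{targ}$ of the KL divergence. In our experiments, we found that the KL penalty performed worse than the clipped surrogate objective. However, we include it here because it is an important baseline.
In the simplest implementation of this algorithm, we perform the following steps in each policy update: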
- Optimize the KL-penalized objective using several epochs of minibatch SGD:
  $$\mathcal L^{\text{KLPEN}}(\theta) = \hat{\mathbb E}_t\left[\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_\text{old}}(a_t\mid s_t)}\hat A_t - \beta\times \text{KL}\left[\pi_{\theta_\text{old}}(\cdot\mid s_t),\pi_\theta(\cdot\mid s_t)\right]\right] \tag{16}$$
- Compute $d = \hat{\mathbb E}_t\left[\text{KL}[\pi_{\theta_\text{old}}(\cdot\mid s_t),\pi_\theta(\cdot\mid s_t)]\right]$:
  - If $d\lt d_\text{targ}/1.5$, then $\beta\leftarrow \beta/2$.
  - If $d\gt d_\text{targ}\times 1.5$, then $\beta\leftarrow \beta\times 2$.
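That is, $\beta$ is decreased when the KL divergence falls well below $d_\text{targ}$, and $\beta$ is increased when the KL divergence rises well above $d_\text{targ}$. The updated $\beta$ is used for the next policy update.
The hyperparameters 1.5 and 2 above are chosen heuristically, but the algorithm is not very sensitive to them. The initial value of $\beta$ is another hyperparameter, but it is not important in practice because the algorithm quickly adjusts $\beta$.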
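The coefficient update in the steps above amounts to a few lines of code. This is a minimal sketch; the function name `update_beta` is an illustrative choice, and the constants 1.5 and 2 are the heuristic values given in the text.

```python
def update_beta(beta, d, d_targ):
    """Adapt the KL penalty coefficient after one policy update.

    d is the measured mean KL divergence between the old and new policy,
    d_targ is the target KL divergence.
    """
    if d < d_targ / 1.5:
        return beta / 2.0   # policy moved too little: weaken the penalty
    if d > d_targ * 1.5:
        return beta * 2.0   # policy moved too much: strengthen the penalty
    return beta             # within the tolerance band: keep beta

beta = 1.0
for measured_kl in (0.001, 0.05, 0.01):   # illustrative measurements
    beta = update_beta(beta, measured_kl, d_targ=0.01)
    print(beta)
```

The returned value would then be used as $\beta$ in the next policy update.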

1.2.3 PPO

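The surrogate losses from the previous sections can be computed and differentiated with a minor change to a typical policy gradient implementation: one simply constructs the loss $\mathcal L^\text{CLIP}(\theta)$ or $\mathcal L^\text{KLPEN}(\theta)$ instead of $\mathcal L^\text{PG}(\theta)$, and performs multiple steps of stochastic gradient ascent on this new objective.
Most techniques for computing variance-reduced advantage-function estimators make use of a learned state-value function $V(s)$, for example generalized advantage estimation ("High-dimensional continuous control using generalized advantage estimation") or the finite-horizon estimators in "Asynchronous methods for deep reinforcement learning". If we use a neural network architecture that shares parameters between the policy and the value function, we must use a loss function that combines the policy surrogate and a value function error term. This objective can be further augmented by adding an entropy bonus to ensure sufficient exploration, as suggested in past work ("Asynchronous methods for deep reinforcement learning" and "Simple statistical gradient-following algorithms for connectionist reinforcement learning"). Combining these terms, we obtain the following objective, which is (approximately) maximized in each iteration: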
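$$\mathcal L_t^{\text{CLIP+VF+S}}(\theta) = \hat{\mathbb E}_t\Big[\underbrace{\min\left(r_t(\theta)\hat A_t,\ \tilde r_t(\theta)\hat A_t\right)}_{\mathcal L^{\text{CLIP}}} - c_1\times \underbrace{\left(V_\theta(s_t) - V_t^{\text{targ}}\right)^2}_{\mathcal L^{\text{VF}}} + c_2\, S[\pi_\theta](s_t)\Big] \tag{17}$$
Here $\pi_\theta(\cdot)$ and $V_\theta(\cdot)$ share the parameters $\theta$, and $V_t^\text{targ}$ is the target value at time $t$ (for example, from a separately trained value network).
where:
- $c_1,c_2$ are hyperparameters that act as coefficients.
- $\mathcal L^\text{CLIP}$ is the clipped surrogate objective, $\mathcal L^\text{VF}$ is a squared-error value loss, and $S$ is the entropy bonus (which encourages the distribution $\pi_\theta(\cdot\mid s)$ to be as spread out as possible):
  $S[\pi_\theta](s_t) = \sum_{a\in \mathcal A} -\pi_\theta(a\mid s_t)\times \log \pi_\theta(a\mid s_t)$.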
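A per-batch evaluation of equation (17) for a discrete action space might look like the sketch below. The dictionary keys and the default values `c1=0.5` and `c2=0.01` are placeholders chosen for illustration, not values taken from this section.

```python
import math

def entropy(probs):
    """S[pi](s) = -sum_a pi(a | s) * log pi(a | s)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def combined_objective(batch, epsilon=0.2, c1=0.5, c2=0.01):
    """Empirical mean of L^CLIP - c1 * L^VF + c2 * S over a batch.

    Each sample is a dict with hypothetical keys: 'ratio', 'advantage',
    'value', 'value_target', and 'action_probs'.
    """
    total = 0.0
    for sample in batch:
        ratio, adv = sample["ratio"], sample["advantage"]
        clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
        l_clip = min(ratio * adv, clipped_ratio * adv)
        l_vf = (sample["value"] - sample["value_target"]) ** 2
        bonus = entropy(sample["action_probs"])
        total += l_clip - c1 * l_vf + c2 * bonus
    return total / len(batch)

uniform = [0.25, 0.25, 0.25, 0.25]
batch = [{"ratio": 1.0, "advantage": 1.0, "value": 2.0,
          "value_target": 2.0, "action_probs": uniform}]
print(combined_objective(batch))
```

The objective is maximized, so the value error enters with a minus sign and the entropy bonus with a plus sign.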
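One style of policy gradient implementation, popularized in "Asynchronous methods for deep reinforcement learning" and well suited for use with RNNs, runs the policy for $T$ timesteps (where $T$ is much smaller than the episode length) and uses the collected samples for an update. This period is called a trajectory segment. This style requires an advantage estimator that does not look beyond timestep $T$. The estimator used in "Asynchronous methods for deep reinforcement learning" is:
$$\hat A_t = -V(s_t) + \breve r_t + \gamma \breve r_{t+1} + \cdots + \gamma^{T-t-1}\breve r_{T-1} + \gamma^{T-t}V(s_T) \tag{18}$$
where $\breve r_t$ is the reward at timestep $t$, and $t$ is a time index in [0, T] within a given trajectory segment of length $T$.
The sum $\breve r_t + \gamma \breve r_{t+1} +\cdots+\gamma^{T-t-1}\breve r_{T-1} + \gamma^{T-t}V(s_T)$ is an estimate of $Q(s_t,a_t)$, since the action at time $t$ leads to the subsequent sequence of rewards.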
Generalizing this choice, we can use a truncated version of generalized advantage estimation:
$$\hat A_t = \delta_t + (\gamma\lambda)\delta_{t+1} + \cdots + (\gamma\lambda)^{T-t-1}\delta_{T-1} \tag{19}$$
where $\delta_t = \breve r_t + \gamma V(s_{t+1}) - V(s_t)$.
When $\lambda=1$, this generalized advantage estimation reduces to the original form in equation (18).
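The truncated estimator of equation (19) can be computed with a single backward pass over the segment. The sketch below assumes `rewards` holds $\breve r_0,\dots,\breve r_{T-1}$ and `values` holds $V(s_0),\dots,V(s_T)$; the default `gamma` and `lam` values are illustrative placeholders. A direct implementation of the segment estimator is included so the $\lambda = 1$ case can be compared.

```python
def truncated_gae(rewards, values, gamma=0.99, lam=0.95):
    """A_t = delta_t + (gamma*lam) * delta_{t+1} + ... up to delta_{T-1}."""
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running   # backward recursion
        advantages[t] = running
    return advantages

def segment_advantage(rewards, values, gamma=0.99):
    """A_t = -V(s_t) + discounted rewards to T-1 + gamma^(T-t) * V(s_T)."""
    T = len(rewards)
    out = []
    for t in range(T):
        ret = sum(gamma ** (k - t) * rewards[k] for k in range(t, T))
        ret += gamma ** (T - t) * values[T]
        out.append(ret - values[t])
    return out

rewards = [1.0, 0.0, 2.0, -1.0]          # illustrative rewards, T = 4
values = [0.5, 0.4, 1.0, -0.2, 0.3]      # illustrative V(s_0) .. V(s_4)
print(truncated_gae(rewards, values, lam=1.0))
print(segment_advantage(rewards, values))
```

With `lam=1.0` the intermediate value terms telescope, so both functions should print the same numbers up to floating-point rounding.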
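A proximal policy optimization (PPO) algorithm that uses trajectory segments of fixed length $T$ is shown below. In each iteration, each of $N$ (parallel) actors collects T timesteps of data. We then construct the surrogate loss on these NT timesteps of data and optimize it with minibatch SGD (or, usually for better performance, Adam) for $K$ epochs.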
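PPO, Actor-Critic Style:
- Input:
  - The initial policy $\pi_{\theta_\text{old}}$.
  - The number of iterations $O$, the number of actors $N$, the trajectory segment length $T$, the number of epochs $K$, and the minibatch size $M$.
- Output: the policy $\pi_\theta$.
- Algorithm steps:
  - Outer loop: iteration = 1, 2, ..., O:
    - Inner loop: actor = 1, 2, ..., N:
      - Run the policy $\pi_{\theta_\text{old}}$ in the environment for T timesteps.
      - Compute the advantage estimates $\hat A_1,\cdots,\hat A_T$.
    - Optimize the surrogate loss with respect to $\theta$ for $K$ epochs, with minibatch size $M\le NT$.
    - Set $\theta_\text{old}\leftarrow \theta$.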
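The loop structure above can be exercised end to end on a toy problem. The sketch below is not the paper's implementation: it assumes a hypothetical one-state, two-armed bandit, a softmax policy whose gradient is written out by hand, a mean-reward baseline as the advantage estimate, and plain gradient ascent instead of Adam. All names and hyperparameter values are illustrative.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def ppo_bandit(iterations=10, n_actors=4, horizon=16, epochs=4,
               minibatch_size=16, epsilon=0.2, lr=0.5, seed=0):
    rng = random.Random(seed)
    arm_rewards = [0.0, 1.0]      # toy environment: arm 1 pays more
    theta = [0.0, 0.0]            # policy logits
    for _ in range(iterations):
        pi_old = softmax(theta)   # snapshot of the policy before the update
        # N actors each collect T timesteps with pi_old (pooled here).
        actions = [rng.choices([0, 1], weights=pi_old)[0]
                   for _ in range(n_actors * horizon)]
        rewards = [arm_rewards[a] for a in actions]
        baseline = sum(rewards) / len(rewards)
        data = [(a, r - baseline) for a, r in zip(actions, rewards)]
        # K epochs of minibatch gradient ascent on L^CLIP.
        for _ in range(epochs):
            rng.shuffle(data)
            for start in range(0, len(data), minibatch_size):
                batch = data[start:start + minibatch_size]
                pi = softmax(theta)
                grad = [0.0, 0.0]
                for a, adv in batch:
                    ratio = pi[a] / pi_old[a]
                    clipped = max(1 - epsilon, min(ratio, 1 + epsilon))
                    # The gradient is nonzero only when the unclipped term
                    # is the one selected by the min.
                    if ratio * adv <= clipped * adv:
                        for k in range(2):
                            indicator = 1.0 if k == a else 0.0
                            grad[k] += adv * ratio * (indicator - pi[k]) / len(batch)
                for k in range(2):
                    theta[k] += lr * grad[k]
    return softmax(theta)

print(ppo_bandit())
```

In this toy setting the probability of the better arm should rise across iterations, and within one iteration the updates stall once the ratios leave $[1-\epsilon, 1+\epsilon]$, which is the clipping behaviour the algorithm relies on.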

1.3 Experiments

1.3.1 Comparison of Surrogate Objectives

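First, we compare several different surrogate objectives under different hyperparameters:
$$\begin{aligned} \text{No clipping or penalty}: &\quad \mathcal L_t(\theta) = \hat{\mathbb E}_t\left[r_t(\theta)\hat A_t\right]\\ \text{Clipping}: &\quad \mathcal L_t(\theta) = \hat{\mathbb E}_t\left[\min\left(r_t(\theta)\hat A_t,\ \tilde r_t(\theta)\hat A_t\right)\right]\\ \text{KL penalty}: &\quad \mathcal L_t(\theta) = \hat{\mathbb E}_t\left[r_t(\theta)\hat A_t - \beta\times \text{KL}\left[\pi_{\theta_\text{old}},\pi_\theta\right]\right] \end{aligned} \tag{20}$$
- For the KL penalty, one can use either a fixed penalty coefficient $\beta$ or an adaptive coefficient with target KL value $d_\text{targ}$.
- We also tried clipping in log space (rather than in linear space), but found that the performance was no better.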
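Because we are searching over hyperparameters for each algorithm, we chose a computationally cheap benchmark on which to test the algorithms. Namely, we used 7 simulated robotics tasks implemented in OpenAI Gym, which use the MuJoCo physics engine, and we train for 1M timesteps on each task. The hyperparameters used for clipping ($\epsilon$) and for the KL penalty ($\beta, d_\text{targ}$) are found by our hyperparameter search; the other hyperparameters are provided in the accompanying table.
To represent the policy, we used a fully connected MLP with two hidden layers of 64 units and tanh nonlinearities, outputting the mean of a Gaussian distribution with variable standard deviations. We do not share parameters between the policy and the value function (so the coefficient $c_1$ is irrelevant), and we do not use an entropy bonus.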
Equivalently, $c_1 = 0$ in the policy objective, and the value function $V_\theta(s_t)$ is trained separately.
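Each algorithm was run on all 7 environments, with 3 random seeds per environment. We scored each run of an algorithm by computing the average total reward of the last 100 episodes. We shifted and scaled the scores for each environment so that the random policy gives a score of 0 and the best policy gives a score of 1, and averaged over the 21 runs to produce a single scalar for each algorithm setting.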
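The shift-and-scale scoring described above can be sketched as follows. The numbers are made up purely to illustrate the arithmetic; they are not results from the paper, and the function names are hypothetical.

```python
def normalize_score(score, random_score, best_score):
    """Shift and scale so the random policy maps to 0 and the best to 1."""
    return (score - random_score) / (best_score - random_score)

def aggregate(runs):
    """Average the normalized scores over all (environment, seed) runs.

    runs: list of (score, random_score, best_score) tuples, one per run.
    """
    normalized = [normalize_score(s, lo, hi) for s, lo, hi in runs]
    return sum(normalized) / len(normalized)

# Two made-up environments with two seeds each (the text uses 7 x 3 = 21 runs).
runs = [
    (150.0, 100.0, 200.0),    # halfway between random and best -> 0.5
    (200.0, 100.0, 200.0),    # matches the best policy -> 1.0
    (-50.0, 0.0, 100.0),      # worse than random -> -0.5
    (100.0, 0.0, 100.0),      # matches the best policy -> 1.0
]
print(aggregate(runs))
```

A run that ends up worse than the random policy contributes a negative normalized score, which is how a setting's average can fall below zero.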
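The results are shown in the accompanying table. Note that the score is negative for the "No clipping or penalty" setting, because on one environment (half cheetah) it leads to a very negative score, which is worse than the initial random policy.
Clipping performs best among the three.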

1.3.2 Comparison to Other Algorithms in the Continuous Domain

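Next, we compare PPO (with the clipped surrogate objective) to several other methods from the literature that are considered to be effective for continuous problems. We compared against tuned implementations of the following algorithms:
- Trust region policy optimization ("Trust region policy optimization").
- Cross-entropy method (CEM) ("Learning Tetris using the noisy cross-entropy method").
- Vanilla policy gradient with adaptive stepsize.
- A2C ("Asynchronous methods for deep reinforcement learning").
- A2C with trust region ("Sample Efficient Actor-Critic with Experience Replay").
A2C stands for advantage actor critic. It is a synchronous version of A3C, which we found to perform the same as or better than the asynchronous version (i.e., A3C). For PPO, we used the hyperparameters from the previous section, with $\epsilon = 0.2$. We see that PPO outperforms the previous methods on almost all of the continuous control environments.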

1.3.3 Showcase in the Continuous Domain

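To showcase the performance of PPO on high-dimensional continuous control problems, we train on a set of problems involving a 3D humanoid, where the robot must run, steer, and get up off the ground, possibly while being pelted by cubes. The three tasks we test on are:
- RoboschoolHumanoid: forward locomotion only.
- RoboschoolHumanoidFlagrun: the position of the target is randomly varied every 200 timesteps, or whenever the goal is reached.
- RoboschoolHumanoid-FlagrunHarder: the robot is pelted by cubes and needs to get up off the ground.
Figure 5 shows still frames of a learned policy, and Figure 4 shows the learning curves on the three tasks. The hyperparameters are provided in the accompanying table. In concurrent work, "Emergence of Locomotion Behaviours in Rich Environments" used the adaptive KL variant of PPO to learn locomotion policies for 3D robots.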

1.3.4 Comparison to Other Algorithms on the Atari Domain

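We also ran PPO on the Arcade Learning Environment benchmark and compared it against well-tuned implementations of A2C ("Asynchronous methods for deep reinforcement learning") and ACER ("Sample Efficient Actor-Critic with Experience Replay"). For all three algorithms, we used the same policy network architecture as in "Asynchronous methods for deep reinforcement learning". The hyperparameters for PPO are provided in the accompanying table. For the other two algorithms, we used hyperparameters that were tuned to maximize performance on this benchmark.
The accompanying table and figure provide the results and learning curves for all 49 games.
We consider the following two scoring metrics:
- Average reward per episode over the entire training period (which favors fast learning).
- Average reward per episode over the last 100 episodes of training (which favors final performance).
The accompanying table shows the number of games "won" by each algorithm, where we compute the winner by averaging the scoring metric across three trials.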