Proximal Policy Optimization Algorithms

The paper introducing proximal policy optimization (PPO).

Alternates between sampling data from the environment and optimizing a “surrogate” objective with SGD.

“PPO has some of the benefits of Trust Region Policy Optimization (algorithm), but it's simpler to implement, more general and has better sample complexity”

“Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates”

Cites “Trust Region Policy Optimization (paper) / natural policy gradient methods”.

DQN works well on discrete action spaces. Vanilla policy gradient has poor data efficiency. Trust Region Policy Optimization (paper) is complicated.

Proximal policy optimization only uses first-order optimization. (So I guess Trust Region Policy Optimization (paper) uses higher-order optimization?)

We propose a novel objective with clipped probability ratios, which forms a pessimistic estimate (i.e., lower bound) of the performance of the policy. To optimize policies, we alternate between sampling data from the policy and performing several epochs of optimization on the sampled data.

On Atari, PPO is better than A2C and similar to Sample Efficient Actor-Critic with Experience Replay but simpler.

The usual gradient estimator for policy gradient methods is \(\mathop{\mathbb{E}}[\nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot \hat{A}_t]\), so we can differentiate the pseudo-loss \(\mathop{\mathbb{E}} [\log\pi_\theta(a_t \mid s_t) \cdot \hat{A}_t]\). It's appealing to do a few steps of optimization on this objective, but it's not well-justified.
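
A minimal sketch of that pseudo-loss in PyTorch, assuming the log-probabilities and advantage estimates are already computed (the function and variable names are mine, not the paper's):

```python
import torch

def pg_pseudo_loss(log_probs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """Score-function pseudo-loss whose gradient is the usual policy gradient estimator.

    log_probs:  log pi_theta(a_t | s_t) for the sampled actions (requires grad)
    advantages: advantage estimates A_hat_t (treated as constants)
    """
    # Negate because optimizers minimize; detach advantages so no gradient flows through them.
    return -(log_probs * advantages.detach()).mean()
```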

In Trust Region Policy Optimization (paper), we optimize a surrogate objective \(\mathop{\mathbb{E}}\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)} \cdot \hat{A}_t\right]\), subject to a constraint on the KL divergence between the old and new policies. If we approximate the objective and the constraint, the problem can be solved with the conjugate gradient algorithm. The surrogate objective comes from Approximately Optimal Approximate Reinforcement Learning.
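
For reference, the constrained problem from the paper's background section, with \(\delta\) the KL limit:

\[
\max_\theta \; \mathbb{E}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)} \, \hat{A}_t\right]
\quad \text{subject to} \quad
\mathbb{E}_t\!\left[\mathrm{KL}\!\left(\pi_{\theta_\text{old}}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t)\right)\right] \le \delta.
\]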

Learn conjugate gradient algorithm

PPO uses a “clipped surrogate objective”. Denote by \(r_t(\theta)\) the probability ratio \(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}\) (so \(r_t(\theta_\text{old}) = 1\)). TRPO maximizes the surrogate objective \(\mathop{\mathbb{E}}[r_t(\theta) \cdot \hat{A}_t]\). Without a constraint, this would lead to an excessively large policy update, so proximal policy optimization penalizes changes to the policy that move \(r_t(\theta)\) far from 1.

The PPO loss is: \(\mathop{\mathbb{E}} \left[ \min\{ r_t(\theta) \cdot \hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1-\varepsilon, 1+\varepsilon) \cdot \hat{A}_t \} \right]\), where \(\varepsilon\approx 0.2\). The \(\min\) makes this a pessimistic objective - a lower bound on the unclipped objective. We ignore the change in the probability ratio when it would make the objective improve, and include it when it makes the objective worse.
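
A minimal sketch of this clipped objective, written as a loss to minimize (PyTorch; names are mine, not the paper's):

```python
import torch

def ppo_clipped_loss(new_log_probs: torch.Tensor,
                     old_log_probs: torch.Tensor,
                     advantages: torch.Tensor,
                     eps: float = 0.2) -> torch.Tensor:
    """Negative clipped surrogate objective, so that minimizing it maximizes the objective."""
    # Probability ratio r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t), via log-probs.
    ratio = torch.exp(new_log_probs - old_log_probs.detach())
    adv = advantages.detach()
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    # Elementwise min gives the pessimistic lower bound described above.
    return -torch.min(unclipped, clipped).mean()
```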

In each iteration, PPO runs the policy for a number of timesteps, computes advantage estimates, then optimizes a combination of (see the sketch after this list):

  • Clipped loss
  • Bonus for entropy in policy
  • Squared-error loss of the value estimator
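
A sketch of how the three terms might be combined into a single loss, reusing ppo_clipped_loss from the sketch above; the coefficients here are placeholders, not the paper's values:

```python
import torch

def ppo_total_loss(new_log_probs, old_log_probs, advantages,
                   value_pred, value_target, entropy,
                   eps=0.2, c_value=0.5, c_entropy=0.01):
    """Combined loss: clipped surrogate + value-function loss - entropy bonus."""
    policy_loss = ppo_clipped_loss(new_log_probs, old_log_probs, advantages, eps)
    # Squared-error loss of the value estimator against its regression target.
    value_loss = (value_pred - value_target.detach()).pow(2).mean()
    # Entropy enters with a minus sign: minimizing the loss maximizes the entropy bonus.
    return policy_loss + c_value * value_loss - c_entropy * entropy.mean()
```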

Used 1M timesteps of training on MuJoCo tasks. The policy was a tanh MLP with two hidden layers of 64 units, outputting the mean of a Gaussian distribution.
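
A sketch of that policy architecture in PyTorch; the state-independent log-std parameterization is my assumption about the details:

```python
import torch
import torch.nn as nn

class GaussianMLPPolicy(nn.Module):
    """Tanh MLP with two hidden layers of 64 units; outputs the mean of a Gaussian."""

    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, act_dim),
        )
        # State-independent log standard deviations (assumption about the parameterization).
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs: torch.Tensor) -> torch.distributions.Normal:
        mean = self.mean_net(obs)
        return torch.distributions.Normal(mean, self.log_std.exp())
```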

Environments were: HalfCheetah, Hopper, InvertedDoublePendulum, InvertedPendulum, Reacher, Swimmer, and Walker2d, all “-v1”

Would be good to now go and read Trust Region Policy Optimization (paper) because it might have more to say about the “surrogate objective”.

2022-02-06 22:26 re-read after Trust Region Policy Optimization (paper), makes more sense.