Prioritized Experience Replay

Keeps an experience replay buffer in which transitions are prioritized by the magnitude of their TD error (as measured at the last update of that transition). Neuroscience studies suggest that the hippocampus of rodents performs experience replay (during awake rest or sleep), and replays sequences associated with rewards or high TD error more frequently. This paper is about how to use a given memory buffer most effectively.

“Greedy TD error prioritization”: stores the last encountered TD error along with each transition in replay memory, and always replays the transition with the largest TD error (applying a Q-learning update to it). New transitions are given maximum priority so they are replayed at least once. But this means transitions with a low TD error may not be replayed for a long time, and the approach is sensitive to noise (e.g., with stochastic rewards).
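A minimal sketch of the greedy variant, assuming a hypothetical `q_update(transition)` that applies a Q-learning step and returns the new TD error; this is just an illustration of the greedy scheme described above, not the paper's final algorithm.

```python
import heapq
import itertools

MAX_PRIORITY = float("inf")   # new transitions get maximum priority
counter = itertools.count()   # tie-breaker so heapq never compares transitions
heap = []                     # max-heap via negated |TD error|

def add_transition(transition):
    # Unseen transitions are pushed with maximum priority so that each one
    # is replayed at least once.
    heapq.heappush(heap, (-MAX_PRIORITY, next(counter), transition))

def replay_once(q_update):
    # Always replay the transition with the largest stored |TD error|,
    # then re-insert it with the freshly computed error.
    _, _, transition = heapq.heappop(heap)
    td_error = q_update(transition)
    heapq.heappush(heap, (-abs(td_error), next(counter), transition))
```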

To avoid expensive sweeps over the entire replay memory, TD errors are only updated for the transitions that are replayed.

Experiences are sampled with probability \(P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}\), where \(\alpha\) is a hyperparameter controlling how strongly to prioritize (\(\alpha = 0\) recovers uniform sampling).
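A sketch of this stochastic sampling step, assuming priorities are kept in a plain numpy array (the paper uses more efficient data structures, e.g. a sum-tree) and that the new TD errors for the sampled batch come from some hypothetical training step:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_batch(priorities, batch_size, alpha):
    # P(i) = p_i^alpha / sum_k p_k^alpha
    scaled = priorities ** alpha
    probs = scaled / scaled.sum()
    idx = rng.choice(len(priorities), size=batch_size, p=probs)
    return idx, probs[idx]

def update_priorities(priorities, idx, td_errors, eps=1e-6):
    # Only the replayed transitions get their stored priority refreshed,
    # avoiding a sweep over the whole buffer.
    priorities[idx] = np.abs(td_errors) + eps
```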

Direct (proportional) prioritization: \(p_i = |\delta_i| + \varepsilon\), where \(\varepsilon\) is a small positive constant that keeps transitions with zero TD error from never being revisited.

Rank-based prioritization: \(p_i = \frac{1}{\mathrm{rank}(i)}\), ranked by \(|\delta_i|\); should be less sensitive to outliers.
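The two priority definitions side by side, assuming `td_errors` holds the last-seen TD error of every stored transition:

```python
import numpy as np

def proportional_priorities(td_errors, eps=1e-6):
    # p_i = |delta_i| + eps
    return np.abs(td_errors) + eps

def rank_based_priorities(td_errors):
    # p_i = 1 / rank(i), where rank 1 is the largest |delta_i|
    order = np.argsort(-np.abs(td_errors))   # indices sorted by decreasing |delta|
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(td_errors) + 1)
    return 1.0 / ranks
```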

Prioritized experience replay changes the distribution we draw experience from away from the distribution the updates implicitly expect, which biases the value estimates. To correct for this, each update is weighted by an importance-sampling weight \(w_i = \left(\frac{1}{N}\cdot\frac{1}{P(i)}\right)^\beta\), normalized by \(\frac{1}{\max_i w_i}\). The exponent \(\beta\in[0,1]\) is linearly annealed from an initial value \(\beta_0\) to 1 over training, so the correction is full only by the end.
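A sketch of the importance-sampling correction, assuming `probs` are the sampling probabilities \(P(i)\) of the drawn batch, `buffer_size` is \(N\), and `beta0=0.4` is just an example starting value:

```python
import numpy as np

def is_weights(probs, buffer_size, beta):
    # w_i = (1 / (N * P(i)))^beta, normalized by the max weight so that
    # the correction only scales updates down.
    w = (buffer_size * probs) ** (-beta)
    return w / w.max()

def annealed_beta(step, total_steps, beta0=0.4):
    # Linear schedule from beta0 to 1 over the course of training.
    return min(1.0, beta0 + (1.0 - beta0) * step / total_steps)
```

The resulting weights multiply the TD errors (or the loss terms) of the sampled batch before the gradient step.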

One potential extension: when we see a large TD error, it might also be good to replay the experiences that led to that state, so their Q estimates get updated as well.