SQIL: Imitation Learning via Reinforcement Learning with Sparse Rewards

https://openreview.net/forum?id=S1xKd24twB

Basic principle: do RL with a constant reward: give reward +1 to the demonstrated (state, action) transitions and reward 0 to everything the agent collects itself. This encourages the agent to return to demonstrated states and take the expert's actions there.

SQIL = “soft Q imitation learning”

From the abstract: "Theoretically, we show that SQIL can be interpreted as a regularized variant of BC that uses a sparsity prior to encourage long-horizon imitation."

This is simpler than Generative Adversarial Imitation Learning (GAIL): there is no adversarial training at all. Like SQIL, adversarial methods also effectively encourage the agent to return to demonstrated states.

They do soft Q-learning: initialize the replay buffer with the expert demonstrations labeled with reward +1, and label every new transition the agent collects with reward 0. This is fine because soft Q-learning is off-policy, and it applies to stochastic environments and continuous state spaces.

Each training batch is then 50% expert demonstrations and 50% the agent's own interactions.
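A minimal sketch of how I picture the training loop for discrete actions (PyTorch; the buffer classes, network, and hyperparameters are placeholders I made up, not the authors' code):

```python
import torch
import torch.nn.functional as F

GAMMA = 0.99  # discount factor (placeholder value)

def soft_bellman_loss(q_net, batch, reward):
    """Squared soft Bellman error with a constant reward (1 for demos, 0 for agent data)."""
    s, a, s_next, done = batch  # tensors: states, actions, next states, done flags
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q(s, a)
    with torch.no_grad():
        v_next = torch.logsumexp(q_net(s_next), dim=1)      # soft value V(s') = log sum_a' exp Q(s', a')
        target = reward + GAMMA * (1.0 - done) * v_next
    return F.mse_loss(q_sa, target)

def sqil_update(q_net, optimizer, demo_buffer, agent_buffer, batch_size=64):
    """One SQIL gradient step: half demo transitions (r=1), half agent transitions (r=0)."""
    demo_batch = demo_buffer.sample(batch_size // 2)    # hypothetical buffer API
    agent_batch = agent_buffer.sample(batch_size // 2)
    loss = soft_bellman_loss(q_net, demo_batch, reward=1.0) \
         + soft_bellman_loss(q_net, agent_batch, reward=0.0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Acting is then just sampling from the soft-max policy \(\pi(a \mid s) \propto \exp Q_\theta(s,a)\) and appending those transitions to the agent buffer with reward 0.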

This uses the “soft Bellman error”. Let's first read Reinforcement Learning with Deep Energy-Based Policies.
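For reference, the soft Bellman backup as I understand it from that paper (discrete actions, temperature fixed to 1):

\[
V_\theta(s) = \log \sum_{a} \exp Q_\theta(s,a), \qquad
\delta^2(s,a,s') = \Bigl( Q_\theta(s,a) - \bigl( r + \gamma\, V_\theta(s') \bigr) \Bigr)^2 .
\]

SQIL minimizes this squared soft Bellman error with the reward \(r\) hard-coded: 1 on demonstration transitions, 0 on the agent's own transitions.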

I get the main idea of soft Q-learning, but the sampling machinery for continuous action spaces is still confusing to me.

SQIL is similar to regularized behavior cloning with an added L1 sparsity penalty \(\sum_{s,a} \left| Q_\theta(s,a) - \gamma\,\mathbb{E}_{s'}\!\left[\log\sum_{a'} \exp Q_\theta(s',a')\right] \right|\), i.e., the soft Bellman error with reward 0. Why does an L1 penalty encourage sparsity? As with lasso, the L1 norm drives many of the individual terms, here the implied rewards, exactly to zero. The penalty also injects information about the transition dynamics, since it is evaluated on sampled transitions. If we square the terms of the L1 penalty (to make it differentiable), it becomes the squared soft Bellman error.
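My rough reconstruction of the regularized-BC objective this refers to (treat the exact sums and constants as my assumption; the soft-max policy is the standard soft Q-learning one):

\[
\ell(\theta) \approx -\sum_{(s,a) \in \mathcal{D}_{\text{demo}}} \log \pi_\theta(a \mid s)
\;+\; \lambda \sum_{s,a} \left| Q_\theta(s,a) - \gamma\,\mathbb{E}_{s'}\!\left[\log\sum_{a'} \exp Q_\theta(s',a')\right] \right|,
\qquad
\pi_\theta(a \mid s) = \frac{\exp Q_\theta(s,a)}{\sum_{a'} \exp Q_\theta(s,a')}.
\]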

You can use an off-policy actor-critic method (like Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor) in continuous action spaces. 
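Either way, the SQIL-specific part is tiny: as I understand it, you keep the base algorithm's updates and only relabel rewards when transitions enter the replay buffer. A minimal sketch (the buffer API here is hypothetical):

```python
def add_transition(replay_buffer, state, action, next_state, done, is_demo):
    """SQIL reward relabeling: the environment reward is ignored entirely."""
    reward = 1.0 if is_demo else 0.0  # +1 for demonstration transitions, 0 for the agent's own
    replay_buffer.add(state, action, reward, next_state, done)  # hypothetical buffer API
```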

The Generative Adversarial Imitation Learning paper also trains with rewards, but the rewards are learned by a discriminator rather than being constant 0/1. Discriminator-Actor-Critic: Addressing Sample Inefficiency and Reward Bias in Adversarial Imitation Learning addresses a reward bias in GAIL.

There were 2 other methods that use constant instead of learned rewards: Sample-Efficient Imitation Learning for Continuous Control, and Random Expert Distillation: Imitation Learning via Expert Policy Support Estimation.

Possible extension for the future: try to also recover the expert's reward function.