SQIL: Imitation Learning via Reinforcement Learning with Sparse Rewards

https://openreview.net/forum?id=S1xKd24twB

Basic principle: do RL with a constant reward: give reward +1 to the demonstrated (state, action) transitions and reward 0 to everything the agent collects itself. This encourages the agent to return to demonstrated states and take the expert's actions there.

SQIL = “soft Q imitation learning”

From the abstract: "Theoretically, we show that SQIL can be interpreted as a regularized variant of BC that uses a sparsity prior to encourage long-horizon imitation."

This is simpler than Generative Adversarial Imitation Learning (GAIL): there is no adversarial training at all. Like SQIL, adversarial methods also effectively encourage the agent to return to demonstrated states.

They do soft Q-learning: initialize the replay buffer with the expert demonstrations labeled with reward +1, and label every new transition the agent collects with reward 0. This is fine because soft Q-learning is off-policy, and it applies to stochastic environments and continuous state spaces.

Each training batch is then 50% expert demonstrations and 50% the agent's own interactions.
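A minimal sketch of how I picture the training loop for discrete actions (PyTorch; the buffer classes, network, and hyperparameters are placeholders I made up, not the authors' code):

```python
import torch
import torch.nn.functional as F

GAMMA = 0.99  # discount factor (placeholder value)

def soft_bellman_loss(q_net, batch, reward):
    """Squared soft Bellman error with a constant reward (1 for demos, 0 for agent data)."""
    s, a, s_next, done = batch  # tensors: states, actions, next states, done flags
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q(s, a)
    with torch.no_grad():
        v_next = torch.logsumexp(q_net(s_next), dim=1)      # soft value V(s') = log sum_a' exp Q(s', a')
        target = reward + GAMMA * (1.0 - done) * v_next
    return F.mse_loss(q_sa, target)

def sqil_update(q_net, optimizer, demo_buffer, agent_buffer, batch_size=64):
    """One SQIL gradient step: half demo transitions (r=1), half agent transitions (r=0)."""
    demo_batch = demo_buffer.sample(batch_size // 2)    # hypothetical buffer API
    agent_batch = agent_buffer.sample(batch_size // 2)
    loss = soft_bellman_loss(q_net, demo_batch, reward=1.0) \
         + soft_bellman_loss(q_net, agent_batch, reward=0.0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Acting is then just sampling from the soft-max policy \(\pi(a \mid s) \propto \exp Q_\theta(s,a)\) and appending those transitions to the agent buffer with reward 0.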

This uses the “soft Bellman error”. Let's first read Reinforcement Learning with Deep Energy-Based Policies.
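For reference, the soft Bellman backup as I understand it from that paper (discrete actions, temperature fixed to 1):

\[
V_\theta(s) = \log \sum_{a} \exp Q_\theta(s,a), \qquad
\delta^2(s,a,s') = \Bigl( Q_\theta(s,a) - \bigl( r + \gamma\, V_\theta(s') \bigr) \Bigr)^2 .
\]

SQIL minimizes this squared soft Bellman error with the reward \(r\) hard-coded: 1 on demonstration transitions, 0 on the agent's own transitions.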

I get the main idea of soft Q-learning, but the sampling machinery for continuous action spaces is still confusing to me.

SQIL is similar to regularized behavior cloning with an added L1 sparsity penalty \(\sum_{s,a} \left| Q_\theta(s,a) - \gamma\,\mathbb{E}_{s'}\!\left[\log\sum_{a'} \exp Q_\theta(s',a')\right] \right|\), i.e., the soft Bellman error with reward 0. Why does an L1 penalty encourage sparsity? As with lasso, the L1 norm drives many of the individual terms, here the implied rewards, exactly to zero. The penalty also injects information about the transition dynamics, since it is evaluated on sampled transitions. If we square the terms of the L1 penalty (to make it differentiable), it becomes the squared soft Bellman error.
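My rough reconstruction of the regularized-BC objective this refers to (treat the exact sums and constants as my assumption; the soft-max policy is the standard soft Q-learning one):

\[
\ell(\theta) \approx -\sum_{(s,a) \in \mathcal{D}_{\text{demo}}} \log \pi_\theta(a \mid s)
\;+\; \lambda \sum_{s,a} \left| Q_\theta(s,a) - \gamma\,\mathbb{E}_{s'}\!\left[\log\sum_{a'} \exp Q_\theta(s',a')\right] \right|,
\qquad
\pi_\theta(a \mid s) = \frac{\exp Q_\theta(s,a)}{\sum_{a'} \exp Q_\theta(s,a')}.
\]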

You can use an off-policy actor-critic method (like Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor) in continuous action spaces. 
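Either way, the SQIL-specific part is tiny: as I understand it, you keep the base algorithm's updates and only relabel rewards when transitions enter the replay buffer. A minimal sketch (the buffer API here is hypothetical):

```python
def add_transition(replay_buffer, state, action, next_state, done, is_demo):
    """SQIL reward relabeling: the environment reward is ignored entirely."""
    reward = 1.0 if is_demo else 0.0  # +1 for demonstration transitions, 0 for the agent's own
    replay_buffer.add(state, action, reward, next_state, done)  # hypothetical buffer API
```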

The Generative Adversarial Imitation Learning paper also trains with rewards, but the rewards are learned by a discriminator rather than being constant 0/1. Discriminator-Actor-Critic: Addressing Sample Inefficiency and Reward Bias in Adversarial Imitation Learning addresses a reward bias in GAIL.

There were 2 other methods that use constant instead of learned rewards: Sample-Efficient Imitation Learning for Continuous Control, and Random Expert Distillation: Imitation Learning via Expert Policy Support Estimation.

Possible extension for the future: try to also recover the expert's reward function.