Collect & Infer - a fresh look at data-efficient Reinforcement Learning

The paradigm builds on ideas from batch RL.

The “collect experience” and “infer policy” stages interact through two shared structures: a policy pool and a transition memory (a minimal sketch follows).
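
A rough sketch of those two shared structures, assuming nothing beyond the description above; the class and type names are mine, not the paper's:

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple

Transition = Tuple[Any, Any, float, Any]  # (state, action, reward, next_state)
Policy = Callable[[Any], Any]             # a policy maps a state to an action

@dataclass
class TransitionMemory:
    """Append-only store of every transition ever collected."""
    transitions: List[Transition] = field(default_factory=list)

    def add(self, transition: Transition) -> None:
        self.transitions.append(transition)

@dataclass
class PolicyPool:
    """Policies available to the collection stage; inference adds new ones."""
    policies: List[Policy] = field(default_factory=list)

    def add(self, policy: Policy) -> None:
        self.policies.append(policy)
```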

It also ties into the growing interest in offline RL.

Two separate objectives (the decoupled loop is sketched after the list):

  • “Optimal inference”: given a fixed batch of data, train the best possible policy.
  • “Optimal collection”: given an inference process, collect the minimal set of data that yields an optimally performing policy.
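
A minimal sketch of the decoupled loop; this is my structure, not the paper's pseudocode. `env` is assumed to expose `reset()`/`step()` in the classic gym style, and `train_offline` stands in for whatever batch/offline learner performs the inference:

```python
import random
from typing import Any, Callable, List, Tuple

Transition = Tuple[Any, Any, float, Any]  # (state, action, reward, next_state)
Policy = Callable[[Any], Any]

def collect(env, behavior: Policy, memory: List[Transition], steps: int) -> None:
    """Collection phase: run a behavior policy, append everything to memory."""
    state = env.reset()
    for _ in range(steps):
        action = behavior(state)
        next_state, reward, done, _info = env.step(action)
        memory.append((state, action, reward, next_state))
        state = env.reset() if done else next_state

def infer(train_offline: Callable[[List[Transition]], Policy],
          memory: List[Transition]) -> Policy:
    """Inference phase: train purely from the fixed batch, no env access."""
    return train_offline(memory)

def collect_and_infer(env, train_offline: Callable[[List[Transition]], Policy],
                      initial_policy: Policy, rounds: int,
                      steps_per_round: int) -> List[Policy]:
    memory: List[Transition] = []           # shared transition memory
    pool: List[Policy] = [initial_policy]   # shared policy pool
    for _ in range(rounds):
        behavior = random.choice(pool)      # any pool policy may collect
        collect(env, behavior, memory, steps_per_round)
        pool.append(infer(train_offline, memory))
    return pool
```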

Because the two phases are decoupled, the collection phase no longer needs to care about regret during training; behavior can be purely exploratory (one possible instantiation below).
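
One hypothetical way to exploit that freedom (my illustration, not the paper's method): a behavior policy that ignores reward entirely and simply chases under-tried state-action pairs via visit counts:

```python
import random
from collections import defaultdict
from typing import Any, Callable, Dict, Sequence, Tuple

def make_coverage_policy(actions: Sequence[Any]) -> Callable[[Any], Any]:
    """Pick the least-tried action in each (hashable, e.g. discretized) state:
    pure coverage, with zero attention paid to reward or online return."""
    counts: Dict[Tuple[Any, Any], int] = defaultdict(int)

    def policy(state: Any) -> Any:
        # Break ties randomly so early behavior is not biased by action order.
        action = min(actions, key=lambda a: (counts[(state, a)], random.random()))
        counts[(state, action)] += 1
        return action

    return policy
```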

The true objective cannot be evaluated from a fixed dataset alone, so inference optimizes a surrogate objective L_I defined purely in terms of the collected data.
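
To make that concrete, one assumed instantiation (only the symbol L_I comes from the source; J, Q_theta, and the TD form are my notation): the true objective is the expected return J, which needs further interaction to estimate, so inference could minimize a fitted-Q loss computed only over transitions in the dataset:

```latex
% Assumed instantiation: J is the true return objective; L_I is a
% fitted-Q / TD surrogate over the dataset D; \bar\theta denotes
% frozen target-network parameters.
\[
  J(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t}\right]
  \qquad\longrightarrow\qquad
  L_I(\theta; \mathcal{D}) =
  \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}
  \left[\Big(Q_{\theta}(s,a) - r - \gamma \max_{a'} Q_{\bar{\theta}}(s',a')\Big)^{2}\right]
\]
```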

One direction: focus on collecting a dataset that lets a new task be learned rapidly, fully offline.

The paradigm also applies to skill-learning architectures, such as hierarchical RL.