about batch RL
the “collect experience” and “infer policy” stages share a policy pool and a transition memory (replay buffer)
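a minimal sketch of that split, assuming a toy `env_step` and a random behavior policy — the two stages talk only through the shared structures; all names here are illustrative, not from any library:

```python
import random
from collections import deque

transition_memory = deque(maxlen=10_000)  # shared replay buffer
policy_pool = []                          # snapshots of inferred policies

def behavior_policy(state):
    # stand-in for whatever policy the collect stage runs (here: random)
    return random.choice([0, 1])

def collect(env_step, state, n_steps):
    # "collect experience": roll out the behavior policy, store transitions
    for _ in range(n_steps):
        action = behavior_policy(state)
        next_state, reward = env_step(state, action)
        transition_memory.append((state, action, reward, next_state))
        state = next_state
    return state

def infer():
    # "infer policy": learn purely from the shared memory (offline),
    # then publish the result into the policy pool
    batch = list(transition_memory)
    # ... fit a policy to `batch` here (omitted) ...
    policy = {"trained_on": len(batch)}
    policy_pool.append(policy)
    return policy
```

note the asymmetry: `collect` never reads the memory, `infer` never touches the environment.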
relates to the interest in offline RL
two separate objectives:
- given a fixed batch of data, how do we train the best possible policy? ("optimal inference")
- given an inference process, what is the minimal set of data needed to get an optimally performing policy? ("optimal collection")
the collection phase no longer needs to minimize regret during training
a surrogate objective L_I is defined in terms of the collected data
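the notes don't pin down L_I, but one common instantiation is a TD-style loss evaluated only on the fixed batch — a hypothetical tabular sketch (the dict-based Q and binary action set are assumptions for illustration):

```python
# L_I: mean squared TD error over the collected transitions (s, a, r, s').
# Minimizing it is what "optimal inference" operates on, with no further
# environment interaction.

def L_I(Q, batch, gamma=0.9):
    total = 0.0
    for s, a, r, s2 in batch:
        target = r + gamma * max(Q.get((s2, b), 0.0) for b in (0, 1))
        total += (Q.get((s, a), 0.0) - target) ** 2
    return total / len(batch)

def fitted_q_step(Q, batch, gamma=0.9):
    # one sweep of fitted Q iteration: regress Q toward the TD targets,
    # using only the fixed batch
    new_Q = dict(Q)
    for s, a, r, s2 in batch:
        new_Q[(s, a)] = r + gamma * max(Q.get((s2, b), 0.0) for b in (0, 1))
    return new_Q
```

under this reading, "optimal collection" asks which `batch` makes minimizing L_I yield the best policy.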
we could instead focus on collecting a dataset that allows a new task to be learned rapidly offline
this also applies to skill-learning architectures, e.g. hierarchical RL