about batch RL
the “collect experience” and “infer policy” stages share a policy pool and a transition memory (replay buffer)
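a minimal sketch of that split, assuming a toy `env_step` and a random behavior policy — the two stages talk only through the shared structures; all names here are illustrative, not from any library:

```python
import random
from collections import deque

transition_memory = deque(maxlen=10_000)  # shared replay buffer
policy_pool = []                          # snapshots of inferred policies

def behavior_policy(state):
    # stand-in for whatever policy the collect stage runs (here: random)
    return random.choice([0, 1])

def collect(env_step, state, n_steps):
    # "collect experience": roll out the behavior policy, store transitions
    for _ in range(n_steps):
        action = behavior_policy(state)
        next_state, reward = env_step(state, action)
        transition_memory.append((state, action, reward, next_state))
        state = next_state
    return state

def infer():
    # "infer policy": learn purely from the shared memory (offline),
    # then publish the result into the policy pool
    batch = list(transition_memory)
    # ... fit a policy to `batch` here (omitted) ...
    policy = {"trained_on": len(batch)}
    policy_pool.append(policy)
    return policy
```

note the asymmetry: `collect` never reads the memory, `infer` never touches the environment.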
relates to the interest in offline RL
two separate objectives:
- given a fixed batch of data, how do we train the best possible policy? ("optimal inference")
- given an inference process, what is the minimal set of data needed to get an optimally performing policy? ("optimal collection")
the collection phase no longer needs to minimize regret during training
a surrogate objective L_I is defined in terms of the collected data
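the notes don't pin down L_I, but one common instantiation is a TD-style loss evaluated only on the fixed batch — a hypothetical tabular sketch (the dict-based Q and binary action set are assumptions for illustration):

```python
# L_I: mean squared TD error over the collected transitions (s, a, r, s').
# Minimizing it is what "optimal inference" operates on, with no further
# environment interaction.

def L_I(Q, batch, gamma=0.9):
    total = 0.0
    for s, a, r, s2 in batch:
        target = r + gamma * max(Q.get((s2, b), 0.0) for b in (0, 1))
        total += (Q.get((s, a), 0.0) - target) ** 2
    return total / len(batch)

def fitted_q_step(Q, batch, gamma=0.9):
    # one sweep of fitted Q iteration: regress Q toward the TD targets,
    # using only the fixed batch
    new_Q = dict(Q)
    for s, a, r, s2 in batch:
        new_Q[(s, a)] = r + gamma * max(Q.get((s2, b), 0.0) for b in (0, 1))
    return new_Q
```

under this reading, "optimal collection" asks which `batch` makes minimizing L_I yield the best policy.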
we could instead focus on collecting a dataset that allows a new task to be learned rapidly offline
this also applies to skill-learning architectures, e.g. hierarchical RL