Idea: learn how to construct implicit reward function from pairwise rankings, a-la Learning to summarize from human feedback, Deep Reinforcement Learning from Human Preferences
example:
- input: images (e.g., MNIST), predict “is this a bigger / smaller number”