Implement toy “learn to order” model

Idea: learn an implicit reward function from pairwise rankings, à la “Learning to summarize from human feedback” and “Deep Reinforcement Learning from Human Preferences”

example:

  • input: images (e.g., MNIST digits), compared in pairs; predict “is this a bigger / smaller number?”
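
One way the example above might be prototyped is with a Bradley-Terry style pairwise loss: a scalar reward r(x) is learned so that P(a ranked above b) = sigmoid(r(a) - r(b)). The sketch below is only an illustration under stated assumptions. It uses the standard library alone, and `fake_image` is a hypothetical stand-in that returns a noisy one-hot feature vector instead of a real MNIST image. The linear reward, the learning rate, and the step count are likewise illustrative choices, not part of the original idea; a real version would presumably swap in a small CNN over actual images.

```python
import math
import random

random.seed(0)
NUM_DIGITS = 10


def fake_image(digit):
    """Hypothetical stand-in for an MNIST image: a noisy one-hot feature vector."""
    return [(1.0 if i == digit else 0.0) + random.gauss(0.0, 0.05)
            for i in range(NUM_DIGITS)]


def reward(w, x):
    """Scalar reward r(x) = w . x (assumption: linear; a real model could be a CNN)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sample_pair():
    """Draw two distinct digits; return (features_a, features_b, a_is_bigger)."""
    a, b = random.sample(range(NUM_DIGITS), 2)
    return fake_image(a), fake_image(b), a > b


def train(steps=5000, lr=0.5):
    """Fit the reward weights by SGD using only pairwise 'bigger / smaller' labels."""
    w = [0.0] * NUM_DIGITS
    for _ in range(steps):
        xa, xb, a_bigger = sample_pair()
        winner, loser = (xa, xb) if a_bigger else (xb, xa)
        # Bradley-Terry: P(winner ranked above loser) = sigmoid(r(winner) - r(loser)).
        p = sigmoid(reward(w, winner) - reward(w, loser))
        # Gradient descent step on the loss -log(p).
        for i in range(NUM_DIGITS):
            w[i] += lr * (1.0 - p) * (winner[i] - loser[i])
    return w


def pairwise_accuracy(w, trials=1000):
    """Fraction of fresh pairs whose order the learned reward gets right."""
    correct = 0
    for _ in range(trials):
        xa, xb, a_bigger = sample_pair()
        correct += (reward(w, xa) > reward(w, xb)) == a_bigger
    return correct / trials


weights = train()
accuracy = pairwise_accuracy(weights)
print(f"held-out pairwise accuracy: {accuracy:.3f}")
```

The model never sees the digit labels themselves, only which item of each pair is bigger, so the ordering over digits has to emerge implicitly in the learned reward values.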