AIXI-like RL algorithm

It would be nice to have one. It should:

  • use as many samples as are available
  • compute for as long as I allow, then return the best policy found so far
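
A minimal sketch of the "anytime" behaviour described above, not an AIXI approximation. It assumes a small finite set of candidate policies and a log of (state, action, reward) samples. The names `estimate_value` and `anytime_policy_search` and the matching-based value estimate are hypothetical choices made for illustration. Every available sample is used to score each candidate, and the search stops once the time budget is spent.

```python
import time
from itertools import product


def estimate_value(policy, samples):
    """Mean reward over the logged samples whose action agrees with `policy`.

    This is a crude stand-in for a real value estimate (an assumption of
    this sketch). Returns -inf when no sample matches the policy.
    """
    rewards = [r for (s, a, r) in samples if policy.get(s) == a]
    if not rewards:
        return float("-inf")
    return sum(rewards) / len(rewards)


def anytime_policy_search(samples, candidate_policies, budget_seconds):
    """Return (best_policy, best_score) found before the budget runs out.

    At least one candidate is always evaluated, so a policy is returned
    even with a zero budget (provided the candidate list is non-empty).
    """
    deadline = time.monotonic() + budget_seconds
    best_policy, best_score = None, float("-inf")
    for policy in candidate_policies:
        score = estimate_value(policy, samples)  # uses all samples
        if best_policy is None or score > best_score:
            best_policy, best_score = policy, score
        if time.monotonic() >= deadline:
            break  # out of compute: keep the best policy so far
    return best_policy, best_score


# Toy data: two states, two actions; reward 1.0 when action == state.
samples = [(0, 0, 1.0), (0, 1, 0.0), (1, 0, 0.0), (1, 1, 1.0)]
candidates = [{0: a0, 1: a1} for a0, a1 in product((0, 1), repeat=2)]

if __name__ == "__main__":
    print(anytime_policy_search(samples, candidates, budget_seconds=1.0))
```

A larger budget lets the loop cover more candidates, so the returned policy can only get better (or stay the same) as the allowed compute grows.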

I wonder:

  • how does RL performance vary with network size? does a network that is too big hurt? does a ResNet help? what about other architectures?
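
One way to set up such a comparison is to enumerate the configurations first and record their parameter counts, so that results can later be plotted against model size. The sketch below assumes plain MLPs with equal-width hidden layers and identity skip connections (which add no parameters). The function names and the grid values are hypothetical, and the RL training run itself is left out.

```python
from itertools import product


def mlp_param_count(in_dim, hidden, depth, out_dim):
    """Weights plus biases of an MLP with `depth` hidden layers of width `hidden`."""
    dims = [in_dim] + [hidden] * depth + [out_dim]
    return sum(a * b + b for a, b in zip(dims[:-1], dims[1:]))


def sweep_configs(in_dim, out_dim, widths, depths):
    """List every (width, depth, residual) combination with its parameter count.

    Identity skip connections between equal-width layers add no parameters,
    so the residual and plain variants share the same count.
    """
    configs = []
    for hidden, depth, residual in product(widths, depths, (False, True)):
        configs.append({
            "hidden": hidden,
            "depth": depth,
            "residual": residual,
            "params": mlp_param_count(in_dim, hidden, depth, out_dim),
        })
    return configs


if __name__ == "__main__":
    for cfg in sweep_configs(in_dim=4, out_dim=2, widths=[8, 16], depths=[1, 2]):
        print(cfg)
```

Each configuration would then be trained with the same RL algorithm and sample budget, so that differences in return can be attributed to size and architecture.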