AIXI-like RL algorithm

It would be nice to have one.

I wonder:

How does RL performance vary with network size? Does a network that is too large hurt? Do ResNet-style residual connections help? What about other architectures?
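
One way to start poking at these questions is to fix everything except the network and sweep its width, with and without skip connections. The snippet below is only a rough sketch under my own assumptions: a plain ReLU MLP written in NumPy, made-up sizes (`obs_dim`, `n_actions`, the widths), and hypothetical helper names (`init_mlp`, `forward`, `param_count`). It builds the candidate networks and counts their parameters; the actual RL training loop and any measurement of performance are not included.

```python
import numpy as np

def init_mlp(rng, in_dim, width, depth, out_dim):
    """Return a list of (W, b) pairs: in_dim -> width (x depth) -> out_dim."""
    dims = [in_dim] + [width] * depth + [out_dim]
    return [
        (rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
        for m, n in zip(dims[:-1], dims[1:])
    ]

def forward(params, x, residual=False):
    """ReLU MLP. With residual=True, add an identity skip around every
    hidden layer whose input and output shapes match."""
    h = x
    for W, b in params[:-1]:
        z = np.maximum(h @ W + b, 0.0)
        h = h + z if residual and z.shape == h.shape else z
    W, b = params[-1]
    return h @ W + b  # linear output layer, e.g. one value per action

def param_count(params):
    """Total number of trainable scalars."""
    return sum(W.size + b.size for W, b in params)

# Hypothetical sweep: the sizes below are placeholders, not recommendations.
rng = np.random.default_rng(0)
obs_dim, n_actions = 4, 2
for width in (16, 64, 256):
    for residual in (False, True):
        params = init_mlp(rng, obs_dim, width, depth=3, out_dim=n_actions)
        out = forward(params, np.ones((1, obs_dim)), residual=residual)
        print(width, residual, param_count(params), out.shape)
```

Identity skips add no parameters, so the plain and residual variants at a given width have the same size, which keeps that comparison about the architecture and not about capacity.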