Why is Adam loss-scale invariant?
Used in: Scaling Laws for Reward Model Overoptimization
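
A minimal sketch of one way to see the invariance, assuming the question refers to the standard Adam update with bias correction. Multiplying the loss by a constant c > 0 multiplies every gradient by c, so the first-moment estimate scales by c and the square root of the second-moment estimate also scales by c; their ratio, which sets the step, is unchanged. This holds exactly only when epsilon is zero; with a nonzero epsilon the invariance is approximate and breaks down once gradients are comparable in size to epsilon. The function name `adam_updates` and the gradient values below are hypothetical, chosen only for illustration.

```python
import math


def adam_updates(grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=0.0):
    """Return the per-step Adam updates for a single scalar parameter."""
    m = 0.0  # first-moment (mean) estimate
    v = 0.0  # second-moment (uncentered variance) estimate
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
        updates.append(-lr * m_hat / (math.sqrt(v_hat) + eps))
    return updates


def max_abs_diff(xs, ys):
    return max(abs(x - y) for x, y in zip(xs, ys))


# Hypothetical, nonzero gradient sequence (illustrative values only).
grads = [0.5, -1.2, 0.3, 2.0, -0.7]

# Case 1: eps = 0. Scaling the loss (and hence the gradients) by 1000
# leaves the updates unchanged up to floating-point rounding.
base = adam_updates(grads, eps=0.0)
scaled = adam_updates([1000.0 * g for g in grads], eps=0.0)
diff_no_eps = max_abs_diff(base, scaled)

# Case 2: eps = 1e-8 with gradients shrunk to about the size of eps.
# Here eps is no longer negligible and the updates visibly differ.
base_eps = adam_updates(grads, eps=1e-8)
tiny_eps = adam_updates([1e-8 * g for g in grads], eps=1e-8)
diff_with_eps = max_abs_diff(base_eps, tiny_eps)

print("max difference, eps = 0:   ", diff_no_eps)
print("max difference, eps = 1e-8:", diff_with_eps)
```

Plain SGD, by contrast, has updates proportional to the gradient, so the same rescaling would multiply every step by c.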