Transformers learn in-context by gradient descent

auto-created for paper ID 2212.07677