separately train verifiers that score a batch of generated candidate solutions and select the highest-ranked one
got a small model to perform as well as a fine-tuned model 30x larger
new dataset “GSM8k” of ~8k grade school math word problems; high diversity
“verification scales more effectively with increased data than a fine-tuning baseline”
questions:
- how do they generate training data for the verifiers?
"catastrophic errors": autoregressive models can't correct themselves once they make a mistake; fine-tuning alone has poor scaling (with parameter count, etc.)
- dropout is a strong regularizer - important for finetuning & verification performance
the MATH dataset (from “Measuring Mathematical Problem Solving With the MATH Dataset”) is significantly harder than GSM8k
also CommonsenseQA, LogiQA datasets
recent work on math problem solving used specialized encoder-decoder architectures, often built on pretrained encoders like BERT
- Towards interpretable math word problem solving with operation-based formalisms
- Scalable integration of distributed and symbolic representations for reading comprehension
idea from Measuring Mathematical Problem Solving With the MATH Dataset: corpus from Khan Academy & Mathematica scripts
fine-tuning uses the same objective as GPT-3 in Language Models are Few-Shot Learners
test-time evaluation of the fine-tuning baseline: sample a single low-temperature solution, check its final answer
verification: sample multiple high-temperature solutions, assign scores, output highest ranked solution
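a minimal sketch of that best-of-N verification procedure, with stub functions standing in for the actual sampler and verifier (all names here are hypothetical, not the paper's code):

```python
def best_of_n(problem, sample_fn, verifier_fn, n=100):
    """Sample n high-temperature candidate solutions, score each with
    the verifier, and return the highest-ranked one."""
    candidates = [sample_fn(problem) for _ in range(n)]
    scores = [verifier_fn(problem, c) for c in candidates]
    best_idx = max(range(n), key=lambda i: scores[i])
    return candidates[best_idx]

# Stub models for illustration: the "sampler" cycles through canned
# solutions, the "verifier" just scores by solution length.
solutions = iter(["a = 2", "a = 2 + 2 = 4", "a = 3"])
pick = best_of_n("2 + 2 = ?",
                 sample_fn=lambda p: next(solutions),
                 verifier_fn=lambda p, c: len(c),
                 n=3)
# pick == "a = 2 + 2 = 4"
```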
verifiers: the training signal is based solely on whether a solution reached the correct final answer
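how those binary labels could be built — the `#### answer` final-answer marker is GSM8k's convention, the helper names are mine:

```python
import re

def final_answer(solution: str):
    """Extract the final answer after the '#### ' marker (GSM8k format)."""
    m = re.search(r"####\s*(-?[\d,\.]+)", solution)
    return m.group(1).replace(",", "") if m else None

def verifier_labels(samples, gold_answer):
    """Label each sampled solution 1 if its final answer matches the
    gold answer, else 0 -- the reasoning itself is never checked."""
    return [int(final_answer(s) == gold_answer) for s in samples]

labels = verifier_labels(
    ["... so she has 7 left.\n#### 7", "... total is 9.\n#### 9"],
    gold_answer="7")
# labels == [1, 0]
```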
I wonder whether richer feedback would be useful here - e.g., “here is the logic error”
to avoid arithmetic errors, all models are trained to “use a calculator” - calculation annotations
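GSM8k solutions embed these annotations as `<<expr=result>>` (e.g. `<<48/2=24>>`); at test time the annotated expression can be evaluated and the model's written result overridden. a rough sketch of that step (my own implementation, not the paper's):

```python
import re

def apply_calculator(solution: str) -> str:
    """Replace the result in each <<expr=result>> annotation with the
    actual value of expr, correcting any arithmetic slips."""
    def fix(m):
        expr = m.group(1)
        value = eval(expr, {"__builtins__": {}})  # annotations are plain arithmetic
        # Render whole numbers without a trailing .0
        if isinstance(value, float) and value.is_integer():
            value = int(value)
        return f"<<{expr}={value}>>"
    return re.sub(r"<<([^=<>]+)=([^<>]*)>>", fix, solution)

fixed = apply_calculator("She bakes <<48/2=25>> cupcakes.")
# the model's wrong "25" is overwritten with the computed 24
```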
verifiers are actually trained to predict solution correctness after every token, not just at the end.
using “residual dropout” (from Attention Is All You Need) at a rate of 20%
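for reference, “residual dropout” means dropout applied to each sublayer's output before it is added back to the residual stream — a toy sketch, not the paper's code:

```python
import random

def dropout(xs, p=0.2):
    """Inverted dropout: zero each element with prob p, scale survivors by 1/(1-p)."""
    return [0.0 if random.random() < p else x / (1 - p) for x in xs]

def residual_block(x, sublayer, p=0.2):
    """Residual dropout: drop out the sublayer output, then add the residual."""
    return [xi + di for xi, di in zip(x, dropout(sublayer(x), p=p))]

random.seed(0)
out = residual_block([1.0, 2.0], sublayer=lambda v: [2 * vi for vi in v], p=0.2)
```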