Training Verifiers to Solve Math Word Problems

separately training verifiers that score a set of generated candidate solutions and select the highest-ranked one

got a small model to perform as well as a fine-tuned 30x larger model

new dataset “GSM8k” of ~8k grade school math word problems; high diversity
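
a quick way to poke at the data (assuming it's published on the Hugging Face Hub under the `gsm8k` identifier with a `main` config; field names below are from that hosted version):

```python
# Sketch: load GSM8K with the Hugging Face `datasets` library.
# Hub id / config name are assumptions about the hosted copy, not from the paper itself.
from datasets import load_dataset

gsm8k = load_dataset("gsm8k", "main")   # splits: "train" (~7.5k) and "test" (~1.3k)
example = gsm8k["train"][0]
print(example["question"])              # natural-language word problem
print(example["answer"])                # step-by-step solution ending in "#### <number>"
```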

“verification scales more effectively with increased data than a fine-tuning baseline”

questions:

  • how do they generate training data for the verifiers?

"catastrophic errors": autoregressive models can't correct themselves once they make a mistake; fine-tuning alone has poor scaling (with parameter count, etc.)

  • dropout is a strong regularizer - important for finetuning & verification performance

the MATH dataset (from “Measuring Mathematical Problem Solving With the MATH Dataset”) is significantly harder than GSM8k

also CommonsenseQA, LogiQA datasets

recent work on math problem solving used specialized encoder-decoder architectures, often built on pretrained encoders like BERT

idea from Measuring Mathematical Problem Solving With the MATH Dataset: corpus from Khan Academy & Mathematica scripts

fine-tuning uses the same objective as GPT-3 in Language Models are Few-Shot Learners, i.e. standard autoregressive next-token cross-entropy
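
a minimal sketch of that objective, with a stand-in GPT-2 model/tokenizer rather than the paper's GPT-3-family models, and placeholder problem text:

```python
# Sketch: autoregressive fine-tuning loss (next-token cross-entropy), the same LM
# objective as GPT-3. Model choice and example text are placeholders, not the paper's.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

question = "A farmer has 12 apples and gives away 5. How many are left?"  # placeholder
solution = "12 - 5 = 7\n#### 7"                                           # placeholder
ids = tokenizer(question + "\n" + solution, return_tensors="pt").input_ids

# labels == input_ids: every token is predicted from its left context
loss = model(input_ids=ids, labels=ids).loss
loss.backward()
```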

test-time evaluation for the fine-tuning baseline: sample a single low-temperature solution, check its final answer
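
the check only looks at the final numeric answer; a minimal grader, assuming the GSM8K convention that solutions end with a “#### <answer>” marker:

```python
# Sketch: grade a sampled solution by its final answer only.
# Assumes solutions end with a GSM8K-style "#### <answer>" marker.
import re
from typing import Optional

def extract_final_answer(text: str) -> Optional[str]:
    """Pull the number after the '####' marker, if present."""
    match = re.search(r"####\s*(-?[\d,\.]+)", text)
    return match.group(1).replace(",", "") if match else None

def is_correct(sampled_solution: str, reference_solution: str) -> bool:
    predicted = extract_final_answer(sampled_solution)
    target = extract_final_answer(reference_solution)
    return predicted is not None and predicted == target
```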

verification at test time: sample multiple high-temperature solutions, score each with the verifier, output the highest-ranked one
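
rough sketch of that selection loop; `generate_solution` and `verifier_score` are assumed interfaces passed in as callables, and the default sample count is my assumption rather than a quoted number:

```python
# Sketch of test-time verification: sample N high-temperature candidates, score each
# with the verifier, return the best-scoring one. The callables are assumed interfaces.
from typing import Callable

def solve_with_verifier(question: str,
                        generate_solution: Callable[[str], str],      # one high-temp sample
                        verifier_score: Callable[[str, str], float],  # scores (question, solution)
                        n_samples: int = 100) -> str:
    candidates = [generate_solution(question) for _ in range(n_samples)]
    scores = [verifier_score(question, sol) for sol in candidates]
    best = max(range(n_samples), key=lambda i: scores[i])
    return candidates[best]
```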

verifiers: training signal is determined solely by whether the generated solution reaches the correct final answer
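
this also answers the question above about verifier training data: the fine-tuned generator samples many solutions per training problem, and each sample is labeled only by whether its final answer matches the reference. A hedged sketch (reuses `extract_final_answer` from the grading sketch; `generate_solution` is still an assumed interface, and the per-problem sample count is my assumption):

```python
# Sketch: build verifier training data by sampling many solutions per training problem
# from the fine-tuned generator and labeling each one by final-answer match.
from typing import Callable, Iterable, List, Tuple

def make_verifier_data(problems: Iterable[Tuple[str, str]],          # (question, reference_solution)
                       generate_solution: Callable[[str], str],      # assumed generator interface
                       samples_per_problem: int = 100) -> List[Tuple[str, str, float]]:
    data = []
    for question, reference_solution in problems:
        target = extract_final_answer(reference_solution)
        for _ in range(samples_per_problem):
            sol = generate_solution(question)                         # high-temperature sample
            label = float(extract_final_answer(sol) == target)        # 1.0 = correct final answer
            data.append((question, sol, label))
    return data
```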

I wonder whether richer feedback would be useful here - e.g., “here is the logic error”

to avoid arithmetic errors, all models are trained to “use a calculator” via calculation annotations injected into the training solutions
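
the annotations wrap each arithmetic step (e.g. <<5*3=15>>) so that at test time a calculator can override the sampled result; a rough sketch of that decoding hook (the exact mechanics here are my reading, not the paper's code):

```python
# Sketch: calculation annotations. When generation has just produced "<<expr=",
# compute the expression and force the result plus the closing ">>" instead of
# letting the model sample the arithmetic itself.
import re
from typing import Optional

def calculator_override(generated_so_far: str) -> Optional[str]:
    """Return '<result>>>' if an annotation was just opened, else None."""
    match = re.search(r"<<([0-9+\-*/. ()]+)=$", generated_so_far)
    if match is None:
        return None
    result = eval(match.group(1))     # arithmetic-only expression; fine for a sketch
    return f"{result}>>"
```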

verifiers are actually trained to predict solution correctness after every token, not just at the end.
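
i.e. a scalar head emits a correctness prediction at every token position, with all positions trained against the single solution-level label; a minimal sketch of that loss (the value-head wiring is my assumption about the implementation):

```python
# Sketch: token-level verifier loss. A scalar head on the LM's hidden states predicts
# "is this solution correct?" at every token; all positions share one solution-level label.
import torch
import torch.nn.functional as F

def verifier_loss(hidden_states: torch.Tensor,           # (batch, seq_len, hidden) from the LM
                  value_head: torch.nn.Linear,           # maps hidden -> 1 scalar per token
                  labels: torch.Tensor) -> torch.Tensor: # (batch,) floats, 1.0 = correct solution
    logits = value_head(hidden_states).squeeze(-1)        # per-token correctness logits
    targets = labels.unsqueeze(1).expand_as(logits)       # same label broadcast to every position
    return F.binary_cross_entropy_with_logits(logits, targets)
```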

using “residual dropout”, from Attention Is All You Need, at 20%
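
in a Hugging Face GPT-2-style config this corresponds to something like the following (illustrative only; the paper fine-tunes GPT-3-family models, not this config):

```python
# Sketch: 20% residual dropout in a GPT-2-style config. Purely illustrative.
from transformers import GPT2Config

config = GPT2Config(resid_pdrop=0.2)   # dropout applied on residual connections
```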