Premise:

Larger models presented new problems:

  • It’s infeasible to train multiple large models just to find the best hyperparameters (i.e. a grid search over hyperparameters won’t work).
  • Given a fixed compute budget, should we spend it on a larger model or on more training steps?
  • We don’t know when to stop training.

Is there a connection between loss, model size, and training compute?

Background:

Well, researchers at OpenAI tried to answer exactly that. They trained models of many different sizes (parameter counts) for different numbers of training steps. They found:

  • Smaller models don’t have the capacity to take advantage of extra compute; their loss flattens out.
  • Larger models reach lower losses, but only after they’ve been given that extra compute.
  • Each loss-versus-compute curve has a shoulder: past it, continuing to train a model of that size is no longer compute-optimal (see the toy sketch after this list).
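
To make the shoulder concrete, here’s a toy sketch in Python. The functional form (an additive power law in parameters N and training tokens D, with the common C ≈ 6·N·D approximation) and every constant in it are made up for illustration; they are not the paper’s fitted values.

```python
import numpy as np

# Toy loss model, loosely of the form L(N, D) = E + A/N^alpha + B/D^beta,
# where N = parameters and D = training tokens. All constants here are
# made up for illustration -- they are NOT the paper's fitted values.
def toy_loss(n_params, compute_flops):
    tokens = compute_flops / (6 * n_params)   # rough C ~ 6*N*D approximation
    return 1.7 + 400 / n_params**0.34 + 410 / tokens**0.28

model_sizes = [1e8, 1e9, 1e10]                # parameters
compute_budgets = np.logspace(19, 23, 5)      # FLOPs

header = "compute (FLOPs)  " + "  ".join(f"{n:>10.0e} params" for n in model_sizes)
print(header)
for c in compute_budgets:
    losses = [toy_loss(n, c) for n in model_sizes]
    print(f"{c:>15.0e}  " + "  ".join(f"{l:>17.2f}" for l in losses))

# Reading down a column: a fixed model size keeps improving with compute but
# flattens toward its capacity floor. Reading across a row: the best model
# size grows with the budget -- past the shoulder, pouring more compute into
# a too-small model is no longer the optimal way to spend it.
```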

So again, they asked themselves:

Is there a connection between loss, model size, and training compute, specifically for compute-optimal training?

Applying this to real training:

We can’t just train large models willy-nilly. So they claimed:

Test loss is a power-law function of model size and compute. The constants of that power law can therefore be fitted cheaply on small models, and the fitted law extrapolated to larger sizes (a rough sketch of that procedure is below).
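
Here’s roughly what “fit the constants on small models, then extrapolate” could look like. The sweep results and the single-variable form L(N) = a·N^(-alpha) are invented for illustration; the actual work fits several such laws (in model size, data, and compute), but the mechanics are the same: fit a line in log-log space, then evaluate it at sizes you never trained.

```python
import numpy as np

# Hypothetical (made-up) results from a sweep of small models:
# model size in parameters, and the test loss each one reached.
small_model_sizes = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
small_model_losses = np.array([5.10, 4.62, 4.21, 3.83, 3.50])

# Assume a power law L(N) = a * N**(-alpha). Taking logs makes it linear:
# log L = log a - alpha * log N, so a least-squares line fit recovers the constants.
slope, intercept = np.polyfit(np.log(small_model_sizes),
                              np.log(small_model_losses), deg=1)
alpha, a = -slope, np.exp(intercept)
print(f"fitted: L(N) ~= {a:.2f} * N^(-{alpha:.3f})")

# Extrapolate the fitted law to model sizes we never trained.
for big_n in [1e9, 1e10, 1e11]:
    predicted = a * big_n**(-alpha)
    print(f"predicted loss at {big_n:.0e} params: {predicted:.2f}")
```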

They discovered: