Premise:

Larger models presented new problems:

  • It’s infeasible to train multiple large models just to find the best hyperparameters (i.e. a grid search over hyperparameters won’t work).
  • Given a fixed compute budget, should we spend it on a larger model or on more training steps?
  • We don’t know when to stop training.

Is there a connection between loss, model size, and training compute?

Background:

Well, researchers at OpenAI tried to answer exactly that. They trained models of many different sizes (parameter counts) for different numbers of training steps. They found:

  • Smaller models don’t have the capacity to take advantage of extra compute; their loss flattens out.
  • Larger models reach lower losses, but only after they’ve been given that extra compute.
  • Each loss-versus-compute curve has a shoulder: past it, continuing to train a model of that size is no longer compute-optimal (see the toy sketch after this list).
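
To make the shoulder concrete, here’s a toy sketch in Python. The functional form (an additive power law in parameters N and training tokens D, with the common C ≈ 6·N·D approximation) and every constant in it are made up for illustration; they are not the paper’s fitted values.

```python
import numpy as np

# Toy loss model, loosely of the form L(N, D) = E + A/N^alpha + B/D^beta,
# where N = parameters and D = training tokens. All constants here are
# made up for illustration -- they are NOT the paper's fitted values.
def toy_loss(n_params, compute_flops):
    tokens = compute_flops / (6 * n_params)   # rough C ~ 6*N*D approximation
    return 1.7 + 400 / n_params**0.34 + 410 / tokens**0.28

model_sizes = [1e8, 1e9, 1e10]                # parameters
compute_budgets = np.logspace(19, 23, 5)      # FLOPs

header = "compute (FLOPs)  " + "  ".join(f"{n:>10.0e} params" for n in model_sizes)
print(header)
for c in compute_budgets:
    losses = [toy_loss(n, c) for n in model_sizes]
    print(f"{c:>15.0e}  " + "  ".join(f"{l:>17.2f}" for l in losses))

# Reading down a column: a fixed model size keeps improving with compute but
# flattens toward its capacity floor. Reading across a row: the best model
# size grows with the budget -- past the shoulder, pouring more compute into
# a too-small model is no longer the optimal way to spend it.
```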

So again, they asked themselves:

Is there a connection between loss, model size, and training compute, specifically for compute-optimal training?

Applying this to real training:

We can’t just train large models willy-nilly. So they claimed:

Test loss is a power-law function of model size and compute. The constants of that power law can therefore be fitted cheaply on small models, and the fitted law extrapolated to larger sizes (a rough sketch of that procedure is below).
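
Here’s roughly what “fit the constants on small models, then extrapolate” could look like. The sweep results and the single-variable form L(N) = a·N^(-alpha) are invented for illustration; the actual work fits several such laws (in model size, data, and compute), but the mechanics are the same: fit a line in log-log space, then evaluate it at sizes you never trained.

```python
import numpy as np

# Hypothetical (made-up) results from a sweep of small models:
# model size in parameters, and the test loss each one reached.
small_model_sizes = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
small_model_losses = np.array([5.10, 4.62, 4.21, 3.83, 3.50])

# Assume a power law L(N) = a * N**(-alpha). Taking logs makes it linear:
# log L = log a - alpha * log N, so a least-squares line fit recovers the constants.
slope, intercept = np.polyfit(np.log(small_model_sizes),
                              np.log(small_model_losses), deg=1)
alpha, a = -slope, np.exp(intercept)
print(f"fitted: L(N) ~= {a:.2f} * N^(-{alpha:.3f})")

# Extrapolate the fitted law to model sizes we never trained.
for big_n in [1e9, 1e10, 1e11]:
    predicted = a * big_n**(-alpha)
    print(f"predicted loss at {big_n:.0e} params: {predicted:.2f}")
```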

They discovered: