Scaling Laws for Large Language Models

April 10, 2024 · 2 minute read

Key Ideas

  • For a fixed compute budget there appears to be an upper bound on achievable LLM performance, i.e. on the effective capacity of the model
  • Increasing data size or model size alone, while holding the other fixed, yields diminishing returns
  • Various studies have found quantified relationships between data size, model size and compute budget, which can be used to decide how to allocate resources when training an LLM.

Notes

Training Tradeoffs

The key factors in a training run are

  • Dataset size (D, the number of training tokens)
  • Model size (N, the number of model parameters)
  • Compute budget (C = FLOPs(N, D))

The problem to solve is argmin_{N, D} L(N, D) subject to FLOPs(N, D) = C, where the loss is modelled as L(N, D) = A/N^α + B/D^β + E
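
As a rough illustration, this constrained minimization can be done numerically. The sketch below assumes the common approximation C ≈ 6·N·D for training FLOPs and uses constants close to the parametric fit reported by Hoffmann et al., 2022 (E ≈ 1.69, A ≈ 406.4, B ≈ 410.7, α ≈ 0.34, β ≈ 0.28); all of these values, and the example budget, are illustrative only, and the paper's different estimation approaches give somewhat different optima.

    import numpy as np

    # Approximate constants from the Hoffmann et al., 2022 parametric fit
    # (illustrative values, not an exact reproduction of the paper).
    A, B, E = 406.4, 410.7, 1.69
    alpha, beta = 0.34, 0.28

    def loss(N, D):
        # Parametric loss L(N, D) = A / N^alpha + B / D^beta + E
        return A / N**alpha + B / D**beta + E

    def optimal_allocation(C, num_points=10_000):
        # Grid-search over model sizes; D is implied by the budget via C ~ 6*N*D.
        N = np.logspace(6, 13, num_points)   # candidate parameter counts
        D = C / (6 * N)                      # training tokens implied by the budget
        L = loss(N, D)
        i = int(np.argmin(L))
        return N[i], D[i], L[i]

    # Example: a training budget of ~1e23 FLOPs
    N_opt, D_opt, L_opt = optimal_allocation(1e23)
    print(f"N ~ {N_opt:.2e} params, D ~ {D_opt:.2e} tokens, loss ~ {L_opt:.3f}")

Because the loss is a sum of two power laws, the constrained optimum also has a closed form, with N_opt ∝ C^(β/(α+β)) and D_opt ∝ C^(α/(α+β)).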

Kaplan Scaling Laws

Findings from Kaplan et al., 2020

  • Performance depends strongly on scale and weakly on model shape
  • Increasing both dataset and model size is key to improved performance. Increasing one while keeping the other fixed leads to diminishing returns (assuming uncapped compute)
  • Larger models are more sample efficient
  • For a fixed compute budget, prioritize increasing model size over dataset size (see the sketch after this list)
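
Quantitatively, Kaplan et al. report a compute-optimal frontier of roughly N_opt ∝ C^0.73 and D_opt ∝ C^0.27, which is what motivates prioritizing model size. A minimal sketch of what those approximate exponents imply:

    # Kaplan-style allocation: with 10x more compute, grow the model much
    # faster than the data. The exponents (~0.73 / ~0.27) are the approximate
    # values reported by Kaplan et al., 2020.
    budget_multiplier = 10
    model_growth = budget_multiplier ** 0.73   # ~5.4x more parameters
    data_growth = budget_multiplier ** 0.27    # ~1.9x more training tokens
    print(f"model x{model_growth:.1f}, data x{data_growth:.1f}")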

Issue

  • The same learning-rate schedule was used for all training runs, rather than matching the schedule length to the number of training tokens, which biases the results against models trained on more data

Chinchilla Scaling Laws

Findings from Hoffmann et al., 2022

  • Fixed the learning-rate issue by matching the schedule length to the number of training tokens in each run
  • Difference from Kaplan: model size and dataset size should be scaled up by roughly the same factor
  • For a fixed compute budget, they found two power-law relationships (linear in log-log space): optimal model size and optimal dataset size both scale roughly as C^0.5 (see the sketch below)
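
A commonly quoted rule of thumb that follows from this is roughly 20 training tokens per parameter (Chinchilla itself: ~70B parameters trained on ~1.4T tokens). A minimal sketch, assuming exponents of about 0.5 for both relationships and the usual C ≈ 6·N·D FLOPs approximation:

    # Chinchilla-style compute-optimal sizing: N_opt and D_opt both scale
    # roughly as C^0.5. With C ~ 6*N*D and D ~ 20*N (tokens-per-parameter
    # rule of thumb), N ~ sqrt(C / (6 * 20)). All constants are approximate.
    def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
        n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
        n_tokens = tokens_per_param * n_params
        return n_params, n_tokens

    # Example: Chinchilla's own budget of roughly 5.8e23 FLOPs
    n, d = chinchilla_optimal(5.8e23)
    print(f"~{n:.2e} params, ~{d:.2e} tokens")   # on the order of 70B params, 1.4T tokens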
