Scaling laws show that model performance improves predictably with more data and compute.

Let’s see what that looks like in practice. Imagine we’re training a small transformer model, say, 100 million parameters. We feed it a dataset of 10 billion tokens. We train it for 100,000 steps, using a batch size of 1024. The loss might be around 3.5.

Now, we double the dataset to 20 billion tokens, keeping everything else the same. The loss drops to about 3.2.

What if we double the compute instead? We keep the 10 billion tokens but double the training steps to 200,000. The loss also drops, perhaps to 3.3.

The magic of scaling laws is that these improvements aren’t random. They follow predictable power-law relationships. For a given model architecture and compute budget, there’s an optimal split of that budget between parameters and data.
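To make "power law" concrete, here is a minimal sketch that backs out the exponent implied by the illustrative numbers above (loss 3.5 at 10 billion tokens, 3.2 at 20 billion). It assumes a pure power law of the form loss = a * x^(-alpha) with no irreducible loss term, which is a simplification; the function names are made up for this example.

```python
import math

def implied_exponent(x1, loss1, x2, loss2):
    """Exponent alpha of loss = a * x**(-alpha) passing through two (x, loss) points."""
    return math.log(loss1 / loss2) / math.log(x2 / x1)

def predict_loss(x, x_ref, loss_ref, alpha):
    """Extrapolate the loss from a reference point along the same power law."""
    return loss_ref * (x / x_ref) ** (-alpha)

# Illustrative numbers from the text: 10B -> 20B tokens, loss 3.5 -> 3.2.
alpha_data = implied_exponent(10e9, 3.5, 20e9, 3.2)

# Assumption: the same power law keeps holding when the data doubles again.
loss_at_40b = predict_loss(40e9, 10e9, 3.5, alpha_data)

print(f"implied data exponent: {alpha_data:.3f}")
print(f"extrapolated loss at 40B tokens: {loss_at_40b:.2f}")
```

On a log-log plot this relationship is a straight line, which is what makes extrapolation from small runs to larger ones plausible in the first place.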

The core problem these laws address is how to efficiently allocate limited resources (compute, data, parameters) to achieve the best possible model performance. Before scaling laws, model development was more experimental, with less clear guidance on how to improve performance beyond a certain point.

These laws are empirical: they were observed across many training runs rather than derived from first principles. The intuition is that as models get larger and are trained on more data, they can learn more complex patterns and representations of the underlying data distribution. The loss function, typically cross-entropy for language models, quantifies how well the model predicts the next token. Scaling laws show that this prediction error decreases smoothly as model size and data size increase, provided neither one becomes the bottleneck and training is reasonably well tuned.
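For readers who want to connect a loss value like 3.5 to something tangible, the sketch below computes cross-entropy as the mean negative log-probability assigned to the correct next token, and converts it to perplexity. The probabilities are hypothetical, and the loss is assumed to be measured in nats (natural log), which is the usual convention but not something stated above.

```python
import math

def cross_entropy(probs_of_correct_token):
    """Mean negative log-likelihood (in nats) over a sequence of next-token predictions."""
    total = sum(math.log(p) for p in probs_of_correct_token)
    return -total / len(probs_of_correct_token)

def perplexity(loss_nats):
    """Perplexity is the exponential of the cross-entropy loss."""
    return math.exp(loss_nats)

# Hypothetical probabilities a model assigned to the true next token at four positions.
probs = [0.5, 0.1, 0.02, 0.25]
loss = cross_entropy(probs)

print(f"cross-entropy: {loss:.3f} nats")
print(f"perplexity at this loss: {perplexity(loss):.1f}")
print(f"perplexity at loss 3.5: {perplexity(3.5):.1f}")
```

Read this way, a drop in loss from 3.5 to 3.2 means the model is, on average, noticeably less "surprised" by each token.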

The key levers you control are:

  • Model Size (N): The number of parameters in the model. Larger models can capture more intricate relationships.
  • Dataset Size (D): The number of tokens in the training corpus. More data exposes the model to a wider variety of linguistic phenomena.
  • Compute (C): The total number of floating-point operations (FLOPs) used for training. For a fixed model size, it is set by the number of training steps and the batch size.
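The three levers are tied together. A widely used rule of thumb, which is an assumption brought in here rather than something stated above, is that training a dense transformer costs about 6 FLOPs per parameter per token, so C ≈ 6·N·D. A rough sketch for the 100-million-parameter, 10-billion-token example:

```python
def training_flops(n_params, n_tokens, flops_per_param_token=6.0):
    """Approximate training compute, assuming C ~= 6 * N * D (a rule of thumb, not exact)."""
    return flops_per_param_token * n_params * n_tokens

# Illustrative configuration from the text: N = 100M parameters, D = 10B tokens.
flops = training_flops(100e6, 10e9)
print(f"approximate training compute: {flops:.1e} FLOPs")
```

The constant ignores attention cost and hardware utilization, so treat the result as an order-of-magnitude estimate.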

A commonly cited finding is that the optimal number of training tokens scales sublinearly with model size, roughly $D \propto N^{0.7}$, although the exact exponent varies between studies. This implies that as you increase model size, you don’t need to increase the dataset size proportionally; a somewhat slower increase is sufficient. Since training compute is approximately proportional to the product $N \cdot D$, optimal compute then scales as roughly $C \propto N^{1.7}$, meaning compute requirements grow faster than model size.
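An exponent is easiest to read as a multiplier. The short sketch below shows what $D \propto N^{0.7}$ implies when the model grows by a given factor; `scale_factor` is a hypothetical helper, and the same arithmetic applies to any other exponent.

```python
def scale_factor(growth, exponent):
    """If y is proportional to x**exponent, growing x by `growth` multiplies y by this factor."""
    return growth ** exponent

DATA_EXPONENT = 0.7  # taken from the relation D ∝ N^0.7 in the text

data_growth_2x = scale_factor(2.0, DATA_EXPONENT)
data_growth_10x = scale_factor(10.0, DATA_EXPONENT)

print(f"2x the parameters  -> about {data_growth_2x:.2f}x the tokens")
print(f"10x the parameters -> about {data_growth_10x:.2f}x the tokens")
```

So under this relation a model ten times larger calls for roughly five times the data, not ten times.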

The practical implication is that if you have a fixed compute budget, you can determine the ideal model size and dataset size to maximize performance. For example, if you have $C$ FLOPs, you can calculate the optimal $N$ and $D$ that satisfy the scaling relationships.
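As a rough sketch of that calculation: combining $D = k \cdot N^{0.7}$ with the rule-of-thumb cost $C \approx 6ND$ gives a closed form for $N$. Both the factor of 6 and the constant `k` are assumptions here. `k` is calibrated, purely for illustration, so that the 100M-parameter, 10B-token example sits exactly on the curve; a real value would come from fitting actual training runs.

```python
def optimal_allocation(compute_flops, k, beta=0.7, flops_per_param_token=6.0):
    """Solve C = c * N * D together with D = k * N**beta for N and D.

    Substituting D gives C = c * k * N**(1 + beta), so
    N = (C / (c * k)) ** (1 / (1 + beta)).
    """
    n_params = (compute_flops / (flops_per_param_token * k)) ** (1.0 / (1.0 + beta))
    n_tokens = k * n_params ** beta
    return n_params, n_tokens

# Hypothetical calibration: assume (N=100M, D=10B) lies on the optimal curve.
k = 10e9 / (100e6) ** 0.7

n, d = optimal_allocation(6e18, k)      # recovers the calibration point
n10, d10 = optimal_allocation(6e19, k)  # ten times the compute budget

print(f"C = 6e18: N = {n:.3g} params, D = {d:.3g} tokens")
print(f"C = 6e19: N = {n10:.3g} params, D = {d10:.3g} tokens")
```

With ten times the compute, the model grows by a larger factor than the dataset under these assumptions, which is the pattern the exponents describe.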

One significant insight is that, for a given amount of compute, it’s often better to train a larger model for fewer steps than a smaller model for more steps. Suppose you have a fixed compute budget $C$ and are considering two configurations, $(N_1, D_1)$ and $(N_2, D_2)$, with $N_2 > N_1$. Using the common approximation $C \approx 6ND$, a fixed budget means $N_1 D_1 \approx N_2 D_2$, so the larger model necessarily sees fewer tokens ($D_2 < D_1$). The scaling laws suggest that, up to the compute-optimal point, the configuration with the larger model $(N_2)$ will still reach lower loss, because larger models are more sample-efficient and extract more from each token. Past that point the larger model becomes data-starved and the trade reverses, which is why finding the optimal allocation matters.

Future models will likely be much larger, but the rate of increase in parameter count might be tempered by the need for massive datasets and compute budgets that grow even faster. The push will be towards finding the most efficient point on the scaling curve for a given resource constraint.

The next challenge is understanding how these scaling laws interact with different model architectures and training methodologies.
