Scaling laws are empirical relationships showing that neural network performance improves as a predictable power law function of data size, model size, or compute budget. For GNNs, this means you can forecast how much a 10x increase in training data or a 4x increase in hidden dimension will improve your model before actually running the experiment. This turns model development from guesswork into planning.
The power law relationship
Across many domains (language, vision, and now graphs), performance follows the same general form:
# The scaling law relationship
# L(N) = a * N^(-alpha) + L_inf
#
# L(N) = loss at scale N
# N = data size, model size, or compute
# a = scaling coefficient
# alpha = scaling exponent (0.05 to 0.5 typically)
# L_inf = irreducible loss (noise floor)
# Example: data scaling for node classification
# With 10K nodes: AUROC = 0.72
# With 100K nodes: AUROC = 0.78
# With 1M nodes: AUROC = 0.83
# Each 10x data increase yields ~5-6% improvement
def predict_performance(current_n, current_perf, target_n, alpha=0.1):
    """Rough power-law extrapolation of performance to a larger scale."""
    ratio = target_n / current_n
    improvement = (ratio ** alpha - 1) * current_perf
    return current_perf + improvement

The power law means returns diminish but never stop: every doubling of data or model size yields a consistent percentage improvement.
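To estimate alpha from your own measurements, fit a line in log-log space. A minimal sketch using the illustrative AUROC numbers from the node-classification example above, treating 1 - AUROC as the loss and assuming L_inf ≈ 0 for simplicity:

```python
import numpy as np

# Illustrative points from the data-scaling example above
n = np.array([1e4, 1e5, 1e6])         # training nodes
auroc = np.array([0.72, 0.78, 0.83])  # observed AUROC
loss = 1.0 - auroc                    # treat classification error as the loss

# L(N) = a * N^alpha is linear in log-log space: log L = log a + alpha * log N
alpha, log_a = np.polyfit(np.log(n), np.log(loss), 1)
print(f"fitted alpha = {alpha:.3f}")  # ~ -0.11

# Extrapolate the fitted curve to 10M nodes
loss_10m = np.exp(log_a + alpha * np.log(1e7))
print(f"predicted AUROC at 10M nodes = {1 - loss_10m:.3f}")
```

The same three-point fit also flags when the power law is a bad model: if the points do not fall on a line in log-log space, extrapolating from them is unreliable.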
Three scaling axes for GNNs
Data scaling: more nodes and edges
Adding more data to a graph means more nodes (entities), more edges (relationships), or both. For relational data, this is the most efficient scaling axis. Each new row in a table adds a node with its full relational context. The marginal information per additional node is high because it brings its neighborhood connections.
On RelBench, going from 1M to 10M training rows improves AUROC by 3-5% across tasks, with the improvement consistent and predictable from the scaling curve fit.
Model scaling: wider and deeper
GNN model scaling has an important asymmetry:
- Width scaling (hidden dimension): Follows standard power law. Going from 64 to 256 hidden dimension yields consistent improvement. No ceiling until compute becomes limiting.
- Depth scaling (number of layers): Hits over-smoothing after 3-5 layers for standard message passing. Graph transformers push this ceiling higher, but it still exists. This makes depth a fundamentally different axis than in language models, where adding layers keeps helping.
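The over-smoothing ceiling can be seen directly in a toy setting: repeated neighborhood averaging drives all node representations toward the same vector. A sketch with a made-up graph and random features, using plain matrix products rather than any GNN library:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy graph: a ring of 8 nodes with self-loops, row-normalized so that
# multiplying by A averages each node's features with its neighbors'.
n = 8
A = np.eye(n)
for i in range(n):
    A[i, (i - 1) % n] = A[i, (i + 1) % n] = 1.0
A /= A.sum(axis=1, keepdims=True)

X = rng.standard_normal((n, 4))  # random node features

# One "layer" = one round of neighbor averaging (message passing with no
# learned weights). The spread of features across nodes collapses with depth.
spreads = []
for layer in range(1, 11):
    X = A @ X
    spreads.append(X.std(axis=0).mean())
    print(f"layer {layer:2d}: mean feature std across nodes = {spreads[-1]:.4f}")
```

By layer 10 the spread is a small fraction of its layer-1 value: nodes have become nearly indistinguishable, which is why stacking more standard message-passing layers stops paying off.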
Compute scaling: more training steps
Given fixed data and model size, more training steps improve performance up to the convergence point. The optimal allocation balances data, model, and compute according to Chinchilla-like scaling relationships.
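A Chinchilla-like allocation can be sketched as power laws in the compute budget; the exponents below are illustrative assumptions, not measured GNN values, and should be fit from your own scaling runs:

```python
def compute_optimal_split(compute_budget, a=0.5, b=0.5):
    """Toy Chinchilla-style allocation: params ~ C^a, training rows ~ C^b.
    The exponents a and b are illustrative assumptions, not fitted values."""
    return compute_budget ** a, compute_budget ** b

# Under a = b = 0.5, quadrupling compute doubles both model size and data
p1, d1 = compute_optimal_split(1.0)
p2, d2 = compute_optimal_split(4.0)
print(p2 / p1, d2 / d1)  # 2.0 2.0
```

The point of the exercise is the balance: spending a larger budget entirely on one axis (all model, no data) leaves you off the compute-optimal frontier.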
Practical applications of scaling laws
Scaling laws answer four critical questions:
- Should I collect more data? If your scaling curve shows alpha = 0.2 for data, doubling the data cuts the reducible loss by a factor of 2^(-0.2), about 13%. If that is worth the collection cost, do it.
- Should I use a bigger model? If you are in the model-limited regime (loss flattens when training longer but improves with more parameters), scale up the model.
- How much will this cost? Scaling curves let you estimate the compute budget needed to reach a target performance level before starting the experiment.
- When do I stop? As you approach L_inf (the irreducible noise floor), further scaling yields diminishing returns. The scaling curve tells you when you are within 90% of the achievable performance.
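The first and last questions reduce to arithmetic on the fitted curve. A sketch in which every constant (a, alpha, L_inf) is an illustrative placeholder for values from your own fit:

```python
import math

def loss_at(n, a=1.0, alpha=0.2, l_inf=0.05):
    """L(N) = a * N^(-alpha) + L_inf; all constants here are illustrative."""
    return a * n ** (-alpha) + l_inf

# Q1: each doubling of data removes a fixed fraction of the reducible loss
alpha = 0.2
per_doubling = 1 - 2 ** (-alpha)
print(f"reducible-loss reduction per doubling: {per_doubling:.1%}")  # ~12.9%

# Q4: doublings needed before reducible loss drops below 10% of today's
doublings = math.log(0.1) / math.log(2 ** (-alpha))
print(f"doublings to capture 90% of the remaining gain: {doublings:.1f}")  # ~16.6

print(f"loss at N=1e6 under these constants: {loss_at(1e6):.3f}")  # 0.113
```

Note how slowly the curve approaches the floor: even with a healthy alpha of 0.2, reaching 90% of the remaining achievable gain takes roughly 17 doublings, which is the quantitative version of "know when to stop."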
Scaling laws for foundation models
Foundation models bet heavily on scaling laws. The argument is: pre-training on 100M rows costs $X but yields a model that transfers to N tasks, each of which would cost $Y to build individually. If N * Y > X, the foundation model is more cost-effective.
Scaling laws make this calculation concrete. KumoRFM's performance scales predictably with pre-training data size, and the improvement transfers proportionally to downstream tasks. This makes the pre-training investment quantifiable.