Scaling laws are empirical relationships showing that neural network performance improves as a predictable power law function of data size, model size, or compute budget. For GNNs, this means you can forecast how much a 10x increase in training data or a 4x increase in hidden dimension will improve your model before actually running the experiment. This turns model development from guesswork into planning.
The power law relationship
Across many domains (language, vision, and now graphs), performance follows the same general form:
# The scaling law relationship
# L(N) = a * N^(-alpha) + L_inf
#
# L(N) = loss at scale N
# N = data size, model size, or compute
# a = scaling coefficient
# alpha = scaling exponent (0.05 to 0.5 typically)
# L_inf = irreducible loss (noise floor)
# Example: data scaling for node classification
# With 10K nodes: AUROC = 0.72
# With 100K nodes: AUROC = 0.78
# With 1M nodes: AUROC = 0.83
# Each 10x data increase yields ~5-6% improvement
def predict_performance(current_n, current_perf, target_n, alpha=0.1):
    """Rough power-law extrapolation of performance to a larger scale."""
    ratio = target_n / current_n
    improvement = (ratio ** alpha - 1) * current_perf
    return current_perf + improvement

The power law means returns diminish but never stop: every doubling of data or model size yields a consistent percentage improvement.
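To estimate alpha from your own measurements, fit a line in log-log space. A minimal sketch using the illustrative AUROC numbers from the node-classification example above, treating 1 - AUROC as the loss and assuming L_inf ≈ 0 for simplicity:

```python
import numpy as np

# Illustrative points from the data-scaling example above
n = np.array([1e4, 1e5, 1e6])         # training nodes
auroc = np.array([0.72, 0.78, 0.83])  # observed AUROC
loss = 1.0 - auroc                    # treat classification error as the loss

# L(N) = a * N^alpha is linear in log-log space: log L = log a + alpha * log N
alpha, log_a = np.polyfit(np.log(n), np.log(loss), 1)
print(f"fitted alpha = {alpha:.3f}")  # ~ -0.11

# Extrapolate the fitted curve to 10M nodes
loss_10m = np.exp(log_a + alpha * np.log(1e7))
print(f"predicted AUROC at 10M nodes = {1 - loss_10m:.3f}")
```

The same three-point fit also flags when the power law is a bad model: if the points do not fall on a line in log-log space, extrapolating from them is unreliable.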
Three scaling axes for GNNs
Data scaling: more nodes and edges
Adding more data to a graph means more nodes (entities), more edges (relationships), or both. For relational data, this is the most efficient scaling axis. Each new row in a table adds a node with its full relational context. The marginal information per additional node is high because it brings its neighborhood connections.
On RelBench, going from 1M to 10M training rows improves AUROC by 3-5% across tasks, with the improvement consistent and predictable from the scaling curve fit.
Model scaling: wider and deeper
GNN model scaling has an important asymmetry:
- Width scaling (hidden dimension): Follows standard power law. Going from 64 to 256 hidden dimension yields consistent improvement. No ceiling until compute becomes limiting.
- Depth scaling (number of layers): Hits over-smoothing after 3-5 layers for standard message passing. Graph transformers push this ceiling higher, but it still exists. This makes depth a fundamentally different axis than in language models, where adding layers keeps helping.
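The over-smoothing ceiling can be seen directly in a toy setting: repeated neighborhood averaging drives all node representations toward the same vector. A sketch with a made-up graph and random features, using plain matrix products rather than any GNN library:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy graph: a ring of 8 nodes with self-loops, row-normalized so that
# multiplying by A averages each node's features with its neighbors'.
n = 8
A = np.eye(n)
for i in range(n):
    A[i, (i - 1) % n] = A[i, (i + 1) % n] = 1.0
A /= A.sum(axis=1, keepdims=True)

X = rng.standard_normal((n, 4))  # random node features

# One "layer" = one round of neighbor averaging (message passing with no
# learned weights). The spread of features across nodes collapses with depth.
spreads = []
for layer in range(1, 11):
    X = A @ X
    spreads.append(X.std(axis=0).mean())
    print(f"layer {layer:2d}: mean feature std across nodes = {spreads[-1]:.4f}")
```

By layer 10 the spread is a small fraction of its layer-1 value: nodes have become nearly indistinguishable, which is why stacking more standard message-passing layers stops paying off.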
Compute scaling: more training steps
Given fixed data and model size, more training steps improve performance up to the convergence point. The optimal allocation balances data, model, and compute according to Chinchilla-like scaling relationships.
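A Chinchilla-like allocation can be sketched as power laws in the compute budget; the exponents below are illustrative assumptions, not measured GNN values, and should be fit from your own scaling runs:

```python
def compute_optimal_split(compute_budget, a=0.5, b=0.5):
    """Toy Chinchilla-style allocation: params ~ C^a, training rows ~ C^b.
    The exponents a and b are illustrative assumptions, not fitted values."""
    return compute_budget ** a, compute_budget ** b

# Under a = b = 0.5, quadrupling compute doubles both model size and data
p1, d1 = compute_optimal_split(1.0)
p2, d2 = compute_optimal_split(4.0)
print(p2 / p1, d2 / d1)  # 2.0 2.0
```

The point of the exercise is the balance: spending a larger budget entirely on one axis (all model, no data) leaves you off the compute-optimal frontier.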
Practical applications of scaling laws
Scaling laws answer four critical questions:
- Should I collect more data? If your scaling curve shows alpha = 0.2 for data, doubling the data cuts the reducible loss by a factor of 2^(-0.2), about 13%. If that is worth the collection cost, do it.
- Should I use a bigger model? If you are in the model-limited regime (loss flattens when training longer but improves with more parameters), scale up the model.
- How much will this cost? Scaling curves let you estimate the compute budget needed to reach a target performance level before starting the experiment.
- When do I stop? As you approach L_inf (the irreducible noise floor), further scaling yields diminishing returns. The scaling curve tells you when you are within 90% of the achievable performance.
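The first and last questions reduce to arithmetic on the fitted curve. A sketch in which every constant (a, alpha, L_inf) is an illustrative placeholder for values from your own fit:

```python
import math

def loss_at(n, a=1.0, alpha=0.2, l_inf=0.05):
    """L(N) = a * N^(-alpha) + L_inf; all constants here are illustrative."""
    return a * n ** (-alpha) + l_inf

# Q1: each doubling of data removes a fixed fraction of the reducible loss
alpha = 0.2
per_doubling = 1 - 2 ** (-alpha)
print(f"reducible-loss reduction per doubling: {per_doubling:.1%}")  # ~12.9%

# Q4: doublings needed before reducible loss drops below 10% of today's
doublings = math.log(0.1) / math.log(2 ** (-alpha))
print(f"doublings to capture 90% of the remaining gain: {doublings:.1f}")  # ~16.6

print(f"loss at N=1e6 under these constants: {loss_at(1e6):.3f}")  # 0.113
```

Note how slowly the curve approaches the floor: even with a healthy alpha of 0.2, reaching 90% of the remaining achievable gain takes roughly 17 doublings, which is the quantitative version of "know when to stop."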
Scaling laws for foundation models
Foundation models bet heavily on scaling laws. The argument is: pre-training on 100M rows costs $X but yields a model that transfers to N tasks, each of which would cost $Y to build individually. If N * Y > X, the foundation model is more cost-effective.
Scaling laws make this calculation concrete. KumoRFM's performance scales predictably with pre-training data size, and the improvement transfers proportionally to downstream tasks. This makes the pre-training investment quantifiable.