
Scaling Laws: How GNN Performance Improves Predictably with More Data

More data, bigger models, more compute. Scaling laws tell you how much each investment yields. For GNNs, the surprising finding is that increasing graph size (more nodes and edges) improves performance more efficiently than increasing model depth.

PyTorch Geometric

TL;DR

  • Scaling laws show that GNN performance improves as a power law: L(N) = a * N^(-alpha), where N is data/model size. This makes performance predictable before committing resources.
  • Three scaling axes: data scaling (more nodes/edges), model scaling (wider/deeper), and compute scaling (more training steps/GPUs). Each follows a distinct power law exponent.
  • GNNs scale more efficiently with graph size than with depth. Adding more data (nodes, edges, tables) consistently helps. Adding more layers hits over-smoothing after 3-5 layers.
  • Width scaling (hidden dimension) follows language-model-like laws. Doubling hidden size yields consistent, predictable improvement until compute becomes the bottleneck.
  • For foundation models, scaling laws justify the enormous pre-training cost: the investment yields predictable returns across all downstream tasks.

Scaling laws are empirical relationships showing that neural network performance improves as a predictable power law function of data size, model size, or compute budget. For GNNs, this means you can forecast how much a 10x increase in training data or a 4x increase in hidden dimension will improve your model before actually running the experiment. This turns model development from guesswork into planning.

The power law relationship

Across many domains (language, vision, and now graphs), performance follows the same general form:

scaling_law.py
# The scaling law relationship
# L(N) = a * N^(-alpha) + L_inf
#
# L(N)   = loss at scale N
# N      = data size, model size, or compute
# a      = scaling coefficient
# alpha  = scaling exponent (0.05 to 0.5 typically)
# L_inf  = irreducible loss (noise floor)

# Example: data scaling for node classification
# With 10K nodes:  AUROC = 0.72
# With 100K nodes: AUROC = 0.78
# With 1M nodes:   AUROC = 0.83
# Each 10x data increase adds ~5-6 AUROC points

def predict_performance(current_n, current_perf, target_n, alpha=0.1):
    """Predict a bounded metric (e.g. AUROC) at a larger scale.

    Models the reducible error (1 - perf) as a power law a * N**(-alpha),
    so scaling N by a factor r shrinks the error by a factor r**(-alpha).
    """
    ratio = target_n / current_n
    error = 1.0 - current_perf
    return 1.0 - error * ratio ** (-alpha)

# Reproduces the illustrative numbers above:
# predict_performance(1e4, 0.72, 1e5)  -> ~0.78
# predict_performance(1e4, 0.72, 1e6)  -> ~0.82

The power law means returns diminish but never stop. Every doubling of data or model size yields a consistent percentage improvement.
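A short numeric sketch makes this concrete: under the power law, each doubling of N multiplies the reducible loss (the part above the noise floor) by the same constant factor 2^(-alpha). The constants below are illustrative assumptions, not fitted values.

```python
# Illustrative power-law loss curve: L(N) = a * N**(-alpha) + L_inf.
# alpha, a, and L_inf here are made-up constants for demonstration.
alpha = 0.1
a = 1.0        # scaling coefficient
L_inf = 0.05   # irreducible loss (noise floor)

def loss(n):
    """Loss at scale n under the power law."""
    return a * n ** (-alpha) + L_inf

for n in [1_000, 2_000, 4_000, 8_000]:
    reducible = loss(n) - L_inf
    print(f"N={n:>5}: loss={loss(n):.4f}, reducible={reducible:.4f}")
# Each doubling multiplies the reducible part by the same factor, 2**(-0.1).
```

Note that the constant factor applies to the reducible loss, not the total loss: as you approach L_inf, the same relative shrinkage buys less and less absolute improvement.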

Three scaling axes for GNNs

Data scaling: more nodes and edges

Adding more data to a graph means more nodes (entities), more edges (relationships), or both. For relational data, this is the most efficient scaling axis. Each new row in a table adds a node with its full relational context. The marginal information per additional node is high because it brings its neighborhood connections.

On RelBench, going from 1M to 10M training rows improves AUROC by 3-5% across tasks, with the improvement consistent and predictable from the scaling curve fit.
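A scaling curve fit of this kind is a linear regression in log-log space. The sketch below reuses the illustrative AUROC figures from the code block earlier (these are example numbers, not RelBench results) and models the reducible error 1 - AUROC as a power law:

```python
import numpy as np

# Fit a power-law exponent from a few observed (size, metric) points.
# The AUROC values reuse the illustrative figures above; treat them as examples.
n = np.array([1e4, 1e5, 1e6])          # training rows
auroc = np.array([0.72, 0.78, 0.83])   # observed performance
error = 1.0 - auroc                     # model error as a * N**(-alpha)

# In log-log space the power law is a line: log(error) = log(a) - alpha * log(N)
slope, intercept = np.polyfit(np.log10(n), np.log10(error), 1)
alpha = -slope
print(f"fitted alpha ~ {alpha:.3f}")    # ~0.1 for these numbers

# Extrapolate the fitted line to 10M rows
pred_error = 10 ** (intercept + slope * np.log10(1e7))
print(f"predicted AUROC at 10M rows ~ {1 - pred_error:.3f}")
```

Three points is the bare minimum for a fit like this; in practice you would train at several scales spaced evenly in log space and check that the points actually lie on a line before trusting the extrapolation.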

Model scaling: wider and deeper

GNN model scaling has an important asymmetry:

  • Width scaling (hidden dimension): Follows standard power law. Going from 64 to 256 hidden dimension yields consistent improvement. No ceiling until compute becomes limiting.
  • Depth scaling (number of layers): Hits over-smoothing after 3-5 layers for standard message passing. Graph transformers push this ceiling higher, but it still exists. In this respect, GNN depth scaling differs fundamentally from depth scaling in language models.

Compute scaling: more training steps

Given fixed data and model size, more training steps improve performance up to the convergence point. The optimal allocation balances data, model, and compute according to Chinchilla-like scaling relationships.
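As a rough sketch of what a Chinchilla-like allocation means: the language-model result is that for a compute budget C ~ 6 * N * D (N parameters, D training examples), the compute-optimal split grows N and D together, each roughly as sqrt(C). The 6*N*D cost model and the 0.5 exponents below come from language-model studies and are assumptions here; the exponents for GNNs would need their own empirical fit.

```python
# Chinchilla-style compute allocation sketch (language-model heuristic,
# assumed here; GNN-specific exponents would need their own measurement).
def optimal_allocation(compute_budget, k=1.0):
    """Return (params, examples) under the N ~ D ~ sqrt(C / 6) heuristic.

    k shifts the balance between parameters and examples; k = 1 splits evenly.
    """
    root = (compute_budget / 6) ** 0.5
    params = k * root
    examples = root / k
    return params, examples

n, d = optimal_allocation(6e18)
print(f"params ~ {n:.2e}, examples ~ {d:.2e}")
```

The practical takeaway is the shape, not the constants: doubling compute should roughly multiply both model size and data size by sqrt(2), rather than pouring the entire budget into one axis.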

Practical applications of scaling laws

Scaling laws answer four critical questions:

  1. Should I collect more data? If your scaling curve shows alpha = 0.2 for data, doubling data cuts the reducible loss by about 13% (since 2^(-0.2) ≈ 0.87). If that is worth the collection cost, do it.
  2. Should I use a bigger model? If you are in the model-limited regime (loss flattens when training longer but improves with more parameters), scale up the model.
  3. How much will this cost? Scaling curves let you estimate the compute budget needed to reach a target performance level before starting the experiment.
  4. When do I stop? As you approach L_inf (the irreducible noise floor), further scaling yields diminishing returns. The scaling curve tells you when you are within 90% of the achievable performance.
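Question 4 can be answered in closed form once the curve is fitted: solving a * N^(-alpha) = f * a * N0^(-alpha) gives the scale at which the reducible loss has shrunk to a fraction f of its current value. The constants below are illustrative assumptions.

```python
# Given a fitted power law L(N) = a * N**(-alpha) + L_inf, find the scale
# where the reducible loss drops to a target fraction of its value at the
# current scale n0. alpha here is an illustrative assumption.
alpha = 0.1

def scale_for_fraction(n0, fraction):
    """Scale N at which (L(N) - L_inf) = fraction * (L(n0) - L_inf).

    From a * N**(-alpha) = fraction * a * n0**(-alpha):
        N = n0 * fraction**(-1 / alpha)
    """
    return n0 * fraction ** (-1.0 / alpha)

# With alpha = 0.1, cutting the reducible loss 10x requires 10**10 more data:
print(f"{scale_for_fraction(1e5, 0.1):.2e}")
```

The small exponent is the whole story: with alpha = 0.1, squeezing out the last slice of reducible loss requires astronomically more data, which is exactly the "when do I stop" signal the scaling curve provides.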

Scaling laws for foundation models

Foundation models bet heavily on scaling laws. The argument is: pre-training on 100M rows costs $X but yields a model that transfers to N tasks, each of which would cost $Y to build individually. If N * Y > X, the foundation model is more cost-effective.

Scaling laws make this calculation concrete. KumoRFM's performance scales predictably with pre-training data size, and the improvement transfers proportionally to downstream tasks. This makes the pre-training investment quantifiable.
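The break-even comparison in the argument above is simple enough to write down directly. The dollar figures below are made-up placeholders, not actual pre-training or task costs:

```python
# Break-even sketch for the foundation-model argument in the text:
# pre-training cost X versus N per-task models at cost Y each.
# All figures are hypothetical placeholders.
def foundation_model_pays_off(pretrain_cost, per_task_cost, num_tasks):
    """True when N * Y > X, i.e. amortized pre-training beats per-task builds."""
    return num_tasks * per_task_cost > pretrain_cost

print(foundation_model_pays_off(1_000_000, 50_000, 30))  # 30 tasks at $50K vs $1M
```

The scaling-law contribution is making X quantifiable: the pre-training cost needed to hit a target quality level can be read off the fitted curve rather than guessed.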

Frequently asked questions

What are scaling laws for graph neural networks?

Scaling laws are empirical relationships showing that GNN performance (measured by loss or accuracy) improves as a power law function of dataset size, model size, or compute budget. Doubling data might improve AUROC by a predictable amount, allowing you to forecast performance before committing resources.

Do GNNs follow the same scaling laws as language models?

The general pattern is similar (power law improvement), but the exponents differ. GNNs tend to scale more favorably with graph size (more nodes and edges) than with model depth (more layers), due to over-smoothing. Width scaling (hidden dimension) follows similar laws to language models.

How can scaling laws guide model development decisions?

Scaling laws tell you whether to invest in more data or a bigger model. If you are in the data-limited regime, collecting more training data is more cost-effective than increasing model size. If you are in the model-limited regime, scaling up architecture is the right move. Scaling curves make this visible before you commit resources.

Learn more about graph ML

PyTorch Geometric is the open-source foundation for graph neural networks. Explore more layers, concepts, and production patterns.