
Benchmark Evaluation: How to Properly Evaluate GNN Models

The difference between a publishable result and a reliable one is evaluation rigor. Right metrics, right splits, right baselines, and right confidence intervals. Here is how to evaluate GNNs without fooling yourself.

PyTorch Geometric

TL;DR

  1. Choose metrics by task: AUROC for classification, MRR/Hits@K for link prediction, MAE for regression. Never use accuracy on imbalanced data. Always report confidence intervals from 5+ runs.
  2. Use standardized splits: OGB provides fixed splits. For temporal tasks, use temporal splits. Random splits inflate results by 5-15% on time-dependent data.
  3. Include meaningful baselines: MLP (no graph structure), simple GNN (2-layer GCN), and the best published method. The MLP baseline measures the value added by graph structure.
  4. Standard benchmarks: OGB for general graph tasks, RelBench for relational databases, MoleculeNet for molecules. Use official evaluation protocols for comparable results.
  5. Report what matters: mean +/- std over runs, training time, model size, and comparison to non-graph baselines. A 1% improvement that requires 100x compute is not always worth it.

Proper GNN evaluation requires the right metrics, appropriate data splits, meaningful baselines, and statistical rigor. The GNN literature is littered with inflated results from inappropriate evaluation: random splits on temporal data, accuracy on imbalanced classes, single-run results without confidence intervals, and missing non-graph baselines. Here is how to evaluate GNNs correctly.

Choosing the right metric

Node classification

  • Balanced classes: Accuracy is fine. Report per-class accuracy to catch classes where the model fails.
  • Imbalanced classes: AUROC (threshold-independent ranking quality), AUPRC (precision-recall for the minority class), and F1 at optimal threshold. Never report accuracy alone.
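The imbalanced case can be sketched with scikit-learn's metric functions; the `imbalanced_metrics` helper and its 0.5 threshold are illustrative choices, not a fixed API:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, f1_score

def imbalanced_metrics(y_true, y_score, threshold=0.5):
    """Threshold-independent ranking metrics plus F1 at a fixed threshold."""
    return {
        "auroc": roc_auc_score(y_true, y_score),   # ranking quality
        "auprc": average_precision_score(y_true, y_score),  # minority-class PR
        "f1": f1_score(y_true, (y_score >= threshold).astype(int)),
    }
```

In practice the threshold for F1 should be tuned on a validation set, not fixed at 0.5.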

Link prediction

  • MRR (Mean Reciprocal Rank): For each positive edge, where does it rank among negative candidates? Higher is better.
  • Hits@K: What fraction of positive edges appear in the top K predictions? Practical for recommendation (top-10, top-50).
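Both metrics are simple to compute once each positive edge has been scored against its own negative candidates; a minimal NumPy sketch (the `mrr_and_hits` helper is illustrative, and ties are counted pessimistically, against the positive):

```python
import numpy as np

def mrr_and_hits(pos_scores, neg_scores, k=10):
    """pos_scores: (P,) score of each true edge.
    neg_scores: (P, N) scores of that edge's negative candidates."""
    # Rank = 1 + number of negatives scored at least as high as the positive
    ranks = 1 + (neg_scores >= pos_scores[:, None]).sum(axis=1)
    return {
        "mrr": float((1.0 / ranks).mean()),
        "hits@k": float((ranks <= k).mean()),
    }
```

Results depend heavily on how negatives are sampled, so always state the negative-sampling protocol alongside the numbers.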

Graph classification / regression

  • Classification: AUROC (binary), or macro-averaged F1 (multi-class). Use AUPRC for severely imbalanced molecular tasks.
  • Regression: MAE or RMSE. Report both; they penalize outliers differently.
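A tiny worked example shows why reporting both matters (helper names are illustrative):

```python
import numpy as np

def mae(y_true, y_pred):
    return float(np.abs(y_true - y_pred).mean())

def rmse(y_true, y_pred):
    return float(np.sqrt(((y_true - y_pred) ** 2).mean()))

# Four small errors and one outlier: MAE weights errors linearly,
# while RMSE squares them, so the single outlier dominates.
y_true = np.zeros(5)
y_pred = np.array([1.0, 1.0, 1.0, 1.0, 6.0])
# mae(y_true, y_pred) == 2.0; rmse(y_true, y_pred) == sqrt(8) ~ 2.83
```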

Data splits

proper_splits.py
# BAD: Random split on temporal data
train, test = random_split(data, [0.8, 0.2])
# Result: inflated by 5-15% from temporal leakage

# GOOD: Temporal split for temporal tasks
train = data[data.time < cutoff]
test = data[data.time >= cutoff]

# GOOD: Use OGB official splits for benchmarking
from ogb.nodeproppred import PygNodePropPredDataset
dataset = PygNodePropPredDataset(name='ogbn-arxiv')
split_idx = dataset.get_idx_split()  # official temporal split

# GOOD: Multiple temporal splits for robustness
window = 30  # evaluation window length, in the same units as data.time
for cutoff in [march, april, may, june]:
    train = data[data.time < cutoff]
    test = data[(data.time >= cutoff) & (data.time < cutoff + window)]
    # Evaluate and collect results
# Report: mean +/- std across windows

Use temporal splits for temporal data, OGB official splits for benchmarks, and multiple splits for robustness. Random splits are only valid for non-temporal tasks.

Meaningful baselines

Every GNN evaluation should include:

  1. MLP baseline: Same features, no graph structure. This measures the value added by the graph. If the MLP is within 2% of the GNN, the graph structure is not helping.
  2. Simple GNN baseline: 2-layer GCN with default hyperparameters. This establishes a minimum graph-aware baseline. If your complex architecture barely beats this, the complexity is not justified.
  3. Label propagation: A non-neural baseline that propagates labels through edges. Strong on high-homophily graphs, providing a non-ML reference point.
  4. Best published method: The current state-of-the-art on the specific benchmark. Check OGB leaderboards for up-to-date numbers.
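The MLP-vs-GNN comparison can be sketched in miniature. Here a single mean-over-neighbors aggregation stands in for a real 2-layer GCN, and `graph_adds_value` encodes the ~2% rule of thumb from item 1; all names and the margin are illustrative:

```python
import numpy as np

def mlp_predict(x, w):
    """MLP stand-in: features only, no graph structure."""
    return (x @ w).argmax(axis=1)

def gnn_predict(x, w, adj):
    """Toy graph-aware baseline: one mean-over-neighbors aggregation
    (self-loops included), standing in for a 2-layer GCN."""
    adj = adj + np.eye(len(adj))
    h = (adj @ x) / adj.sum(axis=1, keepdims=True)
    return (h @ w).argmax(axis=1)

def graph_adds_value(acc_gnn, acc_mlp, margin=0.02):
    """If the GNN is within `margin` of the MLP, the graph is not helping."""
    return (acc_gnn - acc_mlp) > margin
```

Running both predictors on the same features and reporting the gap makes the contribution of the edges explicit, rather than implied.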

Standard benchmarks

  • OGB (Open Graph Benchmark): Node (ogbn-arxiv, ogbn-products, ogbn-proteins), link (ogbl-collab, ogbl-citation2), and graph (ogbg-molhiv, ogbg-molpcba) tasks with standardized splits and evaluation.
  • RelBench: 7 relational databases, 30 prediction tasks, temporal splits. The standard for evaluating on enterprise-style relational data.
  • MoleculeNet: Suite of molecular property prediction tasks. Standard for drug discovery GNNs.
  • TU Datasets: Small graph classification datasets (MUTAG, PROTEINS, NCI1). Good for quick experiments, limited for serious evaluation.

Statistical rigor

  • Run each experiment with 5-10 different random seeds
  • Report mean and standard deviation
  • Use paired statistical tests (paired t-test or Wilcoxon) to claim improvements
  • A 0.3% improvement with 0.5% standard deviation is not significant
  • Report training time and model size alongside accuracy
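The checklist above can be sketched with SciPy's paired Wilcoxon test; the `compare_models` helper is illustrative and assumes both models were evaluated on the same seeds and splits, so the per-seed scores are paired:

```python
import numpy as np
from scipy.stats import wilcoxon

def compare_models(scores_a, scores_b, alpha=0.05):
    """Paired per-seed comparison of two models' scores."""
    a, b = np.asarray(scores_a), np.asarray(scores_b)
    stat, p = wilcoxon(a, b)  # non-parametric paired test
    return {
        "mean_a": a.mean(), "std_a": a.std(ddof=1),
        "mean_b": b.mean(), "std_b": b.std(ddof=1),
        "p_value": float(p),
        "significant": bool(p < alpha),
    }
```

With fewer than ~6 paired runs the test has almost no power, which is another reason to use 5-10 seeds.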

Frequently asked questions

What metrics should I use to evaluate GNNs?

Depends on the task. Node classification: accuracy (balanced) or AUROC (imbalanced). Link prediction: MRR (mean reciprocal rank) or Hits@K. Graph classification: AUROC or AUPRC. Regression: MAE or RMSE. Always report confidence intervals from multiple runs.

What are the standard GNN benchmarks?

OGB (Open Graph Benchmark) for standard graph tasks: ogbn-arxiv, ogbn-products, ogbn-proteins for nodes; ogbl-collab, ogbl-citation2 for links; ogbg-molhiv, ogbg-molpcba for graphs. RelBench for relational database tasks. TU Datasets for small graph classification. MoleculeNet for molecular properties.

Why do GNN papers report different results on the same dataset?

Different random seeds, different data splits (random vs temporal), different hyperparameter tuning budgets, and different evaluation protocols (transductive vs inductive). OGB's standardized splits and evaluators make comparisons reliable; always compare using the official OGB evaluation protocol.

Learn more about graph ML

PyTorch Geometric is the open-source foundation for graph neural networks. Explore more layers, concepts, and production patterns.