Proper GNN evaluation requires the right metrics, appropriate data splits, meaningful baselines, and statistical rigor. The GNN literature is littered with inflated results from inappropriate evaluation: random splits on temporal data, accuracy on imbalanced classes, single-run results without confidence intervals, and missing non-graph baselines. Here is how to evaluate GNNs correctly.
Choosing the right metric
Node classification
- Balanced classes: Accuracy is fine. Report per-class accuracy to catch classes where the model fails.
- Imbalanced classes: AUROC (threshold-independent ranking quality), AUPRC (precision-recall trade-off for the minority class), and F1 at a threshold tuned on validation data (tuning the threshold on the test set leaks information). Never report accuracy alone.
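The three imbalanced-class metrics above can all be computed with scikit-learn. A minimal sketch on synthetic scores (the score distribution here is invented purely for illustration):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, f1_score

# Synthetic 10%-positive node classification task with noisy but informative scores
rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.10).astype(int)
y_score = 0.3 * y_true + 0.7 * rng.random(1000)

auroc = roc_auc_score(y_true, y_score)
auprc = average_precision_score(y_true, y_score)  # average precision approximates AUPRC

# Sweep thresholds for F1; in practice, tune this on a validation set, not on test data
thresholds = np.linspace(0.05, 0.95, 19)
best_f1 = max(f1_score(y_true, (y_score >= t).astype(int)) for t in thresholds)

print(f"AUROC={auroc:.3f}  AUPRC={auprc:.3f}  best F1={best_f1:.3f}")
```

On a dataset this skewed, accuracy alone would look deceptively high: predicting all-negative already scores roughly 90%.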
Link prediction
- MRR (Mean Reciprocal Rank): For each positive edge, where does it rank among negative candidates? Higher is better.
- Hits@K: What fraction of positive edges appear in the top K predictions? Practical for recommendation (top-10, top-50).
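Both ranking metrics are simple to compute once positive and negative edge scores are in hand. A sketch assuming each positive is ranked against one shared pool of sampled negatives (the function name and tie-handling convention are this sketch's choices, not a standard API):

```python
import numpy as np

def mrr_and_hits(pos_scores, neg_scores, k=10):
    """Rank each positive edge's score against a shared pool of negative scores.

    Ties are broken pessimistically: a negative scoring equal to the positive
    pushes the positive's rank down. Rank 1 is best.
    """
    pos = np.asarray(pos_scores)[:, None]   # shape (P, 1)
    neg = np.asarray(neg_scores)[None, :]   # shape (1, N)
    ranks = 1 + (neg >= pos).sum(axis=1)    # count negatives at or above each positive
    return (1.0 / ranks).mean(), (ranks <= k).mean()

# Toy example: two positives outrank all negatives, one does not
mrr, hits_at_2 = mrr_and_hits([0.9, 0.8, 0.2], [0.1, 0.3, 0.5, 0.7], k=2)
# ranks are [1, 1, 4], so MRR = (1 + 1 + 0.25) / 3 = 0.75 and Hits@2 = 2/3
```

Note that MRR and Hits@K depend heavily on how negatives are sampled; use the benchmark's official negative set (as OGB does) when comparing against published numbers.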
Graph classification / regression
- Classification: AUROC (binary), or macro-averaged F1 (multi-class). Use AUPRC for severely imbalanced molecular tasks.
- Regression: MAE or RMSE. Report both; they penalize outliers differently.
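The difference in outlier sensitivity is easy to see on a toy example: one large miss moves RMSE far more than MAE, which is exactly why reporting both is informative.

```python
import numpy as np

def mae(y, yhat):
    return np.abs(y - yhat).mean()

def rmse(y, yhat):
    return np.sqrt(((y - yhat) ** 2).mean())

y_true  = np.array([1.0, 2.0, 3.0, 4.0])
good    = np.array([1.1, 1.9, 3.1, 3.9])  # uniformly small errors
outlier = np.array([1.1, 1.9, 3.1, 8.0])  # same errors, plus one large miss

# With uniform errors, MAE and RMSE agree (both 0.1 here).
# The single outlier raises MAE to ~1.08 but RMSE to ~2.0:
# squaring makes the large residual dominate.
```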
Data splits
# BAD: Random split on temporal data
train, test = random_split(data, [0.8, 0.2])
# Result: inflated by 5-15% from temporal leakage
# GOOD: Temporal split for temporal tasks
train = data[data.time < cutoff]
test = data[data.time >= cutoff]
# GOOD: Use OGB official splits for benchmarking
from ogb.nodeproppred import PygNodePropPredDataset
dataset = PygNodePropPredDataset(name='ogbn-arxiv')
split_idx = dataset.get_idx_split() # official temporal split
# GOOD: Multiple temporal splits for robustness
for cutoff in [march, april, may, june]:  # month-start timestamps
    train = data[data.time < cutoff]
    # 30-day test window; pandas-style timestamp column assumed
    test = data[(data.time >= cutoff) & (data.time < cutoff + pd.Timedelta(days=30))]
# Evaluate and collect results
# Report: mean +/- std across windows
Use temporal splits for temporal data, OGB official splits for benchmarks, and multiple splits for robustness. Random splits are only valid for non-temporal tasks.
Meaningful baselines
Every GNN evaluation should include:
- MLP baseline: Same features, no graph structure. This measures the value added by the graph. If the MLP is within 2% of the GNN, the graph structure is not helping.
- Simple GNN baseline: 2-layer GCN with default hyperparameters. This establishes a minimum graph-aware baseline. If your complex architecture barely beats this, the complexity is not justified.
- Label propagation: A non-neural baseline that propagates training labels along edges. Strong on high-homophily graphs, and it provides a learning-free reference point.
- Best published method: The current state-of-the-art on the specific benchmark. Check OGB leaderboards for up-to-date numbers.
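Of the baselines above, label propagation is cheap enough to implement inline. A minimal dense-matrix sketch (real graphs need sparse operations, and the clamped-iteration variant shown here is one of several common formulations):

```python
import numpy as np

def label_propagation(adj, labels, train_mask, num_iters=50, alpha=0.9):
    """Diffuse one-hot training labels along edges, clamping known nodes.

    adj: (N, N) symmetric dense adjacency; labels: (N,) int class ids;
    train_mask: (N,) bool marking nodes with known labels.
    """
    n, c = len(labels), labels.max() + 1
    deg = adj.sum(1, keepdims=True).clip(min=1)
    P = adj / deg                                # row-normalized transition matrix
    y0 = np.zeros((n, c))
    y0[train_mask, labels[train_mask]] = 1.0     # seed with known labels
    y = y0.copy()
    for _ in range(num_iters):
        y = alpha * (P @ y) + (1 - alpha) * y0   # propagate, pulled back toward seeds
        y[train_mask] = y0[train_mask]           # clamp training nodes
    return y.argmax(1)

# Two triangles joined by a single edge: one seed per triangle suffices
adj = np.zeros((6, 6))
for a, b in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    adj[a, b] = adj[b, a] = 1
pred = label_propagation(adj, np.array([0, 0, 0, 1, 1, 1]),
                         np.array([True, False, False, True, False, False]))
# On this homophilous toy graph, each triangle recovers its seed's class
```

If a GNN cannot clearly beat this on a high-homophily benchmark, the neural machinery is adding little over the graph structure itself.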
Standard benchmarks
- OGB (Open Graph Benchmark): Node (ogbn-arxiv, ogbn-products, ogbn-proteins), link (ogbl-collab, ogbl-citation2), and graph (ogbg-molhiv, ogbg-molpcba) tasks with standardized splits and evaluation.
- RelBench: 7 relational databases, 30 prediction tasks, temporal splits. The standard for evaluating on enterprise-style relational data.
- MoleculeNet: Suite of molecular property prediction tasks. Standard for drug discovery GNNs.
- TU Datasets: Small graph classification datasets (MUTAG, PROTEINS, NCI1). Good for quick experiments, limited for serious evaluation.
Statistical rigor
- Run each experiment with 5-10 different random seeds
- Report mean and standard deviation
- Use paired statistical tests (paired t-test or Wilcoxon) to claim improvements
- A 0.3% improvement with 0.5% standard deviation is not significant
- Report training time and model size alongside accuracy
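The paired-test workflow above takes a few lines with SciPy. A sketch with invented per-seed accuracies (the numbers are illustrative, not from any real benchmark); pairing works because both models share the same seeds and splits:

```python
import numpy as np
from scipy.stats import ttest_rel, wilcoxon

# Hypothetical per-seed test accuracies for two models over 8 shared seeds
baseline = np.array([71.2, 70.8, 71.5, 70.9, 71.1, 71.4, 70.7, 71.3])
proposed = np.array([71.9, 71.5, 72.1, 71.6, 71.8, 72.0, 71.4, 71.9])

print(f"baseline: {baseline.mean():.2f} +/- {baseline.std(ddof=1):.2f}")
print(f"proposed: {proposed.mean():.2f} +/- {proposed.std(ddof=1):.2f}")

# Paired tests compare per-seed differences, which removes seed-to-seed variance
t_stat, t_p = ttest_rel(proposed, baseline)
w_stat, w_p = wilcoxon(proposed, baseline)
print(f"paired t-test p={t_p:.4f}, Wilcoxon p={w_p:.4f}")
```

Here the per-seed gap is consistent, so both tests agree the improvement is real; an unpaired test on the same numbers would have far less power because it ignores the seed-level pairing.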