Proper GNN evaluation requires the right metrics, appropriate data splits, meaningful baselines, and statistical rigor. The GNN literature is littered with inflated results from inappropriate evaluation: random splits on temporal data, accuracy on imbalanced classes, single-run results without confidence intervals, and missing non-graph baselines. Here is how to evaluate GNNs correctly.
Choosing the right metric
Node classification
- Balanced classes: Accuracy is fine. Report per-class accuracy to catch classes where the model fails.
- Imbalanced classes: AUROC (threshold-independent ranking quality), AUPRC (precision-recall trade-off for the minority class), and F1 at a threshold tuned on validation data (tuning the threshold on the test set leaks information). Never report accuracy alone.
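The three imbalanced-class metrics above can all be computed with scikit-learn. A minimal sketch on synthetic scores (the score distribution here is invented purely for illustration):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, f1_score

# Synthetic 10%-positive node classification task with noisy but informative scores
rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.10).astype(int)
y_score = 0.3 * y_true + 0.7 * rng.random(1000)

auroc = roc_auc_score(y_true, y_score)
auprc = average_precision_score(y_true, y_score)  # average precision approximates AUPRC

# Sweep thresholds for F1; in practice, tune this on a validation set, not on test data
thresholds = np.linspace(0.05, 0.95, 19)
best_f1 = max(f1_score(y_true, (y_score >= t).astype(int)) for t in thresholds)

print(f"AUROC={auroc:.3f}  AUPRC={auprc:.3f}  best F1={best_f1:.3f}")
```

On a dataset this skewed, accuracy alone would look deceptively high: predicting all-negative already scores roughly 90%.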
Link prediction
- MRR (Mean Reciprocal Rank): For each positive edge, where does it rank among negative candidates? Higher is better.
- Hits@K: What fraction of positive edges appear in the top K predictions? Practical for recommendation (top-10, top-50).
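Both ranking metrics are simple to compute once positive and negative edge scores are in hand. A sketch assuming each positive is ranked against one shared pool of sampled negatives (the function name and tie-handling convention are this sketch's choices, not a standard API):

```python
import numpy as np

def mrr_and_hits(pos_scores, neg_scores, k=10):
    """Rank each positive edge's score against a shared pool of negative scores.

    Ties are broken pessimistically: a negative scoring equal to the positive
    pushes the positive's rank down. Rank 1 is best.
    """
    pos = np.asarray(pos_scores)[:, None]   # shape (P, 1)
    neg = np.asarray(neg_scores)[None, :]   # shape (1, N)
    ranks = 1 + (neg >= pos).sum(axis=1)    # count negatives at or above each positive
    return (1.0 / ranks).mean(), (ranks <= k).mean()

# Toy example: two positives outrank all negatives, one does not
mrr, hits_at_2 = mrr_and_hits([0.9, 0.8, 0.2], [0.1, 0.3, 0.5, 0.7], k=2)
# ranks are [1, 1, 4], so MRR = (1 + 1 + 0.25) / 3 = 0.75 and Hits@2 = 2/3
```

Note that MRR and Hits@K depend heavily on how negatives are sampled; use the benchmark's official negative set (as OGB does) when comparing against published numbers.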
Graph classification / regression
- Classification: AUROC (binary), or macro-averaged F1 (multi-class). Use AUPRC for severely imbalanced molecular tasks.
- Regression: MAE or RMSE. Report both; they penalize outliers differently.
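The difference in outlier sensitivity is easy to see on a toy example: one large miss moves RMSE far more than MAE, which is exactly why reporting both is informative.

```python
import numpy as np

def mae(y, yhat):
    return np.abs(y - yhat).mean()

def rmse(y, yhat):
    return np.sqrt(((y - yhat) ** 2).mean())

y_true  = np.array([1.0, 2.0, 3.0, 4.0])
good    = np.array([1.1, 1.9, 3.1, 3.9])  # uniformly small errors
outlier = np.array([1.1, 1.9, 3.1, 8.0])  # same errors, plus one large miss

# With uniform errors, MAE and RMSE agree (both 0.1 here).
# The single outlier raises MAE to ~1.08 but RMSE to ~2.0:
# squaring makes the large residual dominate.
```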
Data splits
# BAD: Random split on temporal data
train, test = random_split(data, [0.8, 0.2])
# Result: inflated by 5-15% from temporal leakage
# GOOD: Temporal split for temporal tasks
train = data[data.time < cutoff]
test = data[data.time >= cutoff]
# GOOD: Use OGB official splits for benchmarking
from ogb.nodeproppred import PygNodePropPredDataset
dataset = PygNodePropPredDataset(name='ogbn-arxiv')
split_idx = dataset.get_idx_split() # official temporal split
# GOOD: Multiple temporal splits for robustness
for cutoff in [march, april, may, june]:  # month-start timestamps
    train = data[data.time < cutoff]
    # 30-day test window; pandas-style timestamp column assumed
    test = data[(data.time >= cutoff) & (data.time < cutoff + pd.Timedelta(days=30))]
# Evaluate and collect results
# Report: mean +/- std across windows
Use temporal splits for temporal data, OGB official splits for benchmarks, and multiple splits for robustness. Random splits are only valid for non-temporal tasks.
Meaningful baselines
Every GNN evaluation should include:
- MLP baseline: Same features, no graph structure. This measures the value added by the graph. If the MLP is within 2% of the GNN, the graph structure is not helping.
- Simple GNN baseline: 2-layer GCN with default hyperparameters. This establishes a minimum graph-aware baseline. If your complex architecture barely beats this, the complexity is not justified.
- Label propagation: A non-neural baseline that propagates training labels along edges. Strong on high-homophily graphs, and it provides a learning-free reference point.
- Best published method: The current state-of-the-art on the specific benchmark. Check OGB leaderboards for up-to-date numbers.
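Of the baselines above, label propagation is cheap enough to implement inline. A minimal dense-matrix sketch (real graphs need sparse operations, and the clamped-iteration variant shown here is one of several common formulations):

```python
import numpy as np

def label_propagation(adj, labels, train_mask, num_iters=50, alpha=0.9):
    """Diffuse one-hot training labels along edges, clamping known nodes.

    adj: (N, N) symmetric dense adjacency; labels: (N,) int class ids;
    train_mask: (N,) bool marking nodes with known labels.
    """
    n, c = len(labels), labels.max() + 1
    deg = adj.sum(1, keepdims=True).clip(min=1)
    P = adj / deg                                # row-normalized transition matrix
    y0 = np.zeros((n, c))
    y0[train_mask, labels[train_mask]] = 1.0     # seed with known labels
    y = y0.copy()
    for _ in range(num_iters):
        y = alpha * (P @ y) + (1 - alpha) * y0   # propagate, pulled back toward seeds
        y[train_mask] = y0[train_mask]           # clamp training nodes
    return y.argmax(1)

# Two triangles joined by a single edge: one seed per triangle suffices
adj = np.zeros((6, 6))
for a, b in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    adj[a, b] = adj[b, a] = 1
pred = label_propagation(adj, np.array([0, 0, 0, 1, 1, 1]),
                         np.array([True, False, False, True, False, False]))
# On this homophilous toy graph, each triangle recovers its seed's class
```

If a GNN cannot clearly beat this on a high-homophily benchmark, the neural machinery is adding little over the graph structure itself.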
Standard benchmarks
- OGB (Open Graph Benchmark): Node (ogbn-arxiv, ogbn-products, ogbn-proteins), link (ogbl-collab, ogbl-citation2), and graph (ogbg-molhiv, ogbg-molpcba) tasks with standardized splits and evaluation.
- RelBench: 7 relational databases, 30 prediction tasks, temporal splits. The standard for evaluating on enterprise-style relational data.
- MoleculeNet: Suite of molecular property prediction tasks. Standard for drug discovery GNNs.
- TU Datasets: Small graph classification datasets (MUTAG, PROTEINS, NCI1). Good for quick experiments, limited for serious evaluation.
Statistical rigor
- Run each experiment with 5-10 different random seeds
- Report mean and standard deviation
- Use paired statistical tests (paired t-test or Wilcoxon) to claim improvements
- A 0.3% improvement with 0.5% standard deviation is not significant
- Report training time and model size alongside accuracy
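The paired-test workflow above takes a few lines with SciPy. A sketch with invented per-seed accuracies (the numbers are illustrative, not from any real benchmark); pairing works because both models share the same seeds and splits:

```python
import numpy as np
from scipy.stats import ttest_rel, wilcoxon

# Hypothetical per-seed test accuracies for two models over 8 shared seeds
baseline = np.array([71.2, 70.8, 71.5, 70.9, 71.1, 71.4, 70.7, 71.3])
proposed = np.array([71.9, 71.5, 72.1, 71.6, 71.8, 72.0, 71.4, 71.9])

print(f"baseline: {baseline.mean():.2f} +/- {baseline.std(ddof=1):.2f}")
print(f"proposed: {proposed.mean():.2f} +/- {proposed.std(ddof=1):.2f}")

# Paired tests compare per-seed differences, which removes seed-to-seed variance
t_stat, t_p = ttest_rel(proposed, baseline)
w_stat, w_p = wilcoxon(proposed, baseline)
print(f"paired t-test p={t_p:.4f}, Wilcoxon p={w_p:.4f}")
```

Here the per-seed gap is consistent, so both tests agree the improvement is real; an unpaired test on the same numbers would have far less power because it ignores the seed-level pairing.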