Metrics by task type
Node classification
Predict a label for each node (fraud/not-fraud, churn/retain, category).
- Accuracy: Only useful for balanced classes. A fraud detector with 99% accuracy that predicts “not fraud” for everything is useless.
- AUROC: Area under the ROC curve. Threshold-independent, handles imbalance well. The standard metric for binary classification on graphs. KumoRFM reports 76.71 AUROC on RelBench vs 62.44 for flat-table baselines.
- Average Precision (AP): Area under the precision-recall curve. More informative than AUROC when the positive rate is very low (< 1%). Use for fraud detection.
- Macro-F1: Average F1 across all classes. Good for multi-class problems with imbalanced classes.
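The accuracy-vs-AUROC point above can be demonstrated in a few lines. A minimal sketch using scikit-learn on synthetic fraud-like data (the 1% positive rate and the score model are illustrative assumptions, not from any real dataset):

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, average_precision_score

rng = np.random.default_rng(0)

# Fraud-like setting: ~1% positive rate
y_true = (rng.random(10_000) < 0.01).astype(int)

# Degenerate model that always predicts "not fraud"
always_negative = np.zeros_like(y_true)
print(f"Accuracy: {accuracy_score(y_true, always_negative):.3f}")  # ~0.99, yet useless

# A model with real signal: positives score higher on average
scores = y_true * 0.5 + rng.random(10_000)
print(f"AUROC: {roc_auc_score(y_true, scores):.3f}")
print(f"AP:    {average_precision_score(y_true, scores):.3f}")
```

The always-negative model scores ~99% accuracy but 0 recall; AUROC and AP expose the difference immediately.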
Link prediction
Predict missing or future edges (recommendations, knowledge graph completion).
- MRR (Mean Reciprocal Rank): For each positive edge, rank it among negative edges. MRR is the mean of 1/rank. Higher is better. Standard for knowledge graph completion.
- Hits@K: Fraction of positive edges ranked in the top K. Hits@10 is standard for recommendations. Hits@1 for knowledge graphs.
- AUROC: Binary classification of positive vs negative edges. Simple but sensitive to negative sampling strategy.
```python
import torch
from torchmetrics.retrieval import RetrievalMRR, RetrievalHitRate

# Score all candidate edges
pos_scores = model.score(pos_edges)  # real edges, one per query
neg_scores = model.score(neg_edges)  # num_neg sampled negatives per positive

# MRR: how high are positives ranked?
all_scores = torch.cat([pos_scores, neg_scores])
labels = torch.cat([
    torch.ones(len(pos_scores), dtype=torch.long),   # retrieval targets must be int/bool
    torch.zeros(len(neg_scores), dtype=torch.long),
])
# Query i groups positive i with the num_neg negatives sampled for it
indexes = torch.cat([
    torch.arange(len(pos_scores)),
    torch.arange(len(pos_scores)).repeat_interleave(num_neg),
])

mrr = RetrievalMRR()
hits10 = RetrievalHitRate(top_k=10)
print(f"MRR: {mrr(all_scores, labels, indexes=indexes):.4f}")
print(f"Hits@10: {hits10(all_scores, labels, indexes=indexes):.4f}")
```

MRR evaluation requires grouping negatives with their corresponding positive: the indexes tensor maps each score to its query.
Graph classification
Predict a property of the entire graph (molecular activity, protein function).
- AUROC: For binary graph classification.
- MAE/RMSE: For graph regression (e.g., QM9 molecular properties).
- Accuracy: Acceptable here because graph-level datasets are often balanced.
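MAE penalizes all errors linearly, while RMSE weights large errors more heavily, so they can rank models differently. A minimal sketch in plain PyTorch with made-up graph-level predictions:

```python
import torch

# Hypothetical graph-level predictions vs. targets (e.g., a QM9-style property)
preds  = torch.tensor([1.2, 0.8, 3.1, 2.0])
target = torch.tensor([1.0, 1.0, 3.0, 2.5])

mae = (preds - target).abs().mean()            # mean absolute error
rmse = ((preds - target) ** 2).mean().sqrt()   # root mean squared error
print(f"MAE: {mae:.3f}, RMSE: {rmse:.3f}")     # MAE: 0.250, RMSE: 0.292
```

The single 0.5 miss dominates RMSE but contributes only its share to MAE; prefer RMSE when large errors are disproportionately costly.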
The split strategy matters more than the metric
Random split (wrong for production)
Randomly splitting nodes into train/val/test is the default in most tutorials. It is wrong for production because it allows temporal leakage: the model trains on data from Tuesday and evaluates on data from Monday. In production, you always predict the future.
Temporal split (correct for production)
```python
# Correct: temporal split
train_mask = data.node_time < cutoff_train                # before Feb 1
val_mask = (data.node_time >= cutoff_train) & \
           (data.node_time < cutoff_val)                  # Feb 1-15
test_mask = data.node_time >= cutoff_val                  # after Feb 15

# During training, only use edges observed before cutoff_train
train_edges = data.edge_index[:, data.edge_time < cutoff_train]
```

Temporal splits simulate production conditions: train on the past, evaluate on the future. This is the only valid split for production GNN evaluation.
Negative sampling for link prediction
Link prediction metrics depend heavily on how negative samples (non-edges) are generated:
- Random negatives: Sample random non-edges. Easy to distinguish from positives. Inflates AUROC to 95%+.
- Hard negatives: Sample non-edges between nodes that are close in the graph (2-3 hops). Much harder to distinguish. Gives realistic AUROC (70-85%).
- Per-source negatives: For each source node, sample negatives from all possible targets. Standard for ranking metrics (MRR, Hits@K).
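A minimal sketch of per-source negative sampling (the function name is hypothetical, and the corruption is unfiltered for brevity; a production sampler would also drop sampled targets that happen to be true edges):

```python
import torch

def sample_per_source_negatives(pos_edges, num_nodes, num_neg):
    """For each positive (src, dst) edge, corrupt the target:
    keep src and sample num_neg random target nodes."""
    src = pos_edges[0].repeat_interleave(num_neg)          # each source num_neg times
    neg_dst = torch.randint(0, num_nodes, (src.numel(),))  # random targets
    return torch.stack([src, neg_dst])

pos_edges = torch.tensor([[0, 1, 2],
                          [3, 4, 5]])  # 3 positive edges as [src; dst]
neg_edges = sample_per_source_negatives(pos_edges, num_nodes=100, num_neg=4)
print(neg_edges.shape)  # torch.Size([2, 12])
```

Because the negatives are emitted in per-source order, they line up directly with the repeat_interleave-based indexes used for MRR and Hits@K above.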
Segmented evaluation
Aggregate metrics hide performance variation:
- By node degree: Well-connected nodes (degree > 50) may have 90% AUROC while cold-start nodes (degree < 5) have 65%. Report both.
- By node type: In heterogeneous graphs, some node types are harder to predict than others. Report per-type metrics.
- By time: Performance may degrade as predictions get further from training data. Track metric decay over time.
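A sketch of degree-segmented evaluation on synthetic data (the degrees, labels, and the degree-dependent signal are all fabricated for illustration; in practice you would bucket real node degrees):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 2000
degree = rng.integers(1, 100, n)   # hypothetical node degrees
y_true = rng.integers(0, 2, n)     # binary labels
# Hypothetical model: strong signal for high-degree nodes, weak for cold-start
signal = np.where(degree >= 50, 1.5, 0.2)
y_score = y_true * signal + rng.normal(0, 1, n)

for name, mask in [("degree >= 50", degree >= 50), ("degree < 5", degree < 5)]:
    auroc = roc_auc_score(y_true[mask], y_score[mask])
    print(f"{name}: AUROC = {auroc:.3f} ({mask.sum()} nodes)")
```

The aggregate AUROC sits between the two segments and hides the cold-start weakness entirely.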
What breaks in production
- Metric gaming: Optimizing for AUROC can produce models with great ranking but poor calibration. If you need calibrated probabilities (risk scores), add Brier score or calibration error to your evaluation.
- Evaluation data leakage: If your evaluation pipeline shares code with training, feature leakage can infect both. Run evaluation in a separate pipeline with independent data loading.
- Offline-online gap: Even with temporal splits, offline evaluation uses static graphs while production graphs change continuously. Run online A/B tests before trusting offline improvements.
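The ranking-vs-calibration gap is easy to demonstrate: two models with identical AUROC can have very different Brier scores. A synthetic sketch (the cubic "overconfident" transform is an illustrative stand-in for a miscalibrated model, not a real failure mode from any library):

```python
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score

rng = np.random.default_rng(0)
n = 10_000
p_true = rng.random(n)                    # true event probabilities
y = (rng.random(n) < p_true).astype(int)  # labels drawn from them

p_calibrated = p_true
# Same ranking (strictly monotone transform), but pushed toward 0 and 1
p_overconfident = p_true**3 / (p_true**3 + (1 - p_true)**3)

print(f"AUROC: {roc_auc_score(y, p_calibrated):.3f} vs {roc_auc_score(y, p_overconfident):.3f}")
print(f"Brier: {brier_score_loss(y, p_calibrated):.3f} vs {brier_score_loss(y, p_overconfident):.3f}")
```

AUROC is identical because it only sees the ordering; the Brier score penalizes the overconfident probabilities, which is why it belongs in the evaluation whenever the scores are consumed as probabilities.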