Data leakage occurs when information that would not be available at prediction time accidentally enters the training process. In graph ML, leakage is uniquely dangerous because information flows through edges. A single leaked feature on one node propagates through message passing to every node in its multi-hop neighborhood, silently inflating metrics across the entire graph.
Three types of graph leakage
Type 1: Temporal leakage
The most common and damaging form. Temporal leakage occurs when the GNN can see edges, features, or events from the future:
- Future edges: An order placed after the prediction date reveals that the customer did not churn.
- Future features: An account balance updated after the prediction date reflects transactions that have not happened yet.
- Consequence features: An account freeze triggered by the fraud being predicted. A collection action triggered by the default.
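The future-edge failure mode can be shown in miniature. The sketch below uses hypothetical toy arrays (numpy only): a naive sampler keeps an edge created after the prediction date, while a time-filtered sampler excludes it.

```python
import numpy as np

# Hypothetical toy data: 3 edges with creation timestamps (in days).
# The third edge occurs AFTER the prediction date -- using it is temporal leakage.
edge_index = np.array([[0, 1, 2],
                       [1, 2, 0]])       # row 0: source nodes, row 1: targets
edge_time = np.array([5.0, 9.0, 20.0])   # when each edge was created
pred_time = 10.0                         # we predict as of day 10

naive_edges = edge_index                           # all edges, including future
safe_edges = edge_index[:, edge_time < pred_time]  # only pre-prediction edges

print(naive_edges.shape[1])  # 3: the day-20 edge leaks into the neighborhood
print(safe_edges.shape[1])   # 2: the future edge is excluded
```

The same boolean-mask idiom generalizes to any edge attribute stored as a parallel array.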
Type 2: Transductive leakage
In the standard transductive GNN setup, training and test nodes share the same graph. During message passing, a training node can receive messages from test nodes (whose labels are masked but whose features are visible). If test node features correlate with their labels, this leaks test-time information into training.
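One mitigation is to evaluate inductively: remove test nodes from the training graph entirely, so no messages can flow from them during training. A minimal sketch with numpy (the array names are illustrative, not from any particular library):

```python
import numpy as np

# Toy graph: 5 nodes, edges as a 2 x E array (hypothetical data).
edge_index = np.array([[0, 1, 2, 3, 4],
                       [1, 2, 3, 4, 0]])
train_mask = np.array([True, True, True, False, False])  # nodes 3, 4 are test

# Keep only edges whose BOTH endpoints are training nodes, so message
# passing during training never touches a test node's features.
keep = train_mask[edge_index[0]] & train_mask[edge_index[1]]
train_edges = edge_index[:, keep]

print(train_edges)  # only edges among nodes {0, 1, 2} survive
```

The cost is that training ignores some graph structure; the benefit is that development metrics are honest about what the model can see at test time.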
Type 3: Target leakage
Features that are derived from or strongly correlated with the prediction target. In tabular ML, this is straightforward (e.g., “days until churn” as a feature for churn prediction). In graphs, target leakage can be indirect: a neighbor's feature that was computed using the target node's label.
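To illustrate the indirect case, consider a hand-rolled "fraction of fraudulent neighbors" feature. If it is computed from all ground-truth labels, each node's own label leaks into its neighbors' features and comes straight back through one round of message passing. A toy sketch (hypothetical data, numpy):

```python
import numpy as np

labels = np.array([1, 0, 0])                  # node 0 is the fraud we predict
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}  # toy adjacency lists

# LEAKY: neighbor aggregate computed from ground-truth labels.
# Nodes 1 and 2 now carry node 0's label in their features, and a
# single GNN layer hands that information right back to node 0.
leaky_feat = np.array([labels[neighbors[n]].mean() for n in range(3)])

print(leaky_feat)  # nodes 1 and 2 "know" their neighbor is fraudulent
```

The safe version of such a feature must be computed with the target node's label held out, or from labels of an earlier, disjoint time window.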
How to prevent leakage
# 1. Temporal sampling: filter edges at EVERY hop
def safe_neighbor_sample(node, pred_time, edge_index, edge_time):
    mask = edge_time < pred_time  # strict inequality
    return edge_index[:, mask]
# 2. Temporal split: train/test by time, not random
train_mask = data.timestamp < cutoff_date
test_mask = data.timestamp >= cutoff_date
# NEVER use random splits for temporal tasks
# 3. Feature audit: check every feature
for feature in features:
    # Does this feature exist at prediction time?
    # Was it computed using future information?
    # Is it derived from the target variable?
    assert feature.available_at(prediction_time)
    assert not feature.uses_future_data()
    assert not feature.derived_from(target)

Three mandatory checks. Skip any one and your model will silently cheat.
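The "every hop" requirement means the mask used in safe_neighbor_sample must be reapplied each time the frontier expands; filtering only the first hop still admits future edges at hop two. A sketch of a 2-hop sampler under that rule (numpy; the function name and toy data are illustrative):

```python
import numpy as np

def temporal_k_hop(seed, pred_time, edge_index, edge_time, hops=2):
    """Collect nodes reachable from `seed` via pre-pred_time edges,
    re-applying the temporal mask at every hop."""
    visited = {seed}
    frontier = {seed}
    for _ in range(hops):
        mask = edge_time < pred_time           # filter at THIS hop too
        src, dst = edge_index[:, mask]
        nxt = {int(d) for s, d in zip(src, dst) if int(s) in frontier}
        frontier = nxt - visited
        visited |= nxt
    return visited

# Toy graph: 0 -> 1 (day 3), 1 -> 2 (day 15, i.e. after pred_time = 10).
edge_index = np.array([[0, 1], [1, 2]])
edge_time = np.array([3.0, 15.0])
print(temporal_k_hop(0, 10.0, edge_index, edge_time))  # {0, 1}: node 2 stays out
```

Without the per-hop filter, node 2 would enter node 0's 2-hop neighborhood via an edge that did not yet exist at prediction time.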
Detecting leakage
Leakage often goes undetected because it produces excellent development metrics. Warning signs:
- Suspiciously high accuracy: If your fraud detection model achieves 99% AUROC with a simple 2-layer GCN, something is probably leaking.
- Development-production gap: AUROC drops by more than 5% when moving to production. A large, otherwise unexplained gap is the clearest signature of leakage.
- Feature importance analysis: If removing temporal features causes a disproportionate accuracy drop, the model may be relying on future information.
- Temporal degradation: If the model performs well on recent test data but poorly on data further in the future, it may be learning temporal correlations that do not persist.
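The "suspiciously high accuracy" signal can be reproduced in miniature: a feature derived from the target scores near-perfectly even with a deliberately trivial classifier, while a legitimate but uninformative feature sits at chance. A synthetic sanity check (numpy only; data and classifier are toys):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)

leaky_x = y + rng.normal(0, 0.1, size=1000)  # derived from the target
clean_x = rng.normal(0, 1, size=1000)        # genuinely uninformative

def threshold_acc(x, y):
    # Trivial classifier: predict 1 when the feature exceeds its median.
    return np.mean((x > np.median(x)) == y)

print(threshold_acc(leaky_x, y))  # near 1.0: too good to be true
print(threshold_acc(clean_x, y))  # near 0.5: chance level
```

If a one-line threshold rule matches your GNN's metrics, audit the feature before celebrating the model.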
A checklist for graph ML projects
- Are train/test splits temporal (not random)?
- Does neighborhood sampling respect prediction timestamps at every hop?
- Are node features computed only from data available before the prediction time?
- Are consequence features (account freeze, collection action) excluded?
- In transductive setup, are test node labels fully masked?
- Has the model been validated on truly future held-out data?
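Several of these checklist items can be enforced mechanically rather than by review. A minimal guard for the first item, assuming timestamps are held in numpy arrays (the function name is illustrative):

```python
import numpy as np

def assert_temporal_split(train_times, test_times):
    """Fail fast if any test example predates the latest training example,
    which would make the split effectively random rather than temporal."""
    if train_times.max() >= test_times.min():
        raise ValueError("train/test windows overlap: split is not temporal")

train_times = np.array([1.0, 2.0, 3.0])
test_times = np.array([4.0, 5.0])
assert_temporal_split(train_times, test_times)  # passes: windows are disjoint
```

Running such guards in CI catches a silently re-randomized split before it can inflate a single metric.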