
Data Leakage: When Future Information Accidentally Contaminates Training

Data leakage is the silent killer of ML models. In graphs, it is worse: a single leaked feature propagates through edges to contaminate the entire neighborhood. Your 95% AUROC in development becomes 65% in production.


TL;DR

  • Data leakage occurs when information unavailable at prediction time enters the training process. In graphs, leaked information propagates through multi-hop edges, contaminating representations far from the source.
  • Three types: temporal leakage (future edges/features), transductive leakage (test labels visible through shared neighbors), and target leakage (features derived from the target variable).
  • Temporal leakage is the most common and damaging in enterprise graphs: an order placed after the prediction date, an account frozen due to the fraud being predicted, a collection action triggered by the default.
  • Prevention requires temporal sampling (filter edges at every hop), temporal splitting (train/test by time), and careful feature auditing (no features derived from future events).
  • Detection: compare development AUROC vs production AUROC. A gap larger than 5% usually indicates leakage. Also check whether removing temporal features causes a disproportionate accuracy drop.

Data leakage occurs when information that would not be available at prediction time accidentally enters the training process. In graph ML, leakage is uniquely dangerous because information flows through edges. A single leaked feature on one node propagates through message passing to every node in its multi-hop neighborhood, silently inflating metrics across the entire graph.

Three types of graph leakage

Type 1: Temporal leakage

The most common and damaging form. Temporal leakage occurs when the GNN can see edges, features, or events from the future:

  • Future edges: An order placed after the prediction date reveals that the customer did not churn.
  • Future features: An account balance updated after the prediction date reflects transactions that have not happened yet.
  • Consequence features: An account freeze triggered by the fraud being predicted. A collection action triggered by the default.

Type 2: Transductive leakage

In the standard transductive GNN setup, training and test nodes share the same graph. During message passing, a training node can receive messages from test nodes (whose labels are masked but whose features are visible). If test node features correlate with their labels, this leaks test-time information into training.
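One way to close this channel is to train on a strictly inductive subgraph that drops every edge touching a test node, so no test-node features can enter message passing at all. The sketch below uses an illustrative edge list and split; in a real pipeline the same filter would be applied to the graph's edge index before training.

```python
# Sketch: build an inductive training graph by removing every edge that
# touches a test node. Node IDs, edges, and the split are illustrative.

def inductive_train_edges(edges, test_nodes):
    """Keep only edges whose endpoints are both training nodes."""
    test = set(test_nodes)
    return [(u, v) for (u, v) in edges if u not in test and v not in test]

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
test_nodes = [3]

# Node 3 is a test node, so both of its edges are removed from training.
print(inductive_train_edges(edges, test_nodes))  # [(0, 1), (1, 2)]
```

Masking only the labels of test nodes, as the standard transductive setup does, is not enough: their features still flow into training-node representations through these edges.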

Type 3: Target leakage

Features that are derived from or strongly correlated with the prediction target. In tabular ML, this is straightforward (e.g., “days until churn” as a feature for churn prediction). In graphs, target leakage can be indirect: a neighbor's feature that was computed using the target node's label.
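A simple first-pass audit for target leakage is to flag any feature whose correlation with the label is suspiciously high. The sketch below uses hypothetical feature names, toy values, and an illustrative threshold; a real audit would also trace neighbor-derived features back to their inputs.

```python
# Sketch of a target-leakage audit: flag features whose absolute Pearson
# correlation with the label exceeds a threshold. Data and threshold are
# illustrative, not a substitute for tracing how each feature is computed.
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def flag_leaky_features(features, labels, threshold=0.95):
    """Return names of features that correlate suspiciously with the label."""
    return [name for name, values in features.items()
            if abs(pearson(values, labels)) > threshold]

labels = [0, 0, 1, 1, 0, 1]
features = {
    "days_until_churn": [9, 8, 0, 1, 7, 0],  # derived from the target: leaks
    "account_age":      [3, 9, 4, 7, 2, 6],  # legitimate feature
}
print(flag_leaky_features(features, labels))  # ['days_until_churn']
```

High correlation alone does not prove leakage (a genuinely predictive feature can correlate strongly), but it tells you which features deserve a manual trace of how they were computed.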

How to prevent leakage

leakage_prevention.py
# 1. Temporal sampling: filter edges at EVERY hop
def safe_neighbor_sample(node, pred_time, edge_index, edge_time):
    """Keep only edges created strictly before the prediction time."""
    mask = edge_time < pred_time  # strict inequality: edges at pred_time are future
    return edge_index[:, mask]

# 2. Temporal split: train/test by time, not random
train_mask = data.timestamp < cutoff_date
test_mask = data.timestamp >= cutoff_date
# NEVER use random splits for temporal tasks

# 3. Feature audit: check every feature (pseudocode -- implement these
#    checks in your own feature registry)
for feature in features:
    assert feature.available_at(prediction_time)  # exists at prediction time?
    assert not feature.uses_future_data()         # computed using future info?
    assert not feature.derived_from(target)       # derived from the target?

Three mandatory checks. Skip any one and your model will silently cheat.
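The first check can be exercised on toy data to make the strict inequality concrete. The edges, timestamps, and prediction time below are illustrative:

```python
# Toy demonstration of check 1: edges timestamped at or after the prediction
# time must be excluded at every hop. Data below is illustrative.

def temporal_filter(edges, edge_times, pred_time):
    """Return only edges created strictly before pred_time."""
    return [e for e, t in zip(edges, edge_times) if t < pred_time]

edges      = [("a", "b"), ("b", "c"), ("c", "d")]
edge_times = [10, 20, 30]  # creation timestamps
pred_time  = 20            # we predict as of t=20

# ("b", "c") at t=20 is NOT visible: strict inequality excludes it
print(temporal_filter(edges, edge_times, pred_time))  # [('a', 'b')]
```

The edge created exactly at the prediction time is dropped: an event at t=20 has not yet been observed when the prediction for t=20 is made.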

Detecting leakage

Leakage often goes undetected because it produces excellent development metrics. Warning signs:

  • Suspiciously high accuracy: If your fraud detection model achieves 99% AUROC with a simple 2-layer GCN, something is probably leaking.
  • Development-production gap: AUROC drops by more than 5% when moving to production. This gap is the strongest single signal of leakage.
  • Feature importance analysis: If removing temporal features causes a disproportionate accuracy drop, the model may be relying on future information.
  • Temporal degradation: If the model performs well on recent test data but poorly on data further in the future, it may be learning temporal correlations that do not persist.
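The two quantitative warning signs above can be automated. The sketch below flags a development-vs-production AUROC gap above 5 points and a disproportionate drop when temporal features are ablated; the thresholds and scores are illustrative:

```python
# Sketch: automate the dev/prod gap check and the temporal-feature ablation
# check. Thresholds and AUROC values below are illustrative.

def leakage_warnings(dev_auroc, prod_auroc, ablated_auroc,
                     gap_threshold=0.05, ablation_threshold=0.10):
    """Return a list of leakage warning signs triggered by the metrics."""
    warnings = []
    if dev_auroc - prod_auroc > gap_threshold:
        warnings.append("dev/prod gap")
    if dev_auroc - ablated_auroc > ablation_threshold:
        warnings.append("temporal-feature dependence")
    return warnings

# A model scoring 0.95 in dev, 0.65 in prod, and 0.70 without temporal
# features trips both checks.
print(leakage_warnings(0.95, 0.65, 0.70))
# ['dev/prod gap', 'temporal-feature dependence']
```

Either warning on its own is only a prompt to investigate; together they match the 95%-in-development, 65%-in-production pattern described above.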

A checklist for graph ML projects

  1. Are train/test splits temporal (not random)?
  2. Does neighborhood sampling respect prediction timestamps at every hop?
  3. Are node features computed only from data available before the prediction time?
  4. Are consequence features (account freeze, collection action) excluded?
  5. In transductive setup, are test node labels fully masked?
  6. Has the model been validated on truly future held-out data?

Frequently asked questions

What is data leakage in graph ML?

Data leakage in graph ML occurs when information that would not be available at prediction time is accessible to the model during training. In graphs, leakage is especially dangerous because it propagates through edges: a single leaked feature on one node can contaminate representations of all nodes within its multi-hop neighborhood.

Why is graph leakage harder to detect than tabular leakage?

In tabular ML, leakage is typically a column that directly correlates with the target. In graphs, leakage can flow through 2-hop or 3-hop paths. A label from a test node can leak to a training node through shared neighbors. Future edges can introduce information that only exists after the prediction date. These are invisible in feature-level analysis.

How do you prevent data leakage in GNNs?

Four practices: (1) temporal sampling that filters edges by timestamp at every hop, (2) temporal train/test splits instead of random splits, (3) removing or masking features derived from future events, and (4) inductive evaluation where test nodes are completely unseen during training, not just their labels.

Learn more about graph ML

PyTorch Geometric is the open-source foundation for graph neural networks. Explore more layers, concepts, and production patterns.