
Data Leakage: When Future Information Accidentally Contaminates Training

Data leakage is the silent killer of ML models. In graphs, it is worse: a single leaked feature propagates through edges to contaminate the entire neighborhood. Your 95% AUROC in development becomes 65% in production.


TL;DR

  • Data leakage occurs when information unavailable at prediction time enters the training process. In graphs, leaked information propagates through multi-hop edges, contaminating representations far from the source.
  • Three types: temporal leakage (future edges/features), transductive leakage (test labels visible through shared neighbors), and target leakage (features derived from the target variable).
  • Temporal leakage is the most common and damaging in enterprise graphs: an order placed after the prediction date, an account frozen due to the fraud being predicted, a collection action triggered by the default.
  • Prevention requires temporal sampling (filter edges at every hop), temporal splitting (train/test by time), and careful feature auditing (no features derived from future events).
  • Detection: compare development AUROC vs production AUROC. A gap larger than 5% usually indicates leakage. Also check whether removing temporal features causes a disproportionate accuracy drop.

Data leakage occurs when information that would not be available at prediction time accidentally enters the training process. In graph ML, leakage is uniquely dangerous because information flows through edges. A single leaked feature on one node propagates through message passing to every node in its multi-hop neighborhood, silently inflating metrics across the entire graph.

Three types of graph leakage

Type 1: Temporal leakage

The most common and damaging form. Temporal leakage occurs when the GNN can see edges, features, or events from the future:

  • Future edges: An order placed after the prediction date reveals that the customer did not churn.
  • Future features: An account balance updated after the prediction date reflects transactions that have not happened yet.
  • Consequence features: An account freeze triggered by the fraud being predicted. A collection action triggered by the default.

Type 2: Transductive leakage

In the standard transductive GNN setup, training and test nodes share the same graph. During message passing, a training node can receive messages from test nodes (whose labels are masked but whose features are visible). If test node features correlate with their labels, this leaks test-time information into training.
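One way to close this channel is to train on a strictly inductive subgraph that drops every edge touching a test node, so no test-node features can enter message passing at all. The sketch below uses an illustrative edge list and split; in a real pipeline the same filter would be applied to the graph's edge index before training.

```python
# Sketch: build an inductive training graph by removing every edge that
# touches a test node. Node IDs, edges, and the split are illustrative.

def inductive_train_edges(edges, test_nodes):
    """Keep only edges whose endpoints are both training nodes."""
    test = set(test_nodes)
    return [(u, v) for (u, v) in edges if u not in test and v not in test]

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
test_nodes = [3]

# Node 3 is a test node, so both of its edges are removed from training.
print(inductive_train_edges(edges, test_nodes))  # [(0, 1), (1, 2)]
```

Masking only the labels of test nodes, as the standard transductive setup does, is not enough: their features still flow into training-node representations through these edges.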

Type 3: Target leakage

Features that are derived from or strongly correlated with the prediction target. In tabular ML, this is straightforward (e.g., “days until churn” as a feature for churn prediction). In graphs, target leakage can be indirect: a neighbor's feature that was computed using the target node's label.
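A simple first-pass audit for target leakage is to flag any feature whose correlation with the label is suspiciously high. The sketch below uses hypothetical feature names, toy values, and an illustrative threshold; a real audit would also trace neighbor-derived features back to their inputs.

```python
# Sketch of a target-leakage audit: flag features whose absolute Pearson
# correlation with the label exceeds a threshold. Data and threshold are
# illustrative, not a substitute for tracing how each feature is computed.
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def flag_leaky_features(features, labels, threshold=0.95):
    """Return names of features that correlate suspiciously with the label."""
    return [name for name, values in features.items()
            if abs(pearson(values, labels)) > threshold]

labels = [0, 0, 1, 1, 0, 1]
features = {
    "days_until_churn": [9, 8, 0, 1, 7, 0],  # derived from the target: leaks
    "account_age":      [3, 9, 4, 7, 2, 6],  # legitimate feature
}
print(flag_leaky_features(features, labels))  # ['days_until_churn']
```

High correlation alone does not prove leakage (a genuinely predictive feature can correlate strongly), but it tells you which features deserve a manual trace of how they were computed.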

How to prevent leakage

leakage_prevention.py
# 1. Temporal sampling: filter edges at EVERY hop
def safe_neighbor_sample(node, pred_time, edge_index, edge_time):
    """Keep only edges created strictly before the prediction time."""
    mask = edge_time < pred_time  # strict inequality: edges at pred_time are future
    return edge_index[:, mask]

# 2. Temporal split: train/test by time, not random
train_mask = data.timestamp < cutoff_date
test_mask = data.timestamp >= cutoff_date
# NEVER use random splits for temporal tasks

# 3. Feature audit: check every feature (pseudocode -- implement these
#    checks in your own feature registry)
for feature in features:
    assert feature.available_at(prediction_time)  # exists at prediction time?
    assert not feature.uses_future_data()         # computed using future info?
    assert not feature.derived_from(target)       # derived from the target?

Three mandatory checks. Skip any one and your model will silently cheat.
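The first check can be exercised on toy data to make the strict inequality concrete. The edges, timestamps, and prediction time below are illustrative:

```python
# Toy demonstration of check 1: edges timestamped at or after the prediction
# time must be excluded at every hop. Data below is illustrative.

def temporal_filter(edges, edge_times, pred_time):
    """Return only edges created strictly before pred_time."""
    return [e for e, t in zip(edges, edge_times) if t < pred_time]

edges      = [("a", "b"), ("b", "c"), ("c", "d")]
edge_times = [10, 20, 30]  # creation timestamps
pred_time  = 20            # we predict as of t=20

# ("b", "c") at t=20 is NOT visible: strict inequality excludes it
print(temporal_filter(edges, edge_times, pred_time))  # [('a', 'b')]
```

The edge created exactly at the prediction time is dropped: an event at t=20 has not yet been observed when the prediction for t=20 is made.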

Detecting leakage

Leakage often goes undetected because it produces excellent development metrics. Warning signs:

  • Suspiciously high accuracy: If your fraud detection model achieves 99% AUROC with a simple 2-layer GCN, something is probably leaking.
  • Development-production gap: AUROC drops by more than 5% when moving to production. This gap is the strongest single signal of leakage.
  • Feature importance analysis: If removing temporal features causes a disproportionate accuracy drop, the model may be relying on future information.
  • Temporal degradation: If the model performs well on recent test data but poorly on data further in the future, it may be learning temporal correlations that do not persist.
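The two quantitative warning signs above can be automated. The sketch below flags a development-vs-production AUROC gap above 5 points and a disproportionate drop when temporal features are ablated; the thresholds and scores are illustrative:

```python
# Sketch: automate the dev/prod gap check and the temporal-feature ablation
# check. Thresholds and AUROC values below are illustrative.

def leakage_warnings(dev_auroc, prod_auroc, ablated_auroc,
                     gap_threshold=0.05, ablation_threshold=0.10):
    """Return a list of leakage warning signs triggered by the metrics."""
    warnings = []
    if dev_auroc - prod_auroc > gap_threshold:
        warnings.append("dev/prod gap")
    if dev_auroc - ablated_auroc > ablation_threshold:
        warnings.append("temporal-feature dependence")
    return warnings

# A model scoring 0.95 in dev, 0.65 in prod, and 0.70 without temporal
# features trips both checks.
print(leakage_warnings(0.95, 0.65, 0.70))
# ['dev/prod gap', 'temporal-feature dependence']
```

Either warning on its own is only a prompt to investigate; together they match the 95%-in-development, 65%-in-production pattern described above.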

A checklist for graph ML projects

  1. Are train/test splits temporal (not random)?
  2. Does neighborhood sampling respect prediction timestamps at every hop?
  3. Are node features computed only from data available before the prediction time?
  4. Are consequence features (account freeze, collection action) excluded?
  5. In transductive setup, are test node labels fully masked?
  6. Has the model been validated on truly future held-out data?

Frequently asked questions

What is data leakage in graph ML?

Data leakage in graph ML occurs when information that would not be available at prediction time is accessible to the model during training. In graphs, leakage is especially dangerous because it propagates through edges: a single leaked feature on one node can contaminate representations of all nodes within its multi-hop neighborhood.

Why is graph leakage harder to detect than tabular leakage?

In tabular ML, leakage is typically a column that directly correlates with the target. In graphs, leakage can flow through 2-hop or 3-hop paths. A label from a test node can leak to a training node through shared neighbors. Future edges can introduce information that only exists after the prediction date. These are invisible in feature-level analysis.

How do you prevent data leakage in GNNs?

Four practices: (1) temporal sampling that filters edges by timestamp at every hop, (2) temporal train/test splits instead of random splits, (3) removing or masking features derived from future events, and (4) inductive evaluation where test nodes are completely unseen during training, not just their labels.

Learn more about graph ML

PyTorch Geometric is the open-source foundation for graph neural networks. Explore more layers, concepts, and production patterns.