
Temporal Splits: Splitting Train/Test by Time to Prevent Leakage

Random splits are the default in ML tutorials and the source of most inflated benchmarks. For any time-dependent graph task, you must split by time. The model trains on the past and is tested on the future, just like production.


TL;DR

  • Temporal splits divide data by time: train on events before a cutoff, validate on the next period, test on the period after that. This simulates real deployment, where models predict the future from the past.
  • Random splits leak future information: a test event's future neighbors can appear in training, inflating metrics by 5-15%. Every temporal graph task requires temporal splits.
  • For graphs, the split applies to both nodes (which entities are predicted) and edges (which relationships are visible). Test nodes can only use edges from before the cutoff.
  • Use rolling temporal splits for robust evaluation: train on months 1-6, test on month 7; then train on months 1-7, test on month 8. This captures performance across different time periods.
  • KumoRFM and RelBench use strict temporal splits as the evaluation standard, ensuring reported metrics reflect realistic production performance.

A temporal split divides the dataset by time instead of randomly. Training data consists of events that occurred before a cutoff date. Test data consists of events that occurred after the cutoff. The model learns from the past and is evaluated on the future. This is the only evaluation protocol that produces realistic performance estimates for time-dependent prediction tasks.

Why random splits fail on graphs

Consider a churn prediction task with a random 80/20 split. A customer in the test set has orders spanning January to June. Some of their January orders land in the training set, some April orders in the test set. During training, the GNN aggregates messages from the customer's January orders, which share product nodes with their April orders. Future purchase patterns leak into the training representation.

The result: development AUROC is 0.88, production AUROC is 0.74. The 14-point gap is entirely due to temporal leakage through the graph structure.
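The mechanics of the leak can be seen without any model at all. The sketch below uses a hypothetical six-order history for one customer (the dates and the hard-coded "random" assignment are illustrative, not data from a real benchmark) and checks the one property that matters: whether any training event postdates a test event.

```python
import pandas as pd

# Hypothetical order log for one customer: six orders, January to June.
orders = pd.DataFrame({
    "customer_id": [42] * 6,
    "order_ts": pd.to_datetime([
        "2024-01-15", "2024-02-03", "2024-03-20",
        "2024-04-08", "2024-05-11", "2024-06-27",
    ]),
})

# A random split interleaves in time; here one such assignment is
# hard-coded for reproducibility: Feb and Apr orders go to test.
random_test = orders.iloc[[1, 3]]
random_train = orders.drop(random_test.index)
# Training now contains May/June orders that postdate the February
# test order -- future information has leaked into training.
assert random_train["order_ts"].max() > random_test["order_ts"].min()

# A temporal split at April 1 cannot leak: everything the model
# trains on strictly precedes everything it is tested on.
cutoff = pd.Timestamp("2024-04-01")
temporal_train = orders[orders["order_ts"] < cutoff]
temporal_test = orders[orders["order_ts"] >= cutoff]
assert temporal_train["order_ts"].max() < temporal_test["order_ts"].min()
```

The two assertions are the whole story: a random split fails the "all training data precedes all test data" invariant, and a temporal split satisfies it by construction.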

How to implement temporal splits

temporal_split.py
import pandas as pd

def temporal_split(timestamps, train_ratio=0.7, val_ratio=0.1):
    """Split data by time into train/val/test."""
    sorted_times = sorted(timestamps.unique())
    n = len(sorted_times)

    train_cutoff = sorted_times[int(n * train_ratio)]
    val_cutoff = sorted_times[int(n * (train_ratio + val_ratio))]

    train_mask = timestamps < train_cutoff
    val_mask = (timestamps >= train_cutoff) & (timestamps < val_cutoff)
    test_mask = timestamps >= val_cutoff

    return train_mask, val_mask, test_mask

# For graph data, apply to BOTH nodes and edges:
# 1. Node split: which entities to predict on
node_train, node_val, node_test = temporal_split(node_timestamps)

# 2. Edge filter: which edges are visible during training
# Training: only edges before train_cutoff
# Validation: only edges before val_cutoff
# Test: only edges before test_cutoff

The split applies to both target nodes and the edges visible during GNN computation. Both must respect the temporal boundary.

Graph-specific considerations

Edge visibility

In a temporal graph split, the edge set changes per split:

  • Training: Only edges with t < train_cutoff
  • Validation: Only edges with t < val_cutoff
  • Testing: Only edges with t < test_cutoff

This means the test set GNN computation uses a strictly larger graph than the training computation (it includes all training edges plus validation-period edges). This is correct: at test time, all past information is available.
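The per-split edge filtering above can be sketched with plain tensors. The `edge_index`/`edge_time` tensors and cutoff values below are toy data, and `visible_edges` is an illustrative helper, not a PyG API:

```python
import torch

# Toy temporal edge list: edge_index columns are edges,
# edge_time holds each edge's timestamp (e.g. day of the event).
edge_index = torch.tensor([[0, 1, 2, 3],
                           [1, 2, 3, 0]])
edge_time = torch.tensor([1, 5, 8, 12])

train_cutoff, val_cutoff, test_cutoff = 6, 10, 14

def visible_edges(cutoff):
    """Keep only edges that occurred strictly before the cutoff."""
    mask = edge_time < cutoff
    return edge_index[:, mask]

train_graph = visible_edges(train_cutoff)  # edges at t=1, 5
val_graph = visible_edges(val_cutoff)      # adds the t=8 edge
test_graph = visible_edges(test_cutoff)    # all four edges
```

Because the cutoffs are ordered, the three edge sets are nested: the validation graph contains every training edge, and the test graph contains both, which is exactly the "all past information is available" behavior described above.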

New nodes

Some entities only appear after the cutoff (new customers, new products). These nodes have no training history. A robust model should handle both:

  • Existing entities: Have rich historical context from before the cutoff
  • New entities: Must be predicted from limited context (cold start)

Report metrics separately for existing and new entities. Performance on new entities is often 10-20% lower and matters for growth-stage businesses.
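Splitting the report by entity age is a one-liner once you know which entities existed before the cutoff. The sketch below uses made-up names, labels, and predictions, and plain accuracy as a stand-in for AUROC:

```python
# Entities seen before the train cutoff (illustrative data).
train_entities = {"alice", "bob", "carol"}

# Test rows: (entity, true_label, predicted_label).
test_rows = [
    ("alice", 1, 1), ("bob", 0, 0), ("carol", 1, 0),  # existing
    ("dave", 1, 0), ("erin", 0, 0),                   # new (cold start)
]

def accuracy(rows):
    """Fraction of rows where the prediction matches the label."""
    return sum(y == p for _, y, p in rows) / len(rows)

existing = [r for r in test_rows if r[0] in train_entities]
new = [r for r in test_rows if r[0] not in train_entities]
print(f"existing: {accuracy(existing):.2f}, new: {accuracy(new):.2f}")
```

A single pooled number would hide the gap between the two groups; reporting them separately makes the cold-start penalty visible.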

Common mistakes

  • Splitting nodes but not edges: Temporal node split with all edges visible during training still leaks through future edges to training nodes.
  • Using event time, not prediction time: The split should be on the prediction timestamp, not the event timestamp. Predicting March 1 churn should use the March 1 graph snapshot.
  • Leaking validation into training: Using validation performance to select features or tune hyperparameters, then reporting test performance. The validation set must be strictly between train and test in time.
  • Single split on seasonal data: If your single test set falls on Black Friday, performance looks great. Use rolling splits to average across periods.
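The rolling-split scheme from the TL;DR (train on months 1-6, test on month 7; train on months 1-7, test on month 8) can be sketched as an expanding-window generator. The `rolling_splits` helper and the eight-month toy series are illustrative, not a library API:

```python
import pandas as pd

def rolling_splits(timestamps, n_folds=2, freq="M"):
    """Expanding-window splits: each fold trains on every period
    before the test period and tests on that single period."""
    periods = timestamps.dt.to_period(freq)
    uniq = sorted(periods.unique())
    for test_period in uniq[-n_folds:]:
        train_mask = periods < test_period
        test_mask = periods == test_period
        yield train_mask, test_mask

# Eight months of daily events: fold 1 trains on Jan-Jun and tests
# on July; fold 2 trains on Jan-Jul and tests on August.
ts = pd.Series(pd.date_range("2024-01-01", "2024-08-31", freq="D"))
for train_mask, test_mask in rolling_splits(ts):
    print(train_mask.sum(), test_mask.sum())
```

Averaging the metric across folds smooths out period-specific effects like the Black Friday spike mentioned above.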

Frequently asked questions

What is a temporal split?

A temporal split divides the dataset by time: training data comes from before a cutoff date, and test data comes from after it. For graphs, this means training on the graph as it existed at time T and testing on events that occurred after T. This simulates real deployment where you always predict the future from the past.

Why not use random splits for graph data?

Random splits allow future information to leak into training. If a March transaction is in the test set but its February and April neighbors are in the training set, the training process sees temporal context from after the March event. Random splits consistently overestimate performance by 5-15% on temporal tasks.

How do you choose the split cutoff date?

Common practice: use 70% of the time range for training, 10% for validation, and 20% for testing. The exact dates should align with business cycles (avoid splitting mid-quarter) and ensure sufficient events in each split. For highly seasonal data, consider multiple temporal splits across different seasons.

Learn more about graph ML

PyTorch Geometric is the open-source foundation for graph neural networks. Explore more layers, concepts, and production patterns.