
Temporal Splits: Splitting Train/Test by Time to Prevent Leakage

Random splits are the default in ML tutorials and the source of most inflated benchmarks. For any time-dependent graph task, you must split by time. The model trains on the past and is tested on the future, just like production.


TL;DR

  • Temporal splits divide data by time: train on events before a cutoff, validate on the next period, test on the period after that. This simulates real deployment, where models predict the future from the past.
  • Random splits leak future information: a test event's future neighbors can appear in training, inflating metrics by 5-15%. Every temporal graph task requires temporal splits.
  • For graphs, the split applies to both nodes (which entities are predicted) and edges (which relationships are visible). Test nodes can only use edges from before the cutoff.
  • Use rolling temporal splits for robust evaluation: train on months 1-6, test on month 7; then train on months 1-7, test on month 8. This captures performance across different time periods.
  • KumoRFM and RelBench use strict temporal splits as the evaluation standard, ensuring reported metrics reflect realistic production performance.

A temporal split divides the dataset by time instead of randomly. Training data consists of events that occurred before a cutoff date. Test data consists of events that occurred after the cutoff. The model learns from the past and is evaluated on the future. This is the only evaluation protocol that produces realistic performance estimates for time-dependent prediction tasks.

Why random splits fail on graphs

Consider a churn prediction task with a random 80/20 split. A customer in the test set has orders spanning January to June. Some of their January orders land in the training set, some April orders in the test set. During training, the GNN aggregates messages from the customer's January orders, which share product nodes with their April orders. Future purchase patterns leak into the training representation.

The result: development AUROC is 0.88, production AUROC is 0.74. The 14-point gap is entirely due to temporal leakage through the graph structure.
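The mechanics of the leak can be seen without any model at all. The sketch below uses a hypothetical six-order history for one customer (the dates and the hard-coded "random" assignment are illustrative, not data from a real benchmark) and checks the one property that matters: whether any training event postdates a test event.

```python
import pandas as pd

# Hypothetical order log for one customer: six orders, January to June.
orders = pd.DataFrame({
    "customer_id": [42] * 6,
    "order_ts": pd.to_datetime([
        "2024-01-15", "2024-02-03", "2024-03-20",
        "2024-04-08", "2024-05-11", "2024-06-27",
    ]),
})

# A random split interleaves in time; here one such assignment is
# hard-coded for reproducibility: Feb and Apr orders go to test.
random_test = orders.iloc[[1, 3]]
random_train = orders.drop(random_test.index)
# Training now contains May/June orders that postdate the February
# test order -- future information has leaked into training.
assert random_train["order_ts"].max() > random_test["order_ts"].min()

# A temporal split at April 1 cannot leak: everything the model
# trains on strictly precedes everything it is tested on.
cutoff = pd.Timestamp("2024-04-01")
temporal_train = orders[orders["order_ts"] < cutoff]
temporal_test = orders[orders["order_ts"] >= cutoff]
assert temporal_train["order_ts"].max() < temporal_test["order_ts"].min()
```

The two assertions are the whole story: a random split fails the "all training data precedes all test data" invariant, and a temporal split satisfies it by construction.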

How to implement temporal splits

temporal_split.py
import pandas as pd

def temporal_split(timestamps, train_ratio=0.7, val_ratio=0.1):
    """Split data by time into train/val/test."""
    sorted_times = sorted(timestamps.unique())
    n = len(sorted_times)

    train_cutoff = sorted_times[int(n * train_ratio)]
    val_cutoff = sorted_times[int(n * (train_ratio + val_ratio))]

    train_mask = timestamps < train_cutoff
    val_mask = (timestamps >= train_cutoff) & (timestamps < val_cutoff)
    test_mask = timestamps >= val_cutoff

    return train_mask, val_mask, test_mask

# For graph data, apply to BOTH nodes and edges:
# 1. Node split: which entities to predict on
node_train, node_val, node_test = temporal_split(node_timestamps)

# 2. Edge filter: which edges are visible during training
# Training: only edges before train_cutoff
# Validation: only edges before val_cutoff
# Test: only edges before test_cutoff

The split applies to both target nodes and the edges visible during GNN computation. Both must respect the temporal boundary.

Graph-specific considerations

Edge visibility

In a temporal graph split, the edge set changes per split:

  • Training: Only edges with t < train_cutoff
  • Validation: Only edges with t < val_cutoff
  • Testing: Only edges with t < test_cutoff

This means the test set GNN computation uses a strictly larger graph than the training computation (it includes all training edges plus validation-period edges). This is correct: at test time, all past information is available.
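The per-split edge filtering above can be sketched with plain tensors. The `edge_index`/`edge_time` tensors and cutoff values below are toy data, and `visible_edges` is an illustrative helper, not a PyG API:

```python
import torch

# Toy temporal edge list: edge_index columns are edges,
# edge_time holds each edge's timestamp (e.g. day of the event).
edge_index = torch.tensor([[0, 1, 2, 3],
                           [1, 2, 3, 0]])
edge_time = torch.tensor([1, 5, 8, 12])

train_cutoff, val_cutoff, test_cutoff = 6, 10, 14

def visible_edges(cutoff):
    """Keep only edges that occurred strictly before the cutoff."""
    mask = edge_time < cutoff
    return edge_index[:, mask]

train_graph = visible_edges(train_cutoff)  # edges at t=1, 5
val_graph = visible_edges(val_cutoff)      # adds the t=8 edge
test_graph = visible_edges(test_cutoff)    # all four edges
```

Because the cutoffs are ordered, the three edge sets are nested: the validation graph contains every training edge, and the test graph contains both, which is exactly the "all past information is available" behavior described above.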

New nodes

Some entities only appear after the cutoff (new customers, new products). These nodes have no training history. A robust model should handle both:

  • Existing entities: Have rich historical context from before the cutoff
  • New entities: Must be predicted from limited context (cold start)

Report metrics separately for existing and new entities. Performance on new entities is often 10-20% lower and matters for growth-stage businesses.
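Splitting the report by entity age is a one-liner once you know which entities existed before the cutoff. The sketch below uses made-up names, labels, and predictions, and plain accuracy as a stand-in for AUROC:

```python
# Entities seen before the train cutoff (illustrative data).
train_entities = {"alice", "bob", "carol"}

# Test rows: (entity, true_label, predicted_label).
test_rows = [
    ("alice", 1, 1), ("bob", 0, 0), ("carol", 1, 0),  # existing
    ("dave", 1, 0), ("erin", 0, 0),                   # new (cold start)
]

def accuracy(rows):
    """Fraction of rows where the prediction matches the label."""
    return sum(y == p for _, y, p in rows) / len(rows)

existing = [r for r in test_rows if r[0] in train_entities]
new = [r for r in test_rows if r[0] not in train_entities]
print(f"existing: {accuracy(existing):.2f}, new: {accuracy(new):.2f}")
```

A single pooled number would hide the gap between the two groups; reporting them separately makes the cold-start penalty visible.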

Common mistakes

  • Splitting nodes but not edges: Temporal node split with all edges visible during training still leaks through future edges to training nodes.
  • Using event time, not prediction time: The split should be on the prediction timestamp, not the event timestamp. Predicting March 1 churn should use the March 1 graph snapshot.
  • Leaking validation into training: Using validation performance to select features or tune hyperparameters, then reporting test performance. The validation set must be strictly between train and test in time.
  • Single split on seasonal data: If your single test set falls on Black Friday, performance looks great. Use rolling splits to average across periods.
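The rolling-split scheme from the TL;DR (train on months 1-6, test on month 7; train on months 1-7, test on month 8) can be sketched as an expanding-window generator. The `rolling_splits` helper and the eight-month toy series are illustrative, not a library API:

```python
import pandas as pd

def rolling_splits(timestamps, n_folds=2, freq="M"):
    """Expanding-window splits: each fold trains on every period
    before the test period and tests on that single period."""
    periods = timestamps.dt.to_period(freq)
    uniq = sorted(periods.unique())
    for test_period in uniq[-n_folds:]:
        train_mask = periods < test_period
        test_mask = periods == test_period
        yield train_mask, test_mask

# Eight months of daily events: fold 1 trains on Jan-Jun and tests
# on July; fold 2 trains on Jan-Jul and tests on August.
ts = pd.Series(pd.date_range("2024-01-01", "2024-08-31", freq="D"))
for train_mask, test_mask in rolling_splits(ts):
    print(train_mask.sum(), test_mask.sum())
```

Averaging the metric across folds smooths out period-specific effects like the Black Friday spike mentioned above.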

Frequently asked questions

What is a temporal split?

A temporal split divides the dataset by time: training data comes from before a cutoff date, and test data comes from after it. For graphs, this means training on the graph as it existed at time T and testing on events that occurred after T. This simulates real deployment where you always predict the future from the past.

Why not use random splits for graph data?

Random splits allow future information to leak into training. If a March transaction is in the test set but its February and April neighbors are in the training set, the training process sees temporal context from after the March event. Random splits consistently overestimate performance by 5-15% on temporal tasks.

How do you choose the split cutoff date?

Common practice: use 70% of the time range for training, 10% for validation, and 20% for testing. The exact dates should align with business cycles (avoid splitting mid-quarter) and ensure sufficient events in each split. For highly seasonal data, consider multiple temporal splits across different seasons.

Learn more about graph ML

PyTorch Geometric is the open-source foundation for graph neural networks. Explore more layers, concepts, and production patterns.