Why time matters in graphs
Consider a fraud detection graph. A customer makes 100 transactions over 6 months. In a static graph, all 100 transactions exist as edges simultaneously. The GNN sees the customer’s entire history, including transactions that happened after the one you are trying to classify.
This is temporal leakage: the model uses future information to predict the past. It achieves 98% AUROC in offline evaluation (because it can see the answer) and 68% in production (where the future does not exist yet).
Three temporal graph approaches
1. Static snapshots
The simplest approach: build a new graph at each prediction time, including only edges that existed before that timestamp.
```python
import torch
from torch_geometric.data import Data

def build_snapshot(edges_df, features_df, cutoff_time):
    """Build a graph using only edges before cutoff_time."""
    mask = edges_df["timestamp"] < cutoff_time
    filtered = edges_df[mask]
    edge_index = torch.tensor(
        [filtered["src"].values, filtered["dst"].values],
        dtype=torch.long,
    )
    x = torch.tensor(features_df.values, dtype=torch.float32)
    return Data(x=x, edge_index=edge_index)

# Build monthly snapshots for training
for month in training_months:
    snapshot = build_snapshot(edges, features, month)
    train_on_snapshot(model, snapshot)
```

Static snapshots are correct but wasteful: you rebuild the entire graph for each prediction timestamp, so predicting at N distinct timestamps means N graph constructions.
2. Discrete-time dynamic graphs
Build a sequence of graph snapshots at fixed intervals (hourly, daily, weekly) and model the temporal dynamics between them. This captures graph evolution without the overhead of continuous time.
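The bookkeeping behind discrete-time snapshots can be sketched without any GNN machinery. In this illustrative sketch (the event format and interval width are assumptions, not a fixed API), events are plain `(src, dst, timestamp)` tuples bucketed by a fixed interval; a discrete-time model then consumes the buckets in order, e.g. one GNN pass per snapshot feeding a recurrent layer:

```python
from collections import defaultdict

def bucket_events(events, interval_seconds):
    """Group (src, dst, timestamp) events into fixed-interval snapshot buckets.

    Bucket k holds the edges created during
    [k * interval_seconds, (k + 1) * interval_seconds).
    """
    buckets = defaultdict(list)
    for src, dst, ts in events:
        buckets[ts // interval_seconds].append((src, dst))
    return dict(buckets)

# Daily buckets (86,400 s) over a toy event stream
events = [(0, 1, 10), (1, 2, 90000), (2, 3, 90500), (0, 3, 200000)]
snapshots = bucket_events(events, 86400)
# snapshots == {0: [(0, 1)], 1: [(1, 2), (2, 3)], 2: [(0, 3)]}
```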
3. Continuous-time dynamic graphs
Store every event with its exact timestamp and use temporal encodings to represent when edges were created. This is the most expressive but requires specialized architectures (TGN, TGAT) and temporal neighbor sampling.
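The core constraint these continuous-time architectures share is temporal neighbor sampling: at query time t, a node may only aggregate from events strictly before t. A minimal sketch, assuming each node's events are stored as a time-sorted list of `(timestamp, neighbor)` pairs (the storage layout here is an assumption of this sketch, not the TGN/TGAT API):

```python
import bisect

def recent_neighbors(adj, node, t, k):
    """Return up to k most recent neighbors of `node` with edge time strictly before t.

    `adj` maps node -> list of (timestamp, neighbor) pairs sorted by timestamp.
    """
    events = adj.get(node, [])
    # Index of the first event at or after the query time t
    cut = bisect.bisect_left(events, (t,))
    # Take the k events immediately before the cut
    return [nbr for ts, nbr in events[max(0, cut - k):cut]]

adj = {0: [(5, 1), (12, 2), (20, 3), (31, 4)]}
recent_neighbors(adj, 0, 25, 2)  # -> [2, 3], the two most recent before t=25
```

The strict "before t" cut is what makes this leakage-free: the edge at t=31 is invisible to a query at t=25, exactly as it would be in production.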
Encoding time as features
Raw timestamps (Unix epoch seconds) are meaningless to neural networks. Convert them into learnable representations:
```python
import torch
import numpy as np

def encode_time_features(timestamps, reference_time):
    """Convert epoch-second timestamps (1-D float tensor) to useful GNN features."""
    # Relative time (most important)
    dt = reference_time - timestamps  # seconds since event
    dt_hours = dt / 3600.0
    dt_days = dt / 86400.0
    # Cyclical encodings (capture periodicity)
    hour = (timestamps % 86400) / 3600.0
    day_of_week = ((timestamps // 86400) % 7).float()
    hour_sin = torch.sin(2 * np.pi * hour / 24)
    hour_cos = torch.cos(2 * np.pi * hour / 24)
    dow_sin = torch.sin(2 * np.pi * day_of_week / 7)
    dow_cos = torch.cos(2 * np.pi * day_of_week / 7)
    # Log-scaled recency (compresses long tails)
    log_recency = torch.log1p(dt_hours)
    return torch.stack([
        dt_hours, dt_days, log_recency,
        hour_sin, hour_cos, dow_sin, dow_cos,
    ], dim=-1)
```

Log-scaled recency is often the single most predictive temporal feature in production models: recent events matter far more than old ones, and the log compresses the long tail of stale history into a usable range.
Temporal sampling in PyG
PyG’s NeighborLoader supports temporal filtering through the time_attr parameter. This ensures that during training, each seed node only sees edges that existed before its prediction timestamp.
```python
from torch_geometric.loader import NeighborLoader

# Assign timestamps to edges
data.edge_time = edge_timestamps
# Each training node has a prediction time
data.node_time = prediction_timestamps

loader = NeighborLoader(
    data,
    num_neighbors=[15, 10],
    batch_size=512,
    input_nodes=train_mask,
    time_attr="edge_time",
    input_time=data.node_time[train_mask],
)
```

What breaks in production
- Clock skew: Different data sources report timestamps in different time zones or with different latencies. A transaction logged 30 seconds late can leak into the wrong snapshot. Normalize all timestamps to UTC and add a safety buffer.
- Feature staleness: Node features computed from aggregations (e.g., “average order value”) must be recomputed at each prediction timestamp. Using pre-aggregated features that include future data is a subtle form of leakage.
- Training-serving skew: If training uses daily snapshots but serving uses real-time data, the model sees a different temporal distribution. Align training and serving temporal granularity.
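The clock-skew safety buffer above reduces to subtracting a latency allowance from every cutoff. A minimal sketch; the 300-second buffer is an illustrative value, not a recommendation — size it to your pipeline's worst observed logging latency:

```python
def snapshot_cutoff(prediction_time, buffer_seconds=300):
    """Effective cutoff for edge inclusion: prediction time minus a safety buffer.

    A snapshot built at prediction time T only admits edges with
    timestamp < T - buffer_seconds, so events still in flight at T
    (recorded inside the buffer window but not yet logged) are excluded
    identically at training and serving time.
    """
    return prediction_time - buffer_seconds

# A snapshot "at" t=1,000,000 with a 300 s buffer admits only edges
# timestamped before 999,700, so a transaction logged 30 s late can no
# longer land on the wrong side of the cutoff.
cutoff = snapshot_cutoff(1_000_000)  # -> 999_700
```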