
Time Series Forecasting with Graph Transformers
An interactive guide to forecasting on graph-structured relational data, combining temporal encodings, Graph Transformer embeddings, and generative diffusion models.
Why Graph Structure Matters for Forecasting
Most time series forecasting treats each sequence in isolation. You have daily sales for Store #42, and you build a model that looks at Store #42's history to predict its future. Facebook Prophet, ARIMA, and most deep learning forecasters operate this way: one sequence in, one forecast out.
But real-world time series rarely exist in a vacuum. Store #42's sales depend on its geography, the products it stocks, marketing campaigns running in its region, competitor openings nearby, and the purchasing behavior of its customer base. All of this context lives in related database tables connected through foreign keys: stores, products, transactions, customers, campaigns.
This is the core insight behind Kumo's approach: time series data in relational databases is naturally graph-structured. Each entity (a store, a product, a customer) is a node. Foreign-key relationships are edges. The time series you want to forecast is an attribute of a specific node, but the signals that drive it propagate across the entire graph.
The framework introduced in this research leverages Graph Transformers to encode this relational context directly into the forecasting pipeline. Rather than hand-engineering features from related tables, the model learns which cross-table signals matter and how they influence future values.
The Four Conditioning Signals
The forecasting framework predicts future values for a given entity (say, a store) by conditioning on four distinct signal types. Each captures a different dimension of context that influences the forecast.
Formally, the prediction at time t for entity e is:
x_e(t) = f(rho(t), c, z_e, p_e)
Where each input carries a specific role:
1. Date-time frequency encodings (rho(t))
These are frequency-encoded representations of the current timestamp: date, time of day, day of week, month, and other cyclical temporal features. They let the model learn that Saturdays behave differently from Tuesdays, that December has different patterns than March, and that 2:00 PM traffic differs from 6:00 AM traffic. The frequency encodings capture periodicity at different scales simultaneously.
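As a minimal sketch of such an encoding (the exact set of cycles used in the framework is not specified, so the four below are illustrative), each calendar cycle can be mapped to a sin/cos pair:

```python
import numpy as np
import pandas as pd

def datetime_frequency_encoding(ts: pd.Timestamp) -> np.ndarray:
    """Encode a timestamp as sin/cos pairs over several calendar cycles."""
    cycles = {
        "hour_of_day":   (ts.hour, 24),
        "day_of_week":   (ts.dayofweek, 7),
        "day_of_month":  (ts.day - 1, 31),
        "month_of_year": (ts.month - 1, 12),
    }
    feats = []
    for value, period in cycles.values():
        angle = 2 * np.pi * value / period
        feats.extend([np.sin(angle), np.cos(angle)])
    return np.array(feats)

# 8-dimensional vector: one sin/cos pair per cycle
enc = datetime_frequency_encoding(pd.Timestamp("2024-12-21 14:00"))
```

The sin/cos pairing keeps the encoding continuous across cycle boundaries: 11:00 PM and midnight land close together, rather than at opposite ends of a linear scale.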
2. Calendar embeddings (c)
Special events like holidays, promotional periods, and local observances require dedicated treatment. The framework processes calendar events through 1D CNNs that create context windows around each event. This lets the model see a window of adjacent calendar information, capturing lead and lag effects around holidays and promotional periods rather than treating each event as an isolated point.
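A minimal sketch of this idea, with hypothetical dimensions (the number of event types, embedding size, and window radius are assumptions, not values from the paper):

```python
import torch
import torch.nn as nn

class CalendarEncoder(nn.Module):
    """Sketch: 1D CNN over binary calendar-event tracks.

    A kernel of size 2k+1 gives each day a context window of k days
    before and after, so the model can pick up lead and lag effects
    around holidays and promotions.
    """
    def __init__(self, n_event_types: int = 4, dim: int = 16, k: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(n_event_types, dim, kernel_size=2 * k + 1, padding=k)

    def forward(self, events: torch.Tensor) -> torch.Tensor:
        # events: (batch, n_event_types, n_days) binary markers
        return torch.relu(self.conv(events))  # (batch, dim, n_days)

# 30-day window, 4 event types, one holiday on day 15
events = torch.zeros(1, 4, 30)
events[0, 0, 15] = 1.0
c = CalendarEncoder()(events)  # (1, 16, 30)
```

With k=3, the embedding for days 12 through 18 is influenced by the day-15 holiday, which is exactly the lead/lag behavior described above.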
3. Graph entity encodings (z_e)
This is where Graph Transformers enter the picture. For each entity being forecast, the framework samples a temporal subgraph: the entity's neighbors, their neighbors, and the features attached to each node. A Graph Transformer processes this subgraph to produce a fixed-size embedding that captures the entity's relational context. A store's graph encoding might reflect its product mix, customer demographics, regional trends, and competitive landscape, all extracted automatically from the connected tables.
4. Past sequence encodings (p_e)
The entity's own historical time series values are encoded through a sequence model. The authors tested transformers, CNNs, and MLPs for this component and found that 1D convolutions over the temporal dimension provided the best efficiency-to-accuracy tradeoff. This encoding captures autoregressive trends, recent momentum, and local patterns specific to the entity.
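A sketch of such a convolutional sequence encoder, with assumed layer sizes (the paper reports only that 1D convolutions won the tradeoff, not the exact architecture):

```python
import torch
import torch.nn as nn

class PastSequenceEncoder(nn.Module):
    """Sketch: stacked 1D convolutions over the entity's history,
    pooled over time into a fixed-size encoding p_e."""
    def __init__(self, dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over time -> fixed size
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, history_length)
        return self.net(x.unsqueeze(1)).squeeze(-1)  # (batch, dim)

history = torch.randn(8, 90)          # 8 entities, 90 days of history each
p_e = PastSequenceEncoder()(history)  # (8, 32)
```

The adaptive pooling makes the encoder indifferent to history length, which is convenient when entities have unequal amounts of past data.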
In summary, the pipeline combines:
- Date-time encodings: sinusoidal frequency encodings of the timestamp over day, week, month, and year cycles.
- Calendar embeddings: a 1D CNN over holiday and event markers, with context windows for lead/lag effects.
- Graph entity encoding: a Graph Transformer over the temporal subgraph of related entities and their features.
- Past sequence encoding: a 1D convolution over the entity's historical time series values.
- Forecast: the combined signals produce point estimates or full distributional forecasts.
How Graph Transformers Encode Relational Structure
The graph encoding step is the architectural innovation that distinguishes this framework from conventional forecasting methods. The goal: transform a local neighborhood of the relational graph into a dense vector that captures all relevant cross-table signals for a given entity.
The process works as follows:
Temporal subgraph sampling
For each entity being forecast, the framework samples a subgraph from the relational database. This sampling is temporally aware: it only includes nodes and edges with timestamps earlier than the current prediction time. This strict temporal filtering prevents data leakage, ensuring the model never sees future information during training or inference.
For example, when forecasting Store #42's visits for next Tuesday, the subgraph includes that store's past transactions, the products involved in those transactions, the customers who made them, and features of all these entities, but only data from before the prediction date.
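The leakage-prevention rule reduces to a strict timestamp filter applied before any subgraph is sampled. A minimal sketch, with made-up edge tuples standing in for the database's foreign-key links:

```python
from datetime import date

def temporal_filter(edges, cutoff):
    """Keep only edges whose timestamp strictly precedes the prediction
    time, so no future information can leak into the sampled subgraph.
    `edges` is a list of (src, dst, timestamp) tuples, a simplified
    stand-in for the database's foreign-key relationships."""
    return [(s, d, t) for (s, d, t) in edges if t < cutoff]

edges = [
    ("store_42", "txn_1", date(2024, 5, 1)),
    ("store_42", "txn_2", date(2024, 6, 3)),
    ("store_42", "txn_3", date(2024, 7, 9)),  # after the cutoff: excluded
]
visible = temporal_filter(edges, cutoff=date(2024, 7, 1))
# only txn_1 and txn_2 remain eligible for the sampled subgraph
```

The same filter is applied recursively as sampling expands to neighbors of neighbors, so the entire subgraph respects the cutoff.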
Graph Transformer processing
The sampled subgraph is processed through a Graph Transformer, which uses positional encodings to adapt the standard Transformer architecture for graph-structured inputs. Unlike sequence Transformers that rely on sequential position, Graph Transformers encode the structural position of each node within the graph. This lets the attention mechanism reason about both individual node features and the topology of connections between them.
The output is a single embedding vector z_e for the target entity, computed as:
z_e = T(G_e, X_e)
Where G_e is the sampled subgraph structure and X_e is the matrix of node features within that subgraph.
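The computation z_e = T(G_e, X_e) can be sketched in plain PyTorch as full self-attention over the subgraph's nodes. Note the simplifications: hop distance from the target stands in for the richer structural positional encodings a real Graph Transformer would use, and all dimensions are illustrative:

```python
import torch
import torch.nn as nn

class SubgraphTransformer(nn.Module):
    """Sketch of z_e = T(G_e, X_e): self-attention over all nodes of the
    sampled subgraph, with a learned structural positional encoding per
    node (here simplified to hop distance from the target entity)."""
    def __init__(self, feat_dim: int = 16, max_hops: int = 4):
        super().__init__()
        self.hop_emb = nn.Embedding(max_hops + 1, feat_dim)
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)

    def forward(self, X: torch.Tensor, hops: torch.Tensor) -> torch.Tensor:
        # X: (n_nodes, feat_dim) node features; hops: (n_nodes,) hop distances
        h = (X + self.hop_emb(hops)).unsqueeze(0)  # (1, n_nodes, feat_dim)
        out, _ = self.attn(h, h, h)                # attention over whole subgraph
        return out[0, 0]                           # embedding of the target node

X = torch.randn(7, 16)                    # a 7-node sampled subgraph
hops = torch.tensor([0, 1, 1, 2, 2, 3, 3])
z_e = SubgraphTransformer()(X, hops)      # (16,) entity encoding
```

Because the attention is over every node at once, a 3-hop neighbor can receive as much weight as a direct neighbor, which is the behavior contrasted with message-passing GNNs below.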
Why not standard message-passing GNNs?
Standard GNN architectures (GCN, GraphSAGE) aggregate neighbor information through fixed message-passing rules. Graph Transformers use attention over the entire sampled subgraph, which allows them to selectively focus on the most relevant nodes regardless of their graph distance. A distant but highly relevant node (a competitor store three hops away) can receive high attention weight, while a nearby but irrelevant node gets low weight. This selective aggregation is particularly valuable for relational databases where not all connections carry equal predictive signal.
Regression vs. Generative Forecasting
The framework supports two distinct paradigms for producing forecasts, and the choice between them has significant practical implications.
Regression (Predictive): fast, simple, point estimates.

Pros:
- Single forward pass per prediction
- Straightforward MSE training
- Low computational cost at inference

Cons:
- Produces only point estimates
- Prone to mean collapse on multimodal distributions
- Implicitly assumes a Gaussian error distribution
- Loses high-frequency detail in volatile series

Generative (Diffusion): richer, distributional, captures uncertainty.

Pros:
- Full distributional forecasts with uncertainty bands
- Captures multimodal outcomes (not just the mean)
- Better preservation of high-frequency details
- Enables extraction of quantile bands and modes

Cons:
- Requires 1,000 denoising steps (DDPM schedule)
- Higher computational cost at inference
- More complex training pipeline
Regression approach
The predictive model uses MLPs trained with mean-squared error loss. Given the four conditioning signals, it outputs a single point estimate for each future timestep. This implicitly assumes that the forecast errors follow a Gaussian distribution centered on the predicted value. For smooth, unimodal time series, this works well.
The problem appears with multimodal distributions. If a store's daily visits could plausibly be either 50 (normal day) or 200 (event day), the regression model will predict 125: the mean, which is actually the least likely outcome. This is called mean collapse, and it is a fundamental limitation of MSE-trained point estimators.
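The arithmetic behind mean collapse is easy to verify with the numbers from the example (the 50/50 split of normal vs. event days is an illustrative assumption):

```python
import numpy as np

# Bimodal outcomes: visits are either ~50 (normal day) or ~200 (event day)
rng = np.random.default_rng(0)
visits = np.where(rng.random(10_000) < 0.5, 50.0, 200.0)

# The MSE-optimal point prediction is the sample mean...
mse_optimal = visits.mean()  # ~125

# ...which sits far from BOTH modes: a value that essentially never occurs.
```

No amount of training fixes this: any MSE-trained point estimator converges toward the conditional mean, so bimodal targets need a distributional model instead.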
Generative approach
The generative model uses a conditional diffusion process based on Denoising Diffusion Probabilistic Models (DDPM). Instead of predicting the future values directly, it learns to denoise: starting from pure Gaussian noise, the model iteratively refines a forecast over 1,000 steps, conditioned on the same four signals. Each run produces one sample from the learned distribution.
By running the diffusion process multiple times, you get a collection of plausible forecasts. From these samples, you can extract:
- Quantile bands (10th, 50th, 90th percentiles) for confidence intervals
- Multiple modes representing distinct possible outcomes
- Full uncertainty estimates that vary across the forecast horizon
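Extracting quantile bands from the samples is a one-liner. The sketch below uses random trajectories purely as a stand-in for repeated diffusion runs; the sample count and noise level are illustrative:

```python
import numpy as np

# Stand-in for repeated diffusion runs: each row is one sampled
# 90-day forecast trajectory (random here, for illustration only).
rng = np.random.default_rng(1)
samples = 100 + 10 * rng.standard_normal((200, 90))  # (n_samples, horizon)

q10, q50, q90 = np.quantile(samples, [0.1, 0.5, 0.9], axis=0)
# q10/q50/q90 each have shape (90,): per-day quantile bands
band_width = q90 - q10  # uncertainty can widen or narrow across the horizon
```

With real diffusion samples, band_width typically grows with lead time, giving planners an honest picture of how forecast confidence decays over the 90-day horizon.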
Training: Temporal Sampling and Loss
Training the forecasting model requires careful handling of time. Unlike image classification or NLP tasks where training examples are independent, time series forecasting must respect temporal ordering to avoid leakage.
Training procedure
For each training iteration, the framework:
- Samples a time point from the available training period
- Constructs a temporal subgraph using only data from before that time point (no future information)
- Computes all four conditioning signals at that time point: date-time encodings, calendar embeddings, graph entity encoding from the subgraph, and past sequence encoding from historical values
- Predicts future values for the forecast horizon
- Minimizes the loss against ground truth future values
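The procedure above can be collapsed into a toy training loop. Everything here is a deliberate simplification: a single synthetic series, a linear model, and the four conditioning signals reduced to the past window alone; only the temporal-sampling pattern is the point:

```python
import torch
import torch.nn as nn

# Toy stand-in: 200 days of one synthetic series, forecasting 7 days ahead.
series = torch.sin(torch.arange(200.0) * 0.3) + 0.1 * torch.randn(200)
model = nn.Linear(30, 7)  # maps a 30-day context to a 7-day forecast
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

for step in range(200):
    # 1. Sample a time point t from the training period
    t = torch.randint(30, 193, (1,)).item()
    # 2.-3. Condition only on data strictly before t (no leakage);
    #        here the four signals collapse to the past window alone
    past = series[t - 30:t]
    # 4. Predict the forecast horizon
    pred = model(past)
    # 5. Minimize MSE against the ground-truth future values
    loss = ((pred - series[t:t + 7]) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```

Sampling t afresh each iteration is also what provides the data-augmentation effect discussed below: the model sees the same series at many different points in its history.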
Loss function
For the regression model, the loss is standard MSE summed over all entities and future timesteps:
L = sum_{e, t > n} || x_e(t) - x_hat_e(t) ||^2
For the generative model, the loss trains the denoising network: at each training step, noise is added to the ground truth future sequence at a random noise level from the DDPM schedule, and the model learns to predict the added noise. This is the standard diffusion training objective.
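The standard DDPM objective can be sketched as follows. The conditioning signals are omitted for brevity, the denoiser is a toy linear layer rather than a real network, and the beta schedule endpoints are the common DDPM defaults (an assumption, not a value from the paper):

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # DDPM noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal level

denoiser = nn.Linear(7 + 1, 7)                 # toy eps-predictor (input: x_t, t)

x0 = torch.randn(16, 7)                        # ground-truth future sequences
t = torch.randint(0, T, (16,))                 # random noise level per sample
eps = torch.randn_like(x0)                     # the noise to be predicted

a = alpha_bar[t].unsqueeze(1)
x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps     # noised future sequence
pred_eps = denoiser(torch.cat([x_t, t.float().unsqueeze(1) / T], dim=1))
loss = ((pred_eps - eps) ** 2).mean()          # learn to predict the added noise
```

At inference, the trained denoiser is applied in reverse over the 1,000 steps, starting from pure noise and conditioned on the four signals, to produce one forecast sample per run.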
Why temporal sampling matters
The temporal subgraph sampling during training serves two purposes. First, it prevents data leakage by enforcing that the model only conditions on past information. Second, it provides data augmentation: by sampling different time points, the model sees the same entities at different stages of their history, learning how relational context evolves over time.
Forecasting Results
The framework was evaluated on a concrete task: predicting daily store visits 90 days into the future. Three methods were compared: Facebook Prophet (a widely-used statistical baseline), the predictive (regression) Graph Transformer model, and the generative (diffusion) Graph Transformer model.
| Method | MAE | MAPE |
|---|---|---|
| Facebook Prophet | 5.87 | 0.21 |
| Predictive Graph Transformer | 5.26 | 0.18 |
| Generative Graph Transformer | 5.29 | 0.18 |
Quantitative findings
Both the predictive and generative Graph Transformer models outperformed Facebook Prophet on MAE (Mean Absolute Error) and MAPE (Mean Absolute Percentage Error). The predictive model achieved an MAE of 5.26 vs. Prophet's 5.87, a 10.4% reduction in error. MAPE dropped from 0.21 to 0.18, a 14.3% relative improvement.
The generative model matched the predictive model on MAPE (0.18) with a nearly identical MAE (5.29 vs. 5.26). The small difference in MAE is expected: the generative model optimizes for distributional accuracy rather than point prediction.
Qualitative observations
The most significant difference between Prophet and the graph-based models appears in the forecast visualizations:
- Prophet captures overall trend and weekly seasonality but exhibits greater divergence from ground truth over the 90-day horizon. Its forecasts show pronounced mean collapse, smoothing out day-to-day variation into a flat seasonal pattern.
- Predictive Graph Transformer tracks the actual values more closely, capturing local variations that Prophet misses. The incorporation of relational signals from the graph helps the model anticipate store-specific patterns.
- Generative Graph Transformer shows less mean collapse than both alternatives. It preserves high-frequency details and captures day-to-day variation that the regression approach smooths over. The diffusion-based sampling produces forecasts that look more realistic.
The source of the improvement
The performance gain comes from the graph structure. Prophet operates on the store's historical visits alone. The Graph Transformer models additionally leverage signals from connected entities: transaction patterns, product-level data, customer behavior, and any other tables linked through foreign keys. These relational signals provide context that a single time series cannot contain.
Practical Implications and Tools
This research demonstrates a repeatable pattern: when time series data exists within a relational database (which it almost always does in enterprise settings), encoding the relational structure through Graph Transformers improves forecasting accuracy. The framework provides an end-to-end solution that works directly on graph structures, making predictions on node subsets while leveraging signals from entire graphs.
When this approach applies
The graph-based forecasting framework is most valuable when:
- Multiple related tables exist. If your forecasting target (sales, visits, demand) connects to other tables through foreign keys, those tables contain signals the model can exploit. Product catalogs, customer profiles, marketing campaigns, geographic data, and supplier information all contribute context.
- Entity-level forecasting is needed. Forecasting aggregate totals (company-wide revenue) benefits less from graph structure. Forecasting at the entity level (per-store, per-product, per-customer) benefits substantially because each entity has a unique relational neighborhood.
- Uncertainty matters. The generative model provides distributional forecasts with uncertainty bands. For inventory planning, capacity allocation, or risk management, knowing the range of plausible outcomes is often more valuable than a single point estimate.
Implementation ecosystem
The framework builds on established open-source tools:
- PyTorch Geometric (PyG) provides the Graph Transformer implementation, temporal sampling utilities, and graph data structures used throughout the pipeline.
- RelBench offers a unified evaluation framework for benchmarking ML models on relational data tasks, including the forecasting scenarios used in this research.
- Relational Deep Learning (RDL) provides the framework for automatically extracting graph structures from relational databases, transforming multi-table schemas into node-feature graphs without manual feature engineering.
From research to production
The pipeline demonstrated here (temporal subgraph sampling, Graph Transformer encoding, and either regression or diffusion-based forecasting) is directly applicable to enterprise forecasting workloads. Product demand forecasting, workforce scheduling, energy load prediction, and financial metric forecasting all involve time series embedded in relational databases. The key architectural choice is treating the relational context as a first-class input rather than discarding it or manually engineering features from it.