
Accelerating Graph Construction
Three optimizations that cut production graph transformer pipeline times by 12-25%: ingestion skip, eager loading, and consistent partitioning for edge reuse.
The Bottleneck: Graph Construction in Production
Kumo's Graph Transformer platform turns relational databases into temporal heterogeneous graphs, then runs transformer-based models over those graphs. Model training and inference get most of the attention, but in production the upstream stages (ingestion, materialization, graph engine startup) often consume a significant share of the total wall-clock time.
Two usage patterns make this especially painful:
- Iterative development. Data scientists experiment rapidly: modifying which tables to include, adjusting primary keys, tweaking time columns, changing data types. Each configuration change historically triggered a full pipeline re-run, even when the underlying source data had not changed.
- Repeated production training. Models retrain on fresh data daily or hourly. In most cases only a handful of tables contain new rows, while the majority remain unchanged. Yet the system rebuilt everything from scratch on every run.
The result: teams spent more time waiting for data preparation than for actual model training. Three targeted optimizations, released in Kumo versions 2.11 and 2.12, address this directly. Combined, they deliver 12-25% faster end-to-end training times, averaging around 20% improvement across customer workloads.
Anatomy of the Kumo Pipeline
Before examining the optimizations, it helps to understand the stages that every Kumo training run executes. The pipeline has four major stages, each with distinct compute characteristics.
Stage 1: Ingestion
Ingestion copies data from the customer's source system (Snowflake, Databricks, S3) into Kumo's secure data plane. The goal is to create a stable snapshot that downstream stages can process without contending with live updates. During ingestion, the system also performs data type conversion for each column and data cleaning (filtering null values in primary keys and timestamps).
Stage 2: Table materialization
Materialization transforms the ingested tabular data into tensor format suitable for graph processing. Each table's rows are assigned contiguous integer indices, and features are encoded and stored in PyTorch Geometric's remote backend. This stage converts human-readable database rows into the dense numerical representations the model consumes.
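The index assignment can be sketched in a few lines. This is a simplified illustration, not Kumo's implementation: `materialize_table` and the pandas-based representation are assumptions for the sake of the example.

```python
import pandas as pd

def materialize_table(df: pd.DataFrame, pk: str):
    """Assign each row a contiguous integer index (0..n-1) and build a
    primary-key -> index mapping that edge materialization can later use
    to translate foreign keys into row indices."""
    df = df.reset_index(drop=True)  # contiguous integer row indices
    pk_to_index = {key: i for i, key in enumerate(df[pk])}
    return df, pk_to_index

users = pd.DataFrame({"user_id": ["u3", "u1", "u2"], "age": [31, 25, 47]})
users, user_index = materialize_table(users, pk="user_id")
# user_index == {"u3": 0, "u1": 1, "u2": 2}
```

Note that the mapping depends entirely on row order, which is exactly why deterministic ordering matters later for edge caching.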
Stage 3: Edge materialization
Foreign key relationships between tables become edges in the graph. Edge materialization performs join operations on primary-foreign key pairs to produce edge lists. These are stored in a remote graph store. For large databases with many foreign key relationships, this stage can be the most expensive part of the pipeline.
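Conceptually, edge materialization is a join that translates each foreign key into a pair of integer row indices. A minimal pandas sketch (the table names and `row_idx` column are illustrative assumptions, not Kumo's schema):

```python
import pandas as pd

# Toy tables: users (PK user_id) and orders (FK user_id).
users = pd.DataFrame({"user_id": ["u1", "u2", "u3"]})
orders = pd.DataFrame({"order_id": ["o1", "o2", "o3"],
                       "user_id": ["u2", "u2", "u1"]})

# Both tables carry contiguous integer indices from table materialization.
users["row_idx"] = range(len(users))
orders["row_idx"] = range(len(orders))

# The PK-FK join turns every foreign key into a (src_idx, dst_idx) edge.
joined = orders.merge(users, on="user_id", suffixes=("_order", "_user"))
edge_list = [(int(s), int(d))
             for s, d in zip(joined["row_idx_order"], joined["row_idx_user"])]
# edge_list == [(0, 1), (1, 1), (2, 0)]
```

At production scale this join runs on the distributed backend rather than in pandas, which is what makes it one of the priciest stages.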
Stage 4: Graph engine startup
The graph engine (feature store, graph store, and neighbor sampler) must load all materialized data before training can begin. In the original design, this loading phase waited until all materialization completed, creating idle time between the end of materialization and the start of training.
Kumo training pipeline (before optimization):
- Ingestion — copy from Snowflake / Databricks / S3, type conversion, null filtering
- Table materialization — convert rows to tensor format, assign contiguous indices
- Edge materialization — join on FK relationships to produce edge lists
- Graph engine load — load feature store, graph store, sampler (waits for all materialization)
- Model training — graph transformer forward/backward passes
Optimization 1: Ingestion Skip
The first optimization targets a specific pain point in iterative development. When a data scientist changes a table's configuration (adjusting a data type, selecting a different primary key, modifying the time column), the original system triggered a full re-ingestion of that table from the source system.
This was wasteful. The source data itself had not changed. Only the interpretation of that data (which column is the primary key, what type to assign a column) had changed. Re-copying terabytes from Snowflake or Databricks just because a data scientist tweaked a configuration parameter added minutes or hours of unnecessary wait time.
The fix: decouple configuration from ingestion
The solution separates the concern of “what data do we have?” from “how do we interpret it?” Configuration changes (data type conversions, primary key selection, time column adjustments) now apply lazily during the materialization stage rather than forcing a data re-copy during ingestion.
Ingestion only re-runs when the source data itself has changed. If a data scientist is experimenting with different graph designs on the same underlying dataset, ingestion is skipped entirely on subsequent runs.
| Scenario | Before | After | Improvement |
|---|---|---|---|
| Change primary key on large table | Full re-ingestion | Skip to materialization | ~50% faster refresh |
| Adjust data type on a column | Full re-ingestion | Skip to materialization | ~50% faster refresh |
| Modify time column selection | Full re-ingestion | Skip to materialization | ~50% faster refresh |
| Source data actually changed | Full ingestion | Full ingestion | No change (correct behavior) |
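The skip decision reduces to fingerprinting the source snapshot while keeping configuration out of the fingerprint. A minimal sketch, with hypothetical helper names and metadata fields (Kumo's actual change detection is not documented here):

```python
import hashlib
import json

def snapshot_fingerprint(source_meta: dict) -> str:
    """Fingerprint of the data itself (e.g. warehouse last-modified time,
    row count), deliberately independent of any Kumo-side configuration."""
    blob = json.dumps(source_meta, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

def should_reingest(current_meta: dict, cached_fingerprint: str) -> bool:
    # PK choice, dtypes, and time-column selection are not part of
    # source_meta, so config changes can never force a re-ingestion.
    return snapshot_fingerprint(current_meta) != cached_fingerprint

meta = {"table": "orders", "last_altered": "2024-06-01", "row_count": 10_000}
cached = snapshot_fingerprint(meta)
assert not should_reingest(meta, cached)                       # same data: skip
assert should_reingest({**meta, "row_count": 10_001}, cached)  # new rows: re-ingest
```

The key design choice is what goes into the fingerprint: anything configuration-derived would reintroduce the spurious re-ingestions the optimization removes.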
Optimization 2: Graph Engine Eager Loading
In the original pipeline, the graph engine (feature store, graph store, neighbor sampler) only began initializing after all table and edge materialization completed. On a database with 20 tables, the engine sat idle while the last few tables finished materializing, even though the first 18 had been ready for minutes.
The fix: progressive warm-up
The feature store and graph store now begin loading as soon as each individual table or edge set finishes materializing. Instead of a synchronization barrier that waits for the slowest table, the engine progressively warms up during the materialization phase.
By the time the last table or edge set completes, the engine has already loaded the vast majority of the graph. The remaining startup time is just the delta: loading the final few artifacts rather than the entire graph from scratch.
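The pattern is replacing a barrier with completion-order loading. A self-contained sketch using Python's standard library (the `materialize` and `load_into_engine` stubs are stand-ins for the real stages, not Kumo APIs):

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def materialize(table: str) -> str:
    time.sleep(0.01)       # stand-in for real materialization work
    return table

def load_into_engine(table: str, engine: list) -> None:
    engine.append(table)   # stand-in for feature/graph store loading

engine: list = []
tables = ["users", "orders", "items", "events"]

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(materialize, t) for t in tables]
    # Load each artifact the moment it finishes, rather than waiting
    # for the slowest table behind a synchronization barrier.
    for fut in as_completed(futures):
        load_into_engine(fut.result(), engine)

# By the time the loop exits, the engine is fully warm.
assert sorted(engine) == sorted(tables)
```

The barrier version would be a single `wait(futures)` followed by one bulk load; the `as_completed` form overlaps loading with the remaining materialization work.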
Pipeline with eager loading (overlap between stages):
- Ingestion — copy and clean source data
- Materialization — tables and edges processed in parallel
- Eager engine load — feature/graph stores load progressively as each table/edge completes
- Training starts earlier — 30+ minutes saved on medium-to-large graphs
The practical impact scales with graph size. For small graphs with only a few tables, the engine loads quickly regardless. For medium-to-large graphs with dozens of tables and complex foreign key relationships, eager loading means training starts 30+ minutes earlier than it did before.
Optimization 3: Consistent Partitioning for Edge Reuse
Edge materialization is one of the most expensive stages in the pipeline. It performs join operations across tables to produce edge lists from foreign key relationships. In repeated training scenarios (retraining on fresh data daily or hourly), much of this work is redundant: if neither the source table nor the target table has changed, the edges between them are identical.
The obvious solution is caching: store previously computed edges and reuse them when the inputs have not changed. But naive caching introduces a subtle and dangerous correctness risk.
The consistency problem
During table materialization, each row is assigned a contiguous integer index. These indices are what edges reference: edge (3, 7) means “row 3 in the source table connects to row 7 in the target table.” If the indexing is not perfectly deterministic, the same row could receive index 3 on one run and index 5 on the next. Cached edges would then point to the wrong rows, producing silently incorrect training data.
In distributed compute environments (Spark, Databricks, Snowflake), deterministic indexing is not guaranteed by default. Shuffle operations, parallelism settings, and Parquet block sizes can all change row ordering between runs.
The fix: deterministic indexing
Kumo built a deterministic indexing system that enforces consistent partitioning across all supported compute backends. The system locks down:
- Shuffle configurations: fixed partition counts and hash functions so rows land in the same partitions across runs
- Sort orders: explicit ordering within each partition ensures row sequence is reproducible
- Parquet block sizes: fixed block boundaries prevent row reordering during serialization and deserialization
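The core idea can be illustrated in pure Python: partition by an explicit, process-independent hash, then sort within each partition, so the final index depends only on the keys and never on arrival order. A simplified sketch (the function names are illustrative; the real system enforces this on Spark, Databricks, and Snowflake):

```python
import hashlib

def stable_partition(key: str, num_partitions: int) -> int:
    """Platform-independent hash partitioning. Python's built-in hash()
    is salted per process and would not be reproducible across runs."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

def deterministic_index(keys: list, num_partitions: int = 4) -> dict:
    partitions = [[] for _ in range(num_partitions)]
    for k in keys:
        partitions[stable_partition(k, num_partitions)].append(k)
    index, i = {}, 0
    for part in partitions:
        for k in sorted(part):  # explicit sort order within each partition
            index[k] = i
            i += 1
    return index

# The arrival order differs between runs; the index assignment does not.
run_a = deterministic_index(["u3", "u1", "u2", "u4"])
run_b = deterministic_index(["u2", "u4", "u1", "u3"])
assert run_a == run_b
```

Without both the fixed hash and the within-partition sort, an edge cached as `(3, 7)` could silently point at different rows on the next run.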
With deterministic indexing in place, the system can safely compare the current run's inputs against the cached edge data. If the primary keys, time columns, and source data for both tables in a foreign key relationship are unchanged, the cached edges are reused directly.
| Condition | Edge Reuse? | Rationale |
|---|---|---|
| Source data unchanged, same PK/time columns | Yes | Deterministic indexing guarantees identical edge lists |
| Source data changed, same PK/time columns | No | New rows produce new indices and new edges |
| Same source data, different PK selected | No | Different PK changes the join semantics |
| Same source data, different time column | No | Temporal edges depend on the time column choice |
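The reuse conditions in the table reduce to an equality check over each endpoint table's state. A hedged sketch (the `TableState` structure is an assumption made for illustration):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class TableState:
    data_fingerprint: str       # hash of the ingested snapshot
    primary_key: str
    time_column: Optional[str]  # None for tables with no temporal edges

def can_reuse_edges(prev: Tuple[TableState, TableState],
                    curr: Tuple[TableState, TableState]) -> bool:
    """Cached edges are valid only when BOTH endpoint tables are
    byte-identical and interpreted with the same PK and time column."""
    return prev == curr

old = (TableState("abc", "user_id", "ts"), TableState("def", "order_id", "ts"))
assert can_reuse_edges(old, old)                      # unchanged: reuse
new_pk = (TableState("abc", "email", "ts"), old[1])   # same data, new PK
assert not can_reuse_edges(old, new_pk)               # join semantics changed
```

Any change to a field invalidates the cache, which is conservative by design: a stale edge list would corrupt training data silently, while an unnecessary rebuild only costs time.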
Benchmark Results
Kumo validated the combined optimizations using a synthetic benchmark based on the H&M retail dataset, mirroring common production usage patterns. The benchmark framework automatically logs timing data across pipeline stages via Jenkins integration, with Grafana dashboards planned for continuous monitoring.
Synthetic benchmark (H&M dataset)
The combined effect of all three optimizations on the H&M benchmark:
| Optimization | Target Stage | Standalone Impact |
|---|---|---|
| Ingestion skip | Ingestion | ~50% reduction in refresh time during iterative development |
| Eager loading | Engine startup | 30+ minutes saved on medium-to-large graphs |
| Consistent partitioning | Edge materialization | Full edge reuse when source data is unchanged |
| All three combined | End-to-end | ~25% total runtime reduction |
Production customer results
The optimizations were validated on two production customer workloads with different characteristics:
| Customer | Workload Type | Improvement | Notes |
|---|---|---|---|
| Food delivery company | Repeated daily training | ~25% | Graph construction was a large share of total runtime |
| Social media platform | Massive-scale graphs | 12.5% | Model training dominates runtime; optimizations target pre-training stages |
The variance between 12.5% and 25% is informative. The food delivery company's workload spent a larger fraction of total runtime on ingestion and materialization, so optimizing those stages had a proportionally bigger impact. The social media platform runs massive-scale graphs where model training itself is the dominant cost. Even there, a 12.5% end-to-end improvement is significant when training runs take hours.
Practical Implications and Availability
These three optimizations address the two most common production workflows: iterative experimentation and repeated retraining. The improvements are purely in the data preparation layer. No model architecture changes, no accuracy tradeoffs, no changes to the training algorithm.
For data scientists iterating on graph design
Ingestion skip is the highest-impact change. Every time you adjust a primary key, change a data type, or modify a time column, the system no longer re-copies data from your warehouse. Experimentation cycles that previously took an hour now complete in roughly half that time. This compounds across a typical development session where a data scientist might try 10-20 configuration variants.
For production teams retraining models
Consistent partitioning and eager loading deliver the bulk of the value. Edge reuse eliminates redundant computation when most tables have not changed between training runs. Eager loading ensures the graph engine is warmed up by the time the last table finishes materializing. Together, these reduce the wall-clock gap between “new data arrives” and “model training begins.”
Availability
- Kumo v2.11: Initial optimizations (ingestion skip, early eager loading)
- Kumo v2.12: Full optimization suite enabled, including consistent partitioning for edge reuse
All optimizations are enabled by default. No configuration changes are required.
Try KumoRFM on your own data
Zero-shot predictions are free. Fine-tuning is available with a trial.