
Accelerating Graph Construction
Three optimizations that cut production graph transformer pipeline times by 12-25%: ingestion skip, eager loading, and consistent partitioning for edge reuse.
The Bottleneck: Graph Construction in Production
Kumo's Graph Transformer platform turns relational databases into temporal heterogeneous graphs, then runs transformer-based models over those graphs. Model training and inference get most of the attention, but in production the upstream stages (ingestion, materialization, graph engine startup) often consume a significant share of the total wall-clock time.
Two usage patterns make this especially painful:
- Iterative development. Data scientists experiment rapidly: modifying which tables to include, adjusting primary keys, tweaking time columns, changing data types. Each configuration change historically triggered a full pipeline re-run, even when the underlying source data had not changed.
- Repeated production training. Models retrain on fresh data daily or hourly. In most cases only a handful of tables contain new rows, while the majority remain unchanged. Yet the system rebuilt everything from scratch on every run.
The result: teams spent more time waiting for data preparation than for actual model training. Three targeted optimizations, released in Kumo versions 2.11 and 2.12, address this directly. Combined, they deliver 12-25% faster end-to-end training times, averaging around 20% improvement across customer workloads.
Anatomy of the Kumo Pipeline
Before examining the optimizations, it helps to understand the stages that every Kumo training run executes. The pipeline has four major stages, each with distinct compute characteristics.
Stage 1: Ingestion
Ingestion copies data from the customer's source system (Snowflake, Databricks, S3) into Kumo's secure data plane. The goal is to create a stable snapshot that downstream stages can process without contending with live updates. During ingestion, the system also performs data type conversion for each column and data cleaning (filtering null values in primary keys and timestamps).
Stage 2: Table materialization
Materialization transforms the ingested tabular data into tensor format suitable for graph processing. Each table's rows are assigned contiguous integer indices, and features are encoded and stored in PyTorch Geometric's remote backend. This stage converts human-readable database rows into the dense numerical representations the model consumes.
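The index assignment can be sketched in a few lines. This is a simplified illustration, not Kumo's implementation: `materialize_table` and the pandas-based representation are assumptions for the sake of the example.

```python
import pandas as pd

def materialize_table(df: pd.DataFrame, pk: str):
    """Assign each row a contiguous integer index (0..n-1) and build a
    primary-key -> index mapping that edge materialization can later use
    to translate foreign keys into row indices."""
    df = df.reset_index(drop=True)  # contiguous integer row indices
    pk_to_index = {key: i for i, key in enumerate(df[pk])}
    return df, pk_to_index

users = pd.DataFrame({"user_id": ["u3", "u1", "u2"], "age": [31, 25, 47]})
users, user_index = materialize_table(users, pk="user_id")
# user_index == {"u3": 0, "u1": 1, "u2": 2}
```

Note that the mapping depends entirely on row order, which is exactly why deterministic ordering matters later for edge caching.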
Stage 3: Edge materialization
Foreign key relationships between tables become edges in the graph. Edge materialization performs join operations on primary-foreign key pairs to produce edge lists. These are stored in a remote graph store. For large databases with many foreign key relationships, this stage can be the most expensive part of the pipeline.
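Conceptually, edge materialization is a join that translates each foreign key into a pair of integer row indices. A minimal pandas sketch (the table names and `row_idx` column are illustrative assumptions, not Kumo's schema):

```python
import pandas as pd

# Toy tables: users (PK user_id) and orders (FK user_id).
users = pd.DataFrame({"user_id": ["u1", "u2", "u3"]})
orders = pd.DataFrame({"order_id": ["o1", "o2", "o3"],
                       "user_id": ["u2", "u2", "u1"]})

# Both tables carry contiguous integer indices from table materialization.
users["row_idx"] = range(len(users))
orders["row_idx"] = range(len(orders))

# The PK-FK join turns every foreign key into a (src_idx, dst_idx) edge.
joined = orders.merge(users, on="user_id", suffixes=("_order", "_user"))
edge_list = [(int(s), int(d))
             for s, d in zip(joined["row_idx_order"], joined["row_idx_user"])]
# edge_list == [(0, 1), (1, 1), (2, 0)]
```

At production scale this join runs on the distributed backend rather than in pandas, which is what makes it one of the priciest stages.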
Stage 4: Graph engine startup
The graph engine (feature store, graph store, and neighbor sampler) must load all materialized data before training can begin. In the original design, this loading phase waited until all materialization completed, creating idle time between the end of materialization and the start of training.
Kumo training pipeline (before optimization):
- Ingestion — copy from Snowflake / Databricks / S3, type conversion, null filtering
- Table materialization — convert rows to tensor format, assign contiguous indices
- Edge materialization — join on FK relationships to produce edge lists
- Graph engine load — load feature store, graph store, sampler (waits for all materialization)
- Model training — graph transformer forward/backward passes
Optimization 1: Ingestion Skip
The first optimization targets a specific pain point in iterative development. When a data scientist changes a table's configuration (adjusting a data type, selecting a different primary key, modifying the time column), the original system triggered a full re-ingestion of that table from the source system.
This was wasteful. The source data itself had not changed. Only the interpretation of that data (which column is the primary key, what type to assign a column) had changed. Re-copying terabytes from Snowflake or Databricks just because a data scientist tweaked a configuration parameter added minutes or hours of unnecessary wait time.
The fix: decouple configuration from ingestion
The solution separates the concern of “what data do we have?” from “how do we interpret it?” Configuration changes (data type conversions, primary key selection, time column adjustments) now apply lazily during the materialization stage rather than forcing a data re-copy during ingestion.
Ingestion only re-runs when the source data itself has changed. If a data scientist is experimenting with different graph designs on the same underlying dataset, ingestion is skipped entirely on subsequent runs.
| Scenario | Before | After | Improvement |
|---|---|---|---|
| Change primary key on large table | Full re-ingestion | Skip to materialization | ~50% faster refresh |
| Adjust data type on a column | Full re-ingestion | Skip to materialization | ~50% faster refresh |
| Modify time column selection | Full re-ingestion | Skip to materialization | ~50% faster refresh |
| Source data actually changed | Full ingestion | Full ingestion | No change (correct behavior) |
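The skip decision reduces to fingerprinting the source snapshot while keeping configuration out of the fingerprint. A minimal sketch, with hypothetical helper names and metadata fields (Kumo's actual change detection is not documented here):

```python
import hashlib
import json

def snapshot_fingerprint(source_meta: dict) -> str:
    """Fingerprint of the data itself (e.g. warehouse last-modified time,
    row count), deliberately independent of any Kumo-side configuration."""
    blob = json.dumps(source_meta, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

def should_reingest(current_meta: dict, cached_fingerprint: str) -> bool:
    # PK choice, dtypes, and time-column selection are not part of
    # source_meta, so config changes can never force a re-ingestion.
    return snapshot_fingerprint(current_meta) != cached_fingerprint

meta = {"table": "orders", "last_altered": "2024-06-01", "row_count": 10_000}
cached = snapshot_fingerprint(meta)
assert not should_reingest(meta, cached)                       # same data: skip
assert should_reingest({**meta, "row_count": 10_001}, cached)  # new rows: re-ingest
```

The key design choice is what goes into the fingerprint: anything configuration-derived would reintroduce the spurious re-ingestions the optimization removes.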
Optimization 2: Graph Engine Eager Loading
In the original pipeline, the graph engine (feature store, graph store, neighbor sampler) only began initializing after all table and edge materialization completed. On a database with 20 tables, the engine sat idle while the last few tables finished materializing, even though the first 18 had been ready for minutes.
The fix: progressive warm-up
The feature store and graph store now begin loading as soon as each individual table or edge set finishes materializing. Instead of a synchronization barrier that waits for the slowest table, the engine progressively warms up during the materialization phase.
By the time the last table or edge set completes, the engine has already loaded the vast majority of the graph. The remaining startup time is just the delta: loading the final few artifacts rather than the entire graph from scratch.
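The pattern is replacing a barrier with completion-order loading. A self-contained sketch using Python's standard library (the `materialize` and `load_into_engine` stubs are stand-ins for the real stages, not Kumo APIs):

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def materialize(table: str) -> str:
    time.sleep(0.01)       # stand-in for real materialization work
    return table

def load_into_engine(table: str, engine: list) -> None:
    engine.append(table)   # stand-in for feature/graph store loading

engine: list = []
tables = ["users", "orders", "items", "events"]

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(materialize, t) for t in tables]
    # Load each artifact the moment it finishes, rather than waiting
    # for the slowest table behind a synchronization barrier.
    for fut in as_completed(futures):
        load_into_engine(fut.result(), engine)

# By the time the loop exits, the engine is fully warm.
assert sorted(engine) == sorted(tables)
```

The barrier version would be a single `wait(futures)` followed by one bulk load; the `as_completed` form overlaps loading with the remaining materialization work.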
Pipeline with eager loading (overlap between stages):
- Ingestion — copy and clean source data
- Materialization — tables and edges processed in parallel
- Eager engine load — feature/graph stores load progressively as each table/edge completes
- Training starts earlier — 30+ minutes saved on medium-to-large graphs
The practical impact scales with graph size. For small graphs with only a few tables, the engine loads quickly regardless. For medium-to-large graphs with dozens of tables and complex foreign key relationships, eager loading means training starts 30+ minutes earlier than it did before.
Optimization 3: Consistent Partitioning for Edge Reuse
Edge materialization is one of the most expensive stages in the pipeline. It performs join operations across tables to produce edge lists from foreign key relationships. In repeated training scenarios (retraining on fresh data daily or hourly), much of this work is redundant: if neither the source table nor the target table has changed, the edges between them are identical.
The obvious solution is caching: store previously computed edges and reuse them when the inputs have not changed. But naive caching introduces a subtle and dangerous correctness risk.
The consistency problem
During table materialization, each row is assigned a contiguous integer index. These indices are what edges reference: edge (3, 7) means “row 3 in the source table connects to row 7 in the target table.” If the indexing is not perfectly deterministic, the same row could receive index 3 on one run and index 5 on the next. Cached edges would then point to the wrong rows, producing silently incorrect training data.
In distributed compute environments (Spark, Databricks, Snowflake), deterministic indexing is not guaranteed by default. Shuffle operations, parallelism settings, and Parquet block sizes can all change row ordering between runs.
The fix: deterministic indexing
Kumo built a deterministic indexing system that enforces consistent partitioning across all supported compute backends. The system locks down:
- Shuffle configurations: fixed partition counts and hash functions so rows land in the same partitions across runs
- Sort orders: explicit ordering within each partition ensures row sequence is reproducible
- Parquet block sizes: fixed block boundaries prevent row reordering during serialization and deserialization
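The core idea can be illustrated in pure Python: partition by an explicit, process-independent hash, then sort within each partition, so the final index depends only on the keys and never on arrival order. A simplified sketch (the function names are illustrative; the real system enforces this on Spark, Databricks, and Snowflake):

```python
import hashlib

def stable_partition(key: str, num_partitions: int) -> int:
    """Platform-independent hash partitioning. Python's built-in hash()
    is salted per process and would not be reproducible across runs."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

def deterministic_index(keys: list, num_partitions: int = 4) -> dict:
    partitions = [[] for _ in range(num_partitions)]
    for k in keys:
        partitions[stable_partition(k, num_partitions)].append(k)
    index, i = {}, 0
    for part in partitions:
        for k in sorted(part):  # explicit sort order within each partition
            index[k] = i
            i += 1
    return index

# The arrival order differs between runs; the index assignment does not.
run_a = deterministic_index(["u3", "u1", "u2", "u4"])
run_b = deterministic_index(["u2", "u4", "u1", "u3"])
assert run_a == run_b
```

Without both the fixed hash and the within-partition sort, an edge cached as `(3, 7)` could silently point at different rows on the next run.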
With deterministic indexing in place, the system can safely compare the current run's inputs against the cached edge data. If the primary keys, time columns, and source data for both tables in a foreign key relationship are unchanged, the cached edges are reused directly.
| Condition | Edge Reuse? | Rationale |
|---|---|---|
| Source data unchanged, same PK/time columns | Yes | Deterministic indexing guarantees identical edge lists |
| Source data changed, same PK/time columns | No | New rows produce new indices and new edges |
| Same source data, different PK selected | No | Different PK changes the join semantics |
| Same source data, different time column | No | Temporal edges depend on the time column choice |
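The reuse conditions in the table reduce to an equality check over each endpoint table's state. A hedged sketch (the `TableState` structure is an assumption made for illustration):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class TableState:
    data_fingerprint: str       # hash of the ingested snapshot
    primary_key: str
    time_column: Optional[str]  # None for tables with no temporal edges

def can_reuse_edges(prev: Tuple[TableState, TableState],
                    curr: Tuple[TableState, TableState]) -> bool:
    """Cached edges are valid only when BOTH endpoint tables are
    byte-identical and interpreted with the same PK and time column."""
    return prev == curr

old = (TableState("abc", "user_id", "ts"), TableState("def", "order_id", "ts"))
assert can_reuse_edges(old, old)                      # unchanged: reuse
new_pk = (TableState("abc", "email", "ts"), old[1])   # same data, new PK
assert not can_reuse_edges(old, new_pk)               # join semantics changed
```

Any change to a field invalidates the cache, which is conservative by design: a stale edge list would corrupt training data silently, while an unnecessary rebuild only costs time.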
Benchmark Results
Kumo validated the combined optimizations using a synthetic benchmark based on the H&M retail dataset, mirroring common production usage patterns. The benchmark framework automatically logs timing data across pipeline stages via Jenkins integration, with Grafana dashboards planned for continuous monitoring.
Synthetic benchmark (H&M dataset)
The combined effect of all three optimizations on the H&M benchmark:
| Optimization | Target Stage | Standalone Impact |
|---|---|---|
| Ingestion skip | Ingestion | ~50% reduction in refresh time during iterative development |
| Eager loading | Engine startup | 30+ minutes saved on medium-to-large graphs |
| Consistent partitioning | Edge materialization | Full edge reuse when source data is unchanged |
| All three combined | End-to-end | ~25% total runtime reduction |
Production customer results
The optimizations were validated on two production customer workloads with different characteristics:
| Customer | Workload Type | Improvement | Notes |
|---|---|---|---|
| Food delivery company | Repeated daily training | ~25% | Graph construction was a large share of total runtime |
| Social media platform | Massive-scale graphs | 12.5% | Model training dominates runtime; optimizations target pre-training stages |
The variance between 12.5% and 25% is informative. The food delivery company's workload spent a larger fraction of total runtime on ingestion and materialization, so optimizing those stages had a proportionally bigger impact. The social media platform runs massive-scale graphs where model training itself is the dominant cost. Even there, a 12.5% end-to-end improvement is significant when training runs take hours.
Practical Implications and Availability
These three optimizations address the two most common production workflows: iterative experimentation and repeated retraining. The improvements are purely in the data preparation layer. No model architecture changes, no accuracy tradeoffs, no changes to the training algorithm.
For data scientists iterating on graph design
Ingestion skip is the highest-impact change. Every time you adjust a primary key, change a data type, or modify a time column, the system no longer re-copies data from your warehouse. Experimentation cycles that previously took an hour now complete in roughly half that time. This compounds across a typical development session where a data scientist might try 10-20 configuration variants.
For production teams retraining models
Consistent partitioning and eager loading deliver the bulk of the value. Edge reuse eliminates redundant computation when most tables have not changed between training runs. Eager loading ensures the graph engine is warmed up by the time the last table finishes materializing. Together, these reduce the wall-clock gap between “new data arrives” and “model training begins.”
Availability
- Kumo v2.11: Initial optimizations (ingestion skip, early eager loading)
- Kumo v2.12: Full optimization suite enabled, including consistent partitioning for edge reuse
All optimizations are enabled by default. No configuration changes are required.
Try KumoRFM on your own data
Zero-shot predictions are free. Fine-tuning is available with a trial.