This walkthrough assumes you have completed the introduction—you can use a connector, build a Graph, define a PredictiveQuery, and call fit(). On top of that, it covers distillation, batch prediction for embeddings, artifact export, and how those connect to inference, using one concrete graph below.
Running example: H&M retail graph
The three-table retail pattern below (customers, transactions, articles) was used previously in the introduction and appears in many Kumo examples. The scenario is inspired by the H&M personalized fashion recommendation dataset from RelBench (see the RelBench repository).
The fraud label and transaction_id primary key in the code below are illustrative.
Why two predictive queries?
The example uses two PredictiveQuery objects on the same graph:
- `pq_churn` — Trains the deep GNN on churn at the customer level (10-day activity). That job produces node embeddings you reuse when exporting artifacts for online serving.
- `pq_fraud` — Defines the serving task (fraud on each transaction) that the distilled model will score at low latency.
End-to-end flow
- Train the base GNN with `pq_churn` (churn on `customers`).
- Distill with `pq_fraud` (fraud on `transactions`) using `suggest_distilled_model_plan(..., base_model_id=...)` and `DistillationTrainer`.
- Batch-predict embeddings with `load()` on the base job id, a prediction table from `pq_churn`, and `output_types` including `embeddings`—export consumes this batch job.
- Export with `export_model()`: `training_job_id` is the **distilled** job, `batch_prediction_job_id` is that embedding job, `output_path` is your bundle prefix (e.g. S3).
- Deploy and infer from the exported artifacts; managed hosting is set up through Kumo.
Step 1 — Train the base (deep) model on churn
Use `Trainer` and `fit()` on `pq_churn`'s training table and suggested model plan. Save `base_job_id`.
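A minimal sketch of this step, assuming a `kumoai`-style import alias and that `fit()` returns a job object exposing a `job_id` (exact module paths and return types may differ in your SDK version):

```python
import kumoai as kumo

# Generate the churn training table and a suggested model plan for pq_churn.
train_table = pq_churn.generate_training_table()
model_plan = pq_churn.suggest_model_plan()

# Train the base (deep) GNN and keep its job id for later steps.
trainer = kumo.Trainer(model_plan)
training_job = trainer.fit(graph, train_table)
base_job_id = training_job.job_id  # reused by distillation, batch prediction, and export
```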
Step 2 — Suggest and train the distilled fraud model
On the fraud query's PredictiveQuery object (`pq_fraud` in the example), call `suggest_distilled_model_plan()` with `base_model_id=base_job_id`. The platform checks graph/encoder alignment and that embedding keys resolve to base entities (in the example, `customers` is a base entity, so each transaction uses the deep model's customer embedding via `transactions.customer_id`).
Train with `DistillationTrainer` on the same graph and the fraud query's training table—not the churn table. The YAML block below shows the shape of the distillation section inside the returned `DistilledModelPlan`. Compare those fields to the distillation block in your plan object—numeric offsets, keys, and hop strings can differ from the example.
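An illustrative sketch of the two calls, assuming `DistillationTrainer` mirrors `Trainer`'s `fit()` interface and that `base_job_id` was saved in Step 1 (signatures may differ):

```python
# Suggest a distilled plan anchored on the base (churn) model's job id.
distilled_plan = pq_fraud.suggest_distilled_model_plan(base_model_id=base_job_id)

# Train the low-latency distilled model on the *fraud* training table,
# not the churn table used for the base model.
fraud_train_table = pq_fraud.generate_training_table()
distill_trainer = kumo.DistillationTrainer(distilled_plan)
distilled_job = distill_trainer.fit(graph, fraud_train_table)
distilled_job_id = distilled_job.job_id  # this is the job id export_model() needs
```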
Example distillation section
The following YAML is not something you paste into the SDK by hand; it mirrors part of what `suggest_distilled_model_plan()` returns inside `DistilledModelPlan`.
- `embedding_keys` — Foreign keys on the fact row (here `transactions`) that point to base entities whose deep embeddings are attached for distillation (here `transactions.customer_id` → customer embedding).
- `max_embedding_offset` / `min_embedding_offset` — How far back (and how "fresh") the base embedding is allowed to be relative to the prediction time; your plan may use different values.
- `real_time_offset` — How RTI history is anchored in time relative to the request (confirm on your plan).
- `real_time_interactions` — Maps an RTI hop key (a path through the graph) to a maximum sequence length (here 32 recent transactions along the hop below). That same hop string appears again in Triton input tensor names at inference.
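An illustrative shape for that distillation section, echoing the keys and the 32-interaction hop described in this guide—the offset values here are placeholders, and your plan's actual numbers and hop strings will differ:

```yaml
distillation:
  embedding_keys:
    - transactions.customer_id        # attaches the deep customer embedding
  max_embedding_offset: 30            # illustrative value; read from your plan
  min_embedding_offset: 1             # illustrative value; read from your plan
  real_time_offset: 0                 # illustrative value; confirm on your plan
  real_time_interactions:
    transactions.customer_id->customers.customer_id->transactions.customer_id: 32
```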
Step 3 — Batch prediction for embeddings (export input)
The export step needs a finished batch prediction job whose outputs include embeddings from the base (churn) model; see end_to_end_flow. Load the base trainer, build a churn prediction table, and call `predict()` with `embeddings` in `output_types` (details in Batch Prediction in trainer). Keep `bp_job_id`.
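A sketch of this step, assuming `Trainer.load()` accepts the training job id and that output destination arguments (connector, table name) are configured elsewhere—exact parameters may differ:

```python
# Reload the base (churn) trainer by its training job id.
base_trainer = kumo.Trainer.load(base_job_id)

# Build a prediction table from the churn query and request embeddings.
pred_table = pq_churn.generate_prediction_table()
bp_job = base_trainer.predict(
    graph,
    pred_table,
    output_types={'predictions', 'embeddings'},  # embeddings are required by export
)
bp_job_id = bp_job.job_id  # export_model() consumes this batch prediction job
```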
Step 4 — Export artifacts
`export_model()` (also `kumoai.export_model`) with `ModelOutputConfig` copies the online serving model directory and bundles `embeddings.parquet` from `bp_job_id` into `output_path`; see end_to_end_flow. Use `non_blocking=True` for an `ArtifactExportJob`, or `False` to block until an `ArtifactExportResult` is returned.
Object storage: Export targets S3-style URIs (s3://…) in typical flows. Contact Kumo if you need to export to another blob store.
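A sketch of the export call; whether `training_job_id` and `batch_prediction_job_id` are passed directly or via `ModelOutputConfig` may differ from what is shown, and the S3 prefix is a placeholder:

```python
from kumoai import export_model, ModelOutputConfig

# training_job_id is the *distilled* job; batch_prediction_job_id is the
# embedding batch job from Step 3; output_path is your bundle prefix.
export_job = export_model(
    training_job_id=distilled_job_id,
    batch_prediction_job_id=bp_job_id,
    output_config=ModelOutputConfig(output_path='s3://my-bucket/fraud-bundle/'),
    non_blocking=True,   # returns an ArtifactExportJob; False blocks until
)                        # an ArtifactExportResult is available
```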
The `export_model` / `ModelOutputConfig` API does not ask you for a model name string. A fixed serving-side name (for example `online-model`) can be applied when Kumo wires your bundle into managed inference, without changing this SDK call.
Step 5 — Deploy and run inference
Step 4 produces a Triton model repository: the online serving model layout plus bundled `embeddings.parquet` (and related artifacts), ready to load in NVIDIA Triton Inference Server. See the Triton Inference Server documentation for how Triton loads model repositories and exposes the HTTP/gRPC V2 inference API.
Managed deployment: Hosting exported model artifacts in production via KServe with Triton is arranged through Kumo. Contact your Kumo team for setup, URLs, and authentication.
Self-managed deployment: If you already operate a Triton-compatible inference stack, Kumo can provide a container image and guidance to run it with the artifacts from Step 4. Contact your Kumo team for details.
Request shape (example below, batch size 1): Inputs are a flat list of named tensors, exactly as in your exported config.pbtxt.
- `anchor_time` — INT64, shape `[1, 1]`, nanoseconds since Unix epoch.
- Fact row — `{table}.{column}` on the scored entity (in the example, `transactions.*`), including the embedding foreign key (e.g. `transactions.customer_id`).
- RTI history — `{RTI_key}:{column}` where `RTI_key` matches `real_time_interactions` in the distillation plan (in the example, the three-segment hop `transactions.customer_id->customers.customer_id->transactions.customer_id`). Use the same feature columns as on the fact row when both are modeled (e.g. `price`, `article_id`, `t_dat`). Shape `[1, seq_len, 1]` with actual `seq_len` (no zero-padding to the configured max such as 32).
Tensor names and dtypes come from your exported config.pbtxt. In the example request, 42 is the customer id whose base-model embedding comes from embeddings.parquet. Fact tensors describe the current transaction; RTI tensors use seq_len = 3 (three prior interactions). String/categorical columns may use BYTES in Triton—follow your generated config.
Example infer call (Triton V2 HTTP on localhost:8000; replace host, port, and model name—see NVIDIA’s Triton docs for response format and errors):
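A minimal sketch of such a request body using only the standard library. The tensor names follow the `{table}.{column}` and `{RTI_key}:{column}` conventions above; the host, port, model name (`online-model`), feature columns, and values are placeholders—take the real names and dtypes from your exported config.pbtxt:

```python
import json
import urllib.request
from datetime import datetime, timezone

# The RTI hop key from the distillation plan, reused in tensor names.
HOP = "transactions.customer_id->customers.customer_id->transactions.customer_id"

# anchor_time: nanoseconds since the Unix epoch, shape [1, 1].
anchor_ns = int(datetime(2020, 9, 1, tzinfo=timezone.utc).timestamp() * 1e9)

payload = {
    "inputs": [
        {"name": "anchor_time", "datatype": "INT64", "shape": [1, 1],
         "data": [[anchor_ns]]},
        # Fact row: the transaction being scored, including the embedding FK.
        {"name": "transactions.customer_id", "datatype": "BYTES", "shape": [1, 1],
         "data": [["42"]]},
        {"name": "transactions.price", "datatype": "FP32", "shape": [1, 1],
         "data": [[0.05]]},
        # RTI history: three prior interactions along the hop, shape [1, 3, 1]
        # (actual seq_len, not padded to the configured max of 32).
        {"name": f"{HOP}:price", "datatype": "FP32", "shape": [1, 3, 1],
         "data": [[[0.02], [0.03], [0.01]]]},
    ]
}

req = urllib.request.Request(
    "http://localhost:8000/v2/models/online-model/infer",  # placeholder URL/name
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# response = urllib.request.urlopen(req)  # uncomment against a live Triton server
```

The Triton V2 HTTP response mirrors this shape, returning an `outputs` list of named tensors; see NVIDIA's Triton docs for the response format and error codes.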
See also
- introduction — start here if you have not built a graph and trained a first model yet.
- trainer — trainer, batch prediction, distillation, and artifact export API reference.