TL;DR

  • The highest-value predictions in the enterprise (fraud blocked at swipe time, recommendations ranked based on the user's session context, risk scored on every login) share one hard constraint: the prediction has to be made in under 100 ms, on live signals.
  • Relational Models built on Graph Transformer architectures are state-of-the-art for these problems. But they're notoriously hard to scale and even harder to serve in real time. Only a few hyper-scalers (Netflix, Pinterest, Google) have "cracked it".
  • Kumo Online Serving is the first product offering to make predictions from fine-tuned Graph Transformer Models broadly accessible: at scale, and in real time.
  • Sub-100 ms latency at thousands of QPS, deployable anywhere (Kumo SaaS or the customer's cloud).
  • A novel two-stage train-then-distill architecture delivers the predictive power of a deep Graph Transformer (without running it live) at the cost and latency of running a shallow model 24/7.

Why Fine-Tuned Relational Models

Graph Transformers (RelGT), including recent Fine-Tuned Foundation Model variants like KumoRFM-2 and the Netflix Foundation Model, are the winning architecture for predictive AI for fraud detection, recommendations, risk scoring and a growing list of relational prediction problems (RelBench). They learn directly over messy, multi-tabular enterprise data, capturing patterns that traditional ML architectures and manually crafted features cannot detect. That's why hyperscalers (Netflix, Pinterest, Google) have spent years and hundreds of millions of dollars on R&D and infrastructure to scale them.

The Challenge: Hard to scale. Even harder to serve in real time

Two obstacles stand between Relational Models built on Graph Transformer architectures and production systems. The first is scale: fine-tuning a Graph Transformer over terabytes of relational data is a significant engineering challenge, one Kumo solved years ago (see: Large-Scale Training of Graph Transformers). The second is online serving: delivering low-latency predictions at high throughput while incorporating real-time signals, such as in-session clicks or live transaction patterns, without sacrificing accuracy.

High-impact use cases that benefit from both graph architectures (capturing signals across complex relational data) and low-latency serving include:

  • Real-time fraud detection: credit card fraud, account takeover, payment risk
  • Real-time recommendations: feeds and content ranking that reflect a user's current session
  • Real-time credit decisions: default risk scored on quick-turnaround micro-loans

These are the use cases where the Graph Transformer's advantage compounds: every signal arriving between requests can change the prediction, and a prediction that is wrong, or too slow, shows up immediately as lost revenue or eroded user trust.

Kumo Online Serving is the first product offering to solve both.

Kumo Online Serving In Action

Before jumping into the technical details, here's a concrete example: in-session recommendations for a food delivery application.

Video 1: Online serving in action for a food delivery application. Personalized, in-session recommendations adapted using real-time signals: time, location, clicks, views, and add-to-cart actions.

Without online serving enabled, the application gives me reasonable predictions based on where I normally live and what I've ordered before. The problem is that those recommendations miss the moment: it's a different time of the day, I'm on business travel in a different city, and I'm in the mood for something specific. My clicks and views could tell the model what I'm actually craving, but without real-time signals on the request path, the recommendations stay anchored to my long-term profile, so I end up filtering and searching.

With online serving enabled, the recommendations update in under a second on every click and signal from my current session. Within a few interactions, the feed surfaces exactly what I'm craving, with no additional postprocessing. The serving stack delivers this in real time at the scale of thousands of concurrent active user sessions, all deciding what to order for dinner.

Figure 1: Real-time signals from the user's session (time of day, location, clicks, views, and add-to-cart actions) feed into the online serving model and reshape the recommendations in under a second.

Introducing Our Two-Stage Fine-Tuning Architecture

A fine-tuned Relational Graph Transformer delivers the best predictive performance, but full graph traversal at inference time is computationally expensive (and that compute translates directly into infrastructure cost) and not viable within a sub-100 ms latency budget. We developed a novel two-stage approach that produces a much smaller model with equivalent predictive capability, designed for real-time serving:

  • Stage 1: Fine-tune the deep Relational Model on the full customer data: the graph representing all tables, entities, and relationships.
  • Stage 2: Distill the deep model into a much smaller shallow model, trained on the embeddings produced in Stage 1, which is what gets deployed for online serving.

Figure 2: End-to-end Kumo Online Serving pipeline. Offline, Stage 1 fine-tunes the deep Relational Model and Stage 2 distills it into a shallow model. The distilled model is then deployed to Kumo SaaS or the customer's VPC for live online inference.

Note: The three offline workloads (fine-tuning the deep model, running batch inference to refresh entity embeddings, and training the distilled shallow model) can run at independent cadences. In practice, batch inference runs most frequently to keep the embeddings up to date (e.g., daily), while the deep and distilled models are refreshed less often (e.g., weekly or monthly).
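
Concretely, the decoupling can be pictured as three independent schedules. The job names and cadences below are illustrative assumptions, not Kumo's configuration surface.

```python
# Illustrative only: the three offline workloads run on independent schedules.
# Job names and cadences are assumptions for the sake of the example.
OFFLINE_JOBS = {
    "finetune_deep_model":       {"stage": 1, "cadence": "monthly"},  # full Graph Transformer fine-tune
    "refresh_entity_embeddings": {"stage": 1, "cadence": "daily"},    # batch inference over the graph
    "train_distilled_model":     {"stage": 2, "cadence": "weekly"},   # distill into the shallow serving model
}
```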

Stage 1: Fine-Tuning the Relational Model

First, we fine-tune the Relational Model on the full relational graph. This model captures the entire behavioral history of every entity: multiple hops of connections, long time horizons, and rich cross-entity signals. It is the most expressive model Kumo builds, but it is designed for offline batch predictions, not real-time inference.

A key output of this stage, beyond predictions, is node embeddings: dense vectors that compress the deep model's multi-hop understanding of each entity (customer) and target (restaurant) into a representation that can be precomputed and cached.

Figure 3: Stage 1: The Deep Relational Model is fine-tuned on the full relational graph (past views, orders, ratings, profile information, and entity relationships), producing dense entity embeddings that compress the full user history into cached vectors.
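
For illustration, the offline job that keeps these cached embeddings fresh could look like the sketch below. The names (`deep_model`, `iter_entity_batches`, `embedding_cache`) are hypothetical stand-ins, not Kumo's APIs; the point is that the expensive multi-hop computation happens offline, at batch-inference cadence.

```python
import numpy as np

# Illustrative sketch of the offline embedding-refresh job. All names here
# (deep_model, iter_entity_batches, embedding_cache) are hypothetical stand-ins.
def refresh_entity_embeddings(deep_model, iter_entity_batches, embedding_cache):
    """Precompute deep-model embeddings for every entity and cache them by ID."""
    for batch in iter_entity_batches():              # e.g., all customers and restaurants
        # The expensive multi-hop graph traversal happens here, offline,
        # where latency does not matter.
        vectors = deep_model.embed(batch.subgraphs)  # [batch_size, emb_dim]
        for entity_id, vec in zip(batch.entity_ids, vectors):
            embedding_cache[entity_id] = np.asarray(vec, dtype=np.float32)
```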

Stage 2: Distilling the Fine-Tuned Model

We then train a much smaller model, the distilled shallow GNN, using two inputs:

  • Entity embeddings from the Deep Graph Transformer, encoding long-term, multi-hop behavioral context
  • Recent 1-hop interactions (e.g., recent orders and views), supplied at inference time

Figure 4: Stage 2: the shallow distilled model is trained on two inputs, the entity embeddings from the deep model and the most recent 1-hop interactions supplied at inference time.

The shallow, distilled model reuses Kumo's Graph Transformer backbone but drops the positional encodings of graph structure: with no graph to traverse, it needs only node-type and time encodings.

Figure 5: At inference time, historical context for the user enters the shallow distilled model as a precomputed embedding, alongside real-time signals from the current session, and the model returns a prediction in under 100 ms.

The embedding encodes deep historical context: multi-hop behavioral patterns that would otherwise require full graph traversal to compute. Recent interactions capture fast-changing signals, such as a user who just browsed a new category or a suspicious transaction pattern from the last few minutes. Together, they give the distilled model both long-term pattern recognition and real-time sensitivity. The result: a model up to 10x smaller, orders of magnitude faster, with no graph infrastructure on the request path, and deployable in any serving stack with minimal dependencies.
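
To make the combination concrete, here is a minimal sketch of how the two inputs could come together in the distilled model's forward pass. This is an illustrative MLP stand-in, not Kumo's actual architecture (which reuses the Graph Transformer backbone with node-type and time encodings); all class and argument names are assumptions.

```python
import torch
import torch.nn as nn

class DistilledScorer(nn.Module):
    """Illustrative stand-in for the distilled shallow model (not Kumo's architecture)."""

    def __init__(self, emb_dim: int, event_dim: int, hidden: int = 128):
        super().__init__()
        # Encodes each recent 1-hop interaction (the real-time signal).
        self.event_encoder = nn.Sequential(nn.Linear(event_dim, hidden), nn.ReLU())
        # Combines the cached long-term context with the pooled session signal.
        self.head = nn.Sequential(
            nn.Linear(emb_dim + hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, entity_emb: torch.Tensor, recent_events: torch.Tensor) -> torch.Tensor:
        # entity_emb:    [batch, emb_dim]             precomputed by the deep model (Stage 1)
        # recent_events: [batch, n_events, event_dim] supplied with the request
        session = self.event_encoder(recent_events).mean(dim=1)     # pool the recent events
        return self.head(torch.cat([entity_emb, session], dim=-1))  # e.g., click or fraud score
```

The key property is that nothing in this forward pass touches a graph: the multi-hop context arrives pre-baked in the entity embedding, and the only per-request work is encoding a handful of recent events.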

Experimental Results

We validated the distilled model across fraud detection and recommendation tasks, on both open-source benchmarks and real customer datasets. In all cases, it matches the performance of much larger deep GNNs.

Chart 1: Predictive performance across the three benchmarks. On backtesting, the distilled model matches the full Fine-Tuned Graph Transformer (the offline model) and substantially outperforms the LightGBM baselines on every task.

Datasets

  • Credit card fraud: an open-source transaction-level fraud detection benchmark. We use the public AWS Fraud Dataset Benchmark, which incorporates the Kaggle credit card fraud dataset.
  • Large Food Delivery: a proprietary dataset. The task is to predict whether a notification impression is clicked, given transaction-level features, user and notification embeddings, and recent user and notification interactions.
  • Large Online Travel: a proprietary dataset. The task is to predict whether a booking transaction is fraudulent, given transaction-level features, user embeddings, and recent user interactions.

We use proprietary datasets for the food delivery and online travel benchmarks in the absence of public datasets at comparable scale.

Infrastructure

A great model is only useful if you can serve it. The distilled architecture gives us a model that is small and graph-free; the infrastructure is what turns that into a service that takes a JSON request and returns a prediction in under 100 ms, at thousands of requests per second, behind a single authenticated endpoint.
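
For a sense of what the request path looks like from the application's side, here is a hedged sketch of a prediction call. The URL, auth header, and payload field names are illustrative assumptions, not Kumo's documented API.

```python
import requests

# Hypothetical request to an online-serving endpoint (URL, auth header, and
# field names are illustrative assumptions, not Kumo's documented API).
payload = {
    "entity_id": "user_12345",
    "recent_interactions": [
        {"type": "view", "item_id": "rest_987", "timestamp": "2025-06-01T18:42:10Z"},
        {"type": "add_to_cart", "item_id": "item_555", "timestamp": "2025-06-01T18:43:02Z"},
    ],
}
resp = requests.post(
    "https://serving.example.com/v1/models/recsys/predict",
    json=payload,
    headers={"Authorization": "Bearer <token>"},
    timeout=0.5,  # the full round trip is expected to stay well under 100 ms
)
print(resp.json())  # e.g., ranked items or a risk score
```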

The Stack

Figure 6: The Kumo Online Serving stack: NVIDIA Triton Inference Server (inference and dynamic batching) on KServe (Kubernetes-native model lifecycle), running identically on EKS, GKE, or AKS inside the customer's cloud account.

We run online serving on NVIDIA Triton Inference Server, deployed via KServe on Kubernetes. Triton handles inference and uses dynamic batching to coalesce concurrent requests into GPU-friendly batch sizes, keeping GPU utilization high and latency predictable. KServe provides Kubernetes-native model lifecycle management: pulling artifacts from object storage, scaling pods with load, and producing a deployment shape that runs identically on EKS, GKE, or AKS. This matters because online serving ultimately lives inside the customer's cloud account, near their application. A lean control plane sits on top and exposes a purpose-built API for the workflows that matter most: ship a new version, run it as a canary, promote it, monitor it, roll it back.
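
As a rough sketch of what that deployment shape looks like at the Kubernetes layer (a standard KServe InferenceService backed by the Triton runtime), the example below uses the Kubernetes Python client. The names, namespace, and storage URI are placeholders, and in practice Kumo's control plane drives this rather than the user.

```python
from kubernetes import client, config

# Illustrative: create a KServe InferenceService backed by Triton via the
# Kubernetes API. Names, namespace, and storageUri are placeholder assumptions;
# Kumo's control plane wraps this lifecycle behind its own API.
config.load_kube_config()
inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "distilled-recsys", "namespace": "serving"},
    "spec": {
        "predictor": {
            "model": {
                "modelFormat": {"name": "triton"},
                "storageUri": "s3://models/distilled-recsys/v7",
                "resources": {"limits": {"nvidia.com/gpu": "1"}},
            }
        }
    },
}
client.CustomObjectsApi().create_namespaced_custom_object(
    group="serving.kserve.io", version="v1beta1",
    namespace="serving", plural="inferenceservices", body=inference_service,
)
```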

Canary Deployment

Rather than swapping model versions in place (which pushes coordination onto every consuming team), our control plane treats version upgrades as a first-class workflow. One call deploys the new version as a canary behind the existing endpoint. Subsequent calls ramp up traffic as confidence grows. Promotion swaps it in as the primary, all without the endpoint URL changing or going unavailable. Rolling back is the same workflow in reverse.
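
At the KServe layer, one way such a ramp can be expressed is by patching the canary traffic split on the predictor. The sketch below is illustrative (placeholder names), and Kumo's control plane wraps these steps behind single API calls.

```python
from kubernetes import client, config

# Illustrative: after updating the InferenceService to the new model version,
# route a slice of traffic to the newest revision behind the same endpoint,
# then raise the percentage as confidence grows (e.g., 10 -> 50 -> promote).
config.load_kube_config()
patch = {"spec": {"predictor": {"canaryTrafficPercent": 10}}}
client.CustomObjectsApi().patch_namespaced_custom_object(
    group="serving.kserve.io", version="v1beta1",
    namespace="serving", plural="inferenceservices",
    name="distilled-recsys", body=patch,
)
```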

Managing models in production

Triton emits per-model latency, throughput, and batching metrics; KServe surfaces pod health and autoscaling state. The control plane publishes everything into Prometheus with pre-built Grafana dashboards that answer the questions that come up during a rollout: is the new variant healthy, is autoscaling keeping up with load, and where is latency being spent? The result is that a new model reaches production through a sequence of small, observable, single-call steps.
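
As an example of the kind of question those dashboards answer, the sketch below queries two of Triton's standard Prometheus metrics directly: per-model throughput and average queue time. The Prometheus URL is a placeholder, and the dashboards Kumo ships are not reproduced here.

```python
import requests

# Illustrative PromQL against Triton's standard Prometheus metrics
# (nv_inference_request_success, nv_inference_queue_duration_us, nv_inference_exec_count).
PROM = "http://prometheus.monitoring:9090/api/v1/query"  # placeholder URL
queries = {
    "throughput_rps": 'sum(rate(nv_inference_request_success[1m])) by (model)',
    "avg_queue_ms": 'sum(rate(nv_inference_queue_duration_us[1m])) by (model)'
                    ' / sum(rate(nv_inference_exec_count[1m])) by (model) / 1000',
}
for name, q in queries.items():
    print(name, requests.get(PROM, params={"query": q}).json()["data"]["result"])
```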

Deployment Performance

We benchmarked the distilled serving stack on a large food-delivery model using one g6.4xlarge predictor pod (one NVIDIA L4 GPU, 16 vCPU, 64 GB RAM).

Setup

  • Hardware: One g6.4xlarge predictor pod per replica: one NVIDIA L4 GPU, 16 vCPU, 64 GB RAM.
  • Payload: One entity ID plus a variable number of recent 1-hop interactions, N ∈ {4, 8, 16, 32} per ragged table; ~4.5–12 KB JSON. High-cardinality string fields (session IDs, event IDs) drawn from finite pools sized to a realistic 3-minute production window.
  • Load generation: Open-loop traffic (constant arrival rate) against the inference endpoint, round-robined across 128 pre-generated payload files so the client never replays bytewise-identical requests.
  • Latency reporting: End-to-end at the client: wall time from request out to response in, including TLS, ingress, auth, and inference.
  • Measurement protocol: 30 s warm-up + 3 min measurement window per load step. Autoscaling disabled: replica count held constant so per-pod and scaling questions are isolated from autoscaler dynamics.
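
For reference, an open-loop load generator of the kind described above can be sketched in a few lines. The endpoint and payload files are placeholders, and this is not the exact harness used for these benchmarks.

```python
import asyncio
import json
import time

import aiohttp

async def open_loop(url, payload_files, qps, duration_s):
    """Fire requests at a constant arrival rate, independent of response times."""
    payloads = [json.load(open(p)) for p in payload_files]   # pre-generated request bodies
    interval = 1.0 / qps
    latencies = []
    async with aiohttp.ClientSession() as session:
        async def one_call(body):
            t0 = time.perf_counter()
            async with session.post(url, json=body) as resp:
                await resp.read()
            latencies.append(time.perf_counter() - t0)       # end-to-end client latency

        tasks, i = [], 0
        t_next = time.perf_counter()
        t_end = t_next + duration_s
        while time.perf_counter() < t_end:
            tasks.append(asyncio.create_task(one_call(payloads[i % len(payloads)])))
            i += 1
            t_next += interval                               # open loop: schedule, don't wait
            await asyncio.sleep(max(0.0, t_next - time.perf_counter()))
        await asyncio.gather(*tasks)
    return latencies
```

Calling `asyncio.run(open_loop(url, files, qps=800, duration_s=180))` reproduces one 3-minute load step at a fixed offered rate.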

Single-instance throughput and latency

On one g6.4xlarge predictor pod, equipped with a single NVIDIA L4 (a previous-generation, cost-effective GPU), we swept offered QPS from low load through saturation:

A single pod sustains ~1,000 QPS at p99 ≤ 75 ms, with latency staying flat from 200 through 1,000 QPS before crossing the 100 ms budget at 1,100 QPS. Zero errors throughout.

Chart 2: Latency versus offered QPS on one g6.4xlarge predictor pod. Latency stays flat from 200 through 1,000 QPS before crossing the 100 ms budget at 1,100 QPS.

Horizontal scaling

We held per-pod load constant at 800 QPS and scaled from 1 to 8 replicas:

Chart 3: Aggregate throughput and p99 latency from 1 to 8 replicas at 800 QPS per pod. Throughput scales near-linearly while p99 holds inside the 100 ms budget at every step (6,400 QPS at p99 ≈ 92 ms with 8 pods).

Throughput scales near-linearly through 8 pods, with p99 holding inside the 100 ms budget at every step. Each pod is a self-contained inference unit with no shared state and no cross-pod coordination, so scaling is simply adding pods. At 8 pods: 6,400 QPS aggregate at p99 ≈ 92 ms, with zero errors.

No graph infrastructure on the request path means no graph database to size, replicate, or maintain. This is the payoff of the two-stage architecture: the representational power of a deep relational model, at the cost and latency profile of a shallow one.

Get Started

Acknowledgements

We thank the entire Kumo team for their invaluable support. Special thanks go to Myungwhan Kim, Manush Murali, Hema Raghavan, Zack Drach, Raja Rao DV, Effy Fang, and Abdullah Al-chihabi.