Kumo’s online serving runs predictions at request time, which suits live recommendations, real-time fraud scoring, and any use case where you need a score in milliseconds rather than from a scheduled job. Unlike batch prediction, which scores your entire dataset at once, online serving keeps a live endpoint running that you query with a single entity at a time.
The deployment steps below mirror the interactive SDK example notebook, which you can download and run end to end.
How it works
Online serving in Kumo combines two models:
- A base model trained on your full graph that produces rich entity embeddings capturing long-term patterns.
- A distilled model that runs at request time, combining those stored embeddings with the latest signals to produce a fast, accurate prediction.
The workflow has five steps:
1. Train a base model on your graph.
2. Train a distilled model using the base model’s embeddings.
3. Generate embeddings from the base model and export the serving bundle to S3.
4. Deploy the bundle to a live inference endpoint.
5. Query the endpoint from your application.
The deployment steps use the kumoai.online deployment SDK.
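To make the division of labour between the two models concrete, here is a toy sketch in plain Python. It is not Kumo's implementation: the embedding store, the weights, and the `score` function are all hypothetical stand-ins. It only illustrates the idea of looking up a precomputed embedding and combining it with request-time signals in a small, fast model.

```python
import math

# Hypothetical precomputed embeddings, standing in for the base model's output.
EMBEDDING_STORE = {
    "user_1": [0.2, -0.1, 0.7],
    "user_2": [-0.5, 0.3, 0.1],
}

# Hypothetical weights of a tiny "distilled" model: three weights for the
# embedding dimensions, followed by two weights for request-time features.
WEIGHTS = [0.8, -0.4, 1.2, 0.5, -0.3]
BIAS = -0.1


def score(entity_id, fresh_features):
    """Combine a stored embedding with request-time features into one score."""
    embedding = EMBEDDING_STORE[entity_id]      # cheap lookup at request time
    inputs = embedding + list(fresh_features)   # concatenate both signal sources
    if len(inputs) != len(WEIGHTS):
        raise ValueError(
            f"expected {len(WEIGHTS)} inputs in total, got {len(inputs)}"
        )
    logit = BIAS + sum(w * x for w, x in zip(WEIGHTS, inputs))
    return 1.0 / (1.0 + math.exp(-logit))       # logistic output in (0, 1)
```

The expensive work (learning embeddings over the full graph) happens ahead of time, so the request path is only a lookup plus a small computation.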
Train the base model
Train your base model as usual, following the same workflow covered in Training & Predictions. Save the job ID, since you’ll need it in the next step.
Train the distilled model
The distilled model is a smaller, faster model trained to predict at request time. It takes the base model’s embeddings as inputs, so pass base_model_id to link the two.
pq_serving is a PredictiveQuery on the same graph as pq, targeting the entity you want to score at request time — for example, a transaction or a recommendation candidate.
Generate embeddings and export
Run batch prediction on the base model to produce entity embeddings, then export both the distilled model and the embeddings to S3 as a ready-to-deploy bundle.
Export targets S3 URIs (s3://…). Contact your Kumo team if you need to export to a different storage provider.
Connect to the deployment control plane
With your bundle in S3, switch to the kumoai.online SDK to deploy and manage the live service.
Install the SDK:
| Variable | What it is |
|---|---|
| BASE_URL | Your control-plane API URL |
| CLIENT_ID | Cognito app client ID |
| CLIENT_SECRET | Cognito app client secret |
| TOKEN_URL | Cognito token endpoint |
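For context on how these values fit together: a Cognito token endpoint accepts the standard OAuth 2.0 client-credentials grant, with the client ID and secret sent as HTTP Basic credentials. The sketch below builds such a request using only the Python standard library. It assumes the four values are available as environment variables, and the helper names `build_token_request` and `request_from_env` are hypothetical. The kumoai.online SDK may well handle authentication for you, so treat this as an illustration of the flow rather than a required step.

```python
import base64
import os
import urllib.parse
import urllib.request

REQUIRED_VARS = ("BASE_URL", "CLIENT_ID", "CLIENT_SECRET", "TOKEN_URL")


def build_token_request(token_url, client_id, client_secret):
    """Build (but do not send) an OAuth 2.0 client-credentials token request."""
    # HTTP Basic credentials: base64("client_id:client_secret")
    raw = f"{client_id}:{client_secret}".encode("utf-8")
    basic = base64.b64encode(raw).decode("ascii")
    body = urllib.parse.urlencode({"grant_type": "client_credentials"})
    return urllib.request.Request(
        token_url,
        data=body.encode("ascii"),
        method="POST",
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )


def request_from_env(env=os.environ):
    """Read the variables from the table and fail early if any is missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing variables: {', '.join(missing)}")
    return build_token_request(
        env["TOKEN_URL"], env["CLIENT_ID"], env["CLIENT_SECRET"]
    )
```

Sending the request with `urllib.request.urlopen(request_from_env())` should return a JSON body containing an `access_token` field, which is then presented to the API at BASE_URL.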
Register your model
Point the SDK at the S3 bundle from Step 3. The last path segment becomes the model name.
Deploy an inference service
Choose a GPU instance type and create the service. g6.4xlarge is a good starting point for most models. The service takes 1–5 minutes to start, so poll until it’s ready.
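The polling itself can be a short loop with a timeout. The sketch below is generic rather than specific to the kumoai.online SDK: `get_status` is a hypothetical callable you would wrap around whatever status call the SDK provides, and the status strings "READY" and "FAILED" are placeholders, not documented values.

```python
import time


def wait_until_ready(get_status, timeout_s=600.0, interval_s=10.0,
                     ready="READY", failed=("FAILED",),
                     sleep=time.sleep, clock=time.monotonic):
    """Call get_status() repeatedly until it reports the ready state.

    Raises RuntimeError if a failure state is seen and TimeoutError if the
    deadline passes first. sleep and clock are injectable for testing.
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status == ready:
            return status
        if status in failed:
            raise RuntimeError(f"service entered failure state {status!r}")
        if clock() >= deadline:
            raise TimeoutError(f"service still {status!r} after {timeout_s}s")
        sleep(interval_s)  # wait before asking again
```

The default timeout of ten minutes leaves headroom over the 1–5 minute startup window, and failing fast on a failure state avoids waiting out the full timeout when the service will never come up.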
Run inference
Send a request to your live endpoint. Input names and shapes come from your exported model’s config.pbtxt.
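config.pbtxt is the model configuration format used by NVIDIA Triton Inference Server, which speaks the KServe v2 inference protocol. Assuming your endpoint accepts that protocol's JSON form (an assumption; check the example notebook for the exact request format), a request body can be assembled as sketched below. The helper `build_infer_payload` and the input name `entity_embedding` in the usage note are hypothetical; take the real names, datatypes, and shapes from your config.pbtxt.

```python
import json


def build_infer_payload(inputs):
    """Build a KServe v2-style JSON inference body.

    inputs maps each input name to (datatype, shape, flat row-major data),
    for example {"x": ("FP32", [1, 3], [0.1, 0.2, 0.3])}.
    """
    payload = {"inputs": []}
    for name, (datatype, shape, data) in inputs.items():
        # The flattened data must contain exactly prod(shape) elements.
        expected = 1
        for dim in shape:
            expected *= dim
        if expected != len(data):
            raise ValueError(
                f"input {name!r}: shape {list(shape)} needs {expected} "
                f"values, got {len(data)}"
            )
        payload["inputs"].append({
            "name": name,
            "shape": list(shape),
            "datatype": datatype,
            "data": list(data),
        })
    return payload
```

For example, `json.dumps(build_infer_payload({"entity_embedding": ("FP32", [1, 3], [0.2, -0.1, 0.7])}))` yields a body that a KServe v2 server would accept at its `/v2/models/<model name>/infer` route. Validating the element count locally catches shape mismatches before they turn into server-side errors.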
Clean up
Delete the service and registered model when you’re done to avoid ongoing costs.
Next Steps
The example notebook covers two additional features once you’re comfortable with the basics.
Autoscaling: automatically scale replicas based on CPU usage.
Canary deployments: call svc.promote() to make the canary the new stable version. If something goes wrong, call svc.rollback() to route all traffic back to the original.
See also
- Introduction — set up connectors, tables, graphs, and predictive queries.
- Training & Predictions — train models and generate batch predictions.