There is a persistent myth in enterprise ML that real-time predictions are always better than batch. If real-time is harder and more expensive, it must be more valuable. The reality is more nuanced. Some predictions need to be real-time. Many do not. And the architecture difference between the two is not just latency. It is an entirely different system design with different failure modes, cost structures, and operational complexity.
McKinsey estimates that only 15-20% of enterprise ML use cases genuinely require real-time inference. The rest can run in batch with no loss in business value. Getting this classification right saves millions in infrastructure and engineering time.
latency_requirements — common use cases
| Use Case | Latency Budget | Data Freshness Needed | Mode | Annual Infrastructure Cost |
|---|---|---|---|---|
| Fraud detection | < 150ms | Real-time (seconds) | Real-time | $500K-5M |
| Dynamic pricing | < 200ms | Real-time (minutes) | Real-time | $200K-2M |
| Ad bidding | < 100ms | Real-time (seconds) | Real-time | $1M-10M |
| Churn prediction | Hours | Daily/weekly | Batch | $10K-50K |
| Demand forecasting | Hours | Daily | Batch | $20K-100K |
| Lead scoring | Minutes | Daily | Batch | $10K-30K |
| Credit portfolio risk | Hours | Monthly | Batch | $5K-20K |
Only 15-20% of enterprise ML use cases genuinely require real-time inference. The rest run in batch with no loss in business value.
cost_per_prediction — batch vs real-time at scale
| Scale | Batch Cost/Prediction | Real-Time Cost/Prediction | Cost Multiplier |
|---|---|---|---|
| 1M predictions/day | $0.002 | $0.05 | 25x |
| 10M predictions/day | $0.001 | $0.03 | 30x |
| 100M predictions/day | $0.0005 | $0.015 | 30x |
| 1B predictions/day | $0.0002 | $0.01 | 50x |
Real-time serving costs 25-50x more per prediction due to feature stores, model servers, load balancers, auto-scaling, and 24/7 monitoring.
Batch predictions: the workhorse
Batch prediction is straightforward. You run your model on a schedule (hourly, daily, weekly), score all relevant entities, and store the results in a database or cache. When the application needs a prediction, it reads the pre-computed result.
The architecture is simple: a scheduled job (Airflow, dbt, cron) triggers a feature computation pipeline, feeds the features to the model, writes the predictions to a table, and the application reads from that table. Each component is well-understood, easy to monitor, and easy to debug.
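The whole batch pattern fits in a few dozen lines. Here is a minimal sketch, assuming a SQLite table standing in for the warehouse and a placeholder `score()` function standing in for a trained model; in production a scheduler (Airflow, cron) would invoke `run_batch_job` on the cadence you choose.

```python
import sqlite3
from datetime import date

def score(features):
    # Placeholder model: churn risk grows with days since last purchase.
    # A real job would load a trained model artifact instead.
    return min(1.0, features["days_since_purchase"] / 365)

def run_batch_job(conn):
    # 1. Feature computation: one row of features per customer.
    rows = conn.execute(
        "SELECT customer_id, julianday('now') - julianday(last_purchase) "
        "FROM customers"
    ).fetchall()
    # 2. Score every entity, 3. write predictions to a table
    #    the application later reads from.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS churn_scores "
        "(customer_id TEXT, score REAL, scored_on TEXT)"
    )
    conn.executemany(
        "INSERT INTO churn_scores VALUES (?, ?, ?)",
        [(cid, score({"days_since_purchase": d}), date.today().isoformat())
         for cid, d in rows],
    )
    conn.commit()
```

The application side is then a plain database read: `SELECT score FROM churn_scores WHERE customer_id = ?`.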
Where batch works well: churn prediction (scored daily or weekly, actioned through marketing campaigns), demand forecasting (scored daily, used for inventory planning), lead scoring (scored when new leads enter the CRM), credit portfolio risk (scored monthly for regulatory reporting), and customer segmentation (scored weekly, used for targeted offers).
In every case, the prediction is consumed hours or days after computation. The data does not change meaningfully between scoring and action. Batch is the right architecture.
Batch advantages
Cost efficiency is the primary advantage. You run compute once and serve results from a database read. At scale, this means roughly $0.001 per prediction versus $0.01-0.10 per real-time prediction. For a retailer scoring 50 million products daily, that gap compounds into the difference between tens of thousands of dollars and millions of dollars per year in serving costs alone.
Operational simplicity is the second advantage. If a batch job fails, you re-run it. If a prediction is wrong, you debug it offline. There is no pager going off at 3am because real-time inference latency spiked during a traffic surge.
Debugging is the third advantage. You can inspect every feature, every model input, and every prediction in a batch run. The entire pipeline is reproducible. Real-time systems are notoriously hard to debug because the state that produced a given prediction is ephemeral.
Real-time predictions: when seconds matter
Real-time prediction means the model runs at request time, using the latest available data, and returns a result within milliseconds. The architecture is fundamentally different: an API receives a request, retrieves features from a feature store (pre-computed) and computes on-demand features from streaming data, feeds them to a model served on GPU or CPU infrastructure, and returns the result within a latency budget (typically 50-200ms end to end).
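The request path above can be sketched as a single handler. This is an illustrative skeleton, not a production server: `lookup_features` stands in for a feature-store read, `compute_streaming_features` for an on-demand aggregation, and `model` for the served model; all three names and values are assumptions.

```python
import time

LATENCY_BUDGET_MS = 200

def lookup_features(entity_id):
    # Stand-in for a pre-computed feature-store read (e.g. a Redis GET).
    return {"avg_amount_1h": 84.2}

def compute_streaming_features(entity_id):
    # Stand-in for an on-demand aggregation over recent events.
    return {"txn_count_15m": 3}

def model(features):
    # Placeholder model.
    return 0.1 if features["txn_count_15m"] < 5 else 0.9

def handle_request(entity_id):
    start = time.monotonic()
    # Merge pre-computed and on-demand features, then score.
    features = {**lookup_features(entity_id),
                **compute_streaming_features(entity_id)}
    prediction = model(features)
    elapsed_ms = (time.monotonic() - start) * 1000
    # In production, a budget breach here would trip alerts or a fallback.
    return {"score": prediction,
            "latency_ms": elapsed_ms,
            "within_budget": elapsed_ms <= LATENCY_BUDGET_MS}
```

Every step in this handler is on the critical path of the user request, which is exactly why the latency budget dominates the design.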
Where real-time is necessary: fraud detection (transaction must be scored before the 150ms authorization window closes), dynamic pricing (hotel room price must reflect current demand when the user loads the page), content recommendations (user intent changes with every click), ad bidding (bid must be computed within the 100ms auction window), and credit decisioning at point of sale (approval within seconds).
The feature store problem
The hardest part of real-time ML is not model serving. It is feature serving. A fraud detection model might need features like "number of transactions by this card in the last 15 minutes," "average transaction amount for this merchant in the last hour," and "number of distinct merchants this card has used today."
These features require real-time aggregation over streaming data. You need a system that maintains sliding-window aggregates with sub-second freshness. Tools like Tecton, Feast, and Redis handle this, but they cost $100K-500K/year and require dedicated engineering teams to maintain.
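The core data structure behind such a feature is small. Here is a minimal sliding-window counter of the "transactions in the last 15 minutes" kind, using a deque with eviction on read; timestamps are epoch seconds and the class name is illustrative. Real feature stores add persistence, fan-out, and sub-second freshness guarantees on top of this idea.

```python
from collections import deque

WINDOW_SECONDS = 15 * 60

class SlidingCount:
    """Count of events inside a trailing time window."""

    def __init__(self, window=WINDOW_SECONDS):
        self.window = window
        self.events = deque()  # timestamps, appended in arrival order

    def add(self, ts):
        self.events.append(ts)

    def count(self, now):
        # Evict events that have aged out of the window.
        while self.events and self.events[0] < now - self.window:
            self.events.popleft()
        return len(self.events)
```

The hard part in production is not this logic but running it per card, per merchant, and per window size, continuously, with the results readable in a few milliseconds.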
Worse, you need to ensure that the features computed in real-time match exactly what the model saw during training. This is the training-serving skew problem, and it is the number-one source of silent prediction errors in production ML systems.
Here is a concrete example of how training-serving skew causes silent failures:
training_time_features — computed in batch SQL
| customer_id | txn_count_7d | avg_amount_7d | distinct_merchants_7d |
|---|---|---|---|
| C-5001 | 12 | $84.20 | 6 |
| C-5002 | 3 | $210.50 | 2 |
| C-5003 | 8 | $45.00 | 5 |
Training features computed via SQL window functions on the data warehouse. The 7-day window is calendar-precise: exactly 168 hours of data.
serving_time_features — computed by feature store in real time
| customer_id | txn_count_7d | avg_amount_7d | distinct_merchants_7d |
|---|---|---|---|
| C-5001 | 11 | $86.40 | 6 |
| C-5002 | 3 | $210.50 | 2 |
| C-5003 | 7 | $47.10 | 4 |
Serving features computed by streaming aggregation with 5-minute event lag. C-5001 is missing 1 recent transaction. C-5003 has a different count and average because a late-arriving event was included at training time but not at serving time.
prediction_impact — skew effect on fraud scores
| customer_id | training_score | serving_score | drift | consequence |
|---|---|---|---|---|
| C-5001 | 0.12 (safe) | 0.09 (safe) | -0.03 | Minor: still safe |
| C-5002 | 0.87 (fraud) | 0.87 (fraud) | 0.00 | None: exact match |
| C-5003 | 0.71 (flag) | 0.58 (borderline) | -0.13 | Missed: fraud not flagged |
C-5003's serving features differ enough to drop the fraud score below the alert threshold. The model learned that 8 transactions at 5 merchants in 7 days is suspicious, but serving-time data shows 7 at 4. A real fraud case slips through.
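The mechanism behind the tables above is easy to reproduce in code. The sketch below computes the same 7-day transaction count two ways: a batch version that sees a calendar-precise window, and a streaming version whose pipeline ingests events 5 minutes late. Timestamps, the event list, and the lag value are all hypothetical.

```python
DAY = 86_400          # seconds
WINDOW = 7 * DAY      # calendar-precise 168 hours
LAG = 5 * 60          # streaming pipeline sees events 5 minutes late

def txn_count_7d_batch(events, now):
    # Warehouse SQL sees every event in exactly the last 7 days.
    return sum(1 for ts in events if now - WINDOW < ts <= now)

def txn_count_7d_streaming(events, now):
    # The streaming aggregate has not yet ingested events newer than
    # `now - LAG`, so very recent transactions are invisible to it.
    return sum(1 for ts in events if now - WINDOW < ts <= now - LAG)

now = 100 * DAY
events = [now - 3 * DAY, now - DAY, now - 60]  # last transaction 60s ago
batch = txn_count_7d_batch(events, now)
streaming = txn_count_7d_streaming(events, now)
```

Here `batch` is 3 while `streaming` is 2: the model was trained against the first definition and is served against the second, which is precisely the skew that drops C-5003 below the alert threshold.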
Infrastructure complexity
Real-time serving requires: load balancers to distribute requests, auto-scaling to handle traffic spikes (Black Friday, flash sales, market events), redundancy across availability zones, circuit breakers and fallback logic for when the model is slow or unavailable, and request queuing to handle burst traffic. A batch system needs none of this.
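Of the components listed, the circuit breaker is the easiest to show concretely. This is a deliberately minimal sketch of the pattern: after repeated model failures, stop calling the model and serve a conservative default. The thresholds and fallback score are assumptions, and a production breaker would also reset after a cooldown period.

```python
class CircuitBreaker:
    def __init__(self, max_failures=3, fallback_score=0.5):
        self.max_failures = max_failures
        self.failures = 0
        self.fallback_score = fallback_score

    def score(self, model_call, features):
        if self.failures >= self.max_failures:
            # Breaker open: skip the model entirely and serve the default.
            # (A production breaker would retry after a cooldown.)
            return self.fallback_score
        try:
            result = model_call(features)
            self.failures = 0  # a healthy call closes the breaker
            return result
        except Exception:
            self.failures += 1
            return self.fallback_score
```

The point of the pattern is that a slow or failing model degrades gracefully to a known-safe default instead of stalling every request behind it.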
The cost difference is substantial. Stripe has reported spending over $100M annually on ML infrastructure, with a large portion dedicated to real-time fraud scoring. Netflix spends an estimated $50M/year on recommendation serving infrastructure. For most enterprises, real-time ML infrastructure costs 5-10x what batch infrastructure costs for the same number of predictions.
Batch predictions
- Run on schedule (hourly/daily/weekly)
- $0.001 per prediction at scale
- Simple pipeline: compute, store, read
- Easy to debug and reproduce
- Stale data (hours to days old)
Real-time predictions
- Run on demand (50-200ms latency)
- $0.01-0.10 per prediction at scale
- Complex: feature store, model server, load balancer
- Hard to debug ephemeral state
- Fresh data (seconds to minutes old)
The hybrid architecture
Most production systems use both. Netflix runs batch recommendations for the homepage (pre-computed nightly for all users) and real-time re-ranking when you start browsing (adjusting based on your current session). Uber runs batch demand forecasts for capacity planning and real-time surge pricing for current rides.
The pattern is consistent: use batch as the baseline and layer real-time on top for the moments where freshness creates measurable value. This hybrid approach captures 80-90% of the value of full real-time at 30-40% of the infrastructure cost.
The challenge with hybrid architectures is maintaining two separate serving paths. The batch path has its own feature pipeline, its own scheduling, and its own storage. The real-time path has its own feature store, its own serving infrastructure, and its own monitoring. Both need to produce consistent predictions, which means features must be computed identically in both paths.
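One common mitigation, sketched here under illustrative names, is to define each feature exactly once as a plain function over raw events and call that same function from both paths, so the batch job and the real-time endpoint cannot drift apart in feature logic.

```python
DAY = 86_400  # seconds

def purchases_last_90d(events, now):
    # Single feature definition shared by both serving paths.
    return sum(1 for e in events if now - 90 * DAY < e["ts"] <= now)

def batch_features(all_events_by_customer, now):
    # Batch path: called nightly over every customer.
    return {cid: {"purchases_90d": purchases_last_90d(evts, now)}
            for cid, evts in all_events_by_customer.items()}

def realtime_features(events_for_customer, now):
    # Real-time path: called per request for one customer.
    return {"purchases_90d": purchases_last_90d(events_for_customer, now)}
```

This only removes logical skew; the two paths can still disagree on data freshness, as the fraud example above shows.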
PQL Query
```sql
-- Batch mode: score all customers nightly
PREDICT churn_30d
FOR EACH customers.customer_id

-- Real-time mode: score one customer at request time
PREDICT fraud_probability
FOR transactions.txn_id = 'T-90042'
```
The same model and the same query language serve both batch and real-time modes. No separate feature pipelines, no training-serving skew, no dual architecture to maintain.
Output
| mode | entity | prediction | latency | feature_pipeline |
|---|---|---|---|---|
| Batch | 120K customers | churn_30d scores | 45 seconds total | None |
| Real-time | 1 transaction | fraud_probability: 0.87 | 120ms | None |
How foundation models simplify both modes
The complexity in both batch and real-time serving comes primarily from the feature layer: feature engineering, feature stores, feature pipelines, and training-serving consistency. Remove the feature layer, and the architecture simplifies dramatically.
A relational foundation model like KumoRFM reads raw relational data directly. There is no feature engineering step. The model connects to your data warehouse or database, understands the schema through foreign keys, and generates predictions from the raw table structure using a graph transformer architecture.
For batch: point the model at your database, specify the prediction target, and score all entities. No feature pipeline to build, no feature store to maintain, no Airflow DAGs to debug.
For real-time: the model can serve predictions on individual entities as requests arrive. Because it reads directly from the data, there is no training-serving skew. The model sees the same data structure at training time and serving time.
The same model handles both modes. You are not maintaining two parallel architectures. One model, one data connection, two serving patterns.
Making the decision
Four questions determine whether a use case needs real-time predictions:
1. What is the decision window? If the decision happens within seconds of the triggering event (a transaction, a page load, an API call), you need real-time. If the decision happens hours or days later (a marketing campaign, a staffing plan, a risk report), batch is sufficient.
2. How fast does the input data change? If the features that drive the prediction change between scoring and action, you need real-time. A fraud model using "transactions in the last 5 minutes" is stale within minutes. A churn model using "purchases in the last 90 days" is fresh for days.
3. What is the value of freshness? Even when real-time is technically possible, the incremental accuracy gain may not justify the cost. If your batch churn model is 92% accurate and a real-time version would be 93%, the 1% improvement may not justify 5-10x the infrastructure spend.
4. Can you afford the operational complexity? Real-time ML requires 24/7 monitoring, on-call rotations, and rapid incident response. If your team is already stretched maintaining batch pipelines, adding real-time is likely to create more problems than it solves.
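The four questions above can be folded into a single illustrative helper. The inputs, thresholds, and ratios here are assumptions for the sake of the sketch, not a formal framework.

```python
def needs_realtime(decision_window_s, feature_halflife_s,
                   accuracy_gain_pct, cost_multiplier, team_has_oncall):
    # Q1: only seconds-scale decision windows force real-time.
    seconds_scale = decision_window_s < 60
    # Q2: features must go stale quickly for freshness to matter
    #     (illustrative threshold: meaningful change within an hour).
    data_changes_fast = feature_halflife_s < 3600
    # Q3: the accuracy gain must justify the cost multiplier
    #     (illustrative ratio: >0.1 points of accuracy per x of cost).
    worth_it = accuracy_gain_pct / cost_multiplier > 0.1
    # Q4: no on-call capacity, no real-time.
    return (seconds_scale and data_changes_fast
            and worth_it and team_has_oncall)
```

A fraud use case (sub-second window, minutes-scale feature freshness, large accuracy uplift) passes all four gates; a churn model (days-scale window, 90-day features) fails at the first.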
Start with batch. Prove the model delivers value. Then identify the specific use cases where freshness creates measurable uplift, and invest in real-time serving only for those. This sequence reduces risk, controls cost, and ensures that your real-time infrastructure investment is justified by demonstrated business impact.