There is a persistent myth in enterprise ML that real-time predictions are always better than batch. If real-time is harder and more expensive, it must be more valuable. The reality is more nuanced. Some predictions need to be real-time. Many do not. And the architecture difference between the two is not just latency. It is an entirely different system design with different failure modes, cost structures, and operational complexity.
McKinsey estimates that only 15-20% of enterprise ML use cases genuinely require real-time inference. The rest can run in batch with no loss in business value. Getting this classification right saves millions in infrastructure and engineering time.
latency_requirements — common use cases
| Use Case | Latency Budget | Data Freshness Needed | Mode | Annual Infrastructure Cost |
|---|---|---|---|---|
| Fraud detection | < 150ms | Real-time (seconds) | Real-time | $500K-5M |
| Dynamic pricing | < 200ms | Real-time (minutes) | Real-time | $200K-2M |
| Ad bidding | < 100ms | Real-time (seconds) | Real-time | $1M-10M |
| Churn prediction | Hours | Daily/weekly | Batch | $10K-50K |
| Demand forecasting | Hours | Daily | Batch | $20K-100K |
| Lead scoring | Minutes | Daily | Batch | $10K-30K |
| Credit portfolio risk | Hours | Monthly | Batch | $5K-20K |
Only 15-20% of enterprise ML use cases genuinely require real-time inference. The rest run in batch with no loss in business value.
cost_per_prediction — batch vs real-time at scale
| Scale | Batch Cost/Prediction | Real-Time Cost/Prediction | Cost Multiplier |
|---|---|---|---|
| 1M predictions/day | $0.002 | $0.05 | 25x |
| 10M predictions/day | $0.001 | $0.03 | 30x |
| 100M predictions/day | $0.0005 | $0.015 | 30x |
| 1B predictions/day | $0.0002 | $0.01 | 50x |
Real-time serving costs 25-50x more per prediction due to feature stores, model servers, load balancers, auto-scaling, and 24/7 monitoring.
Batch predictions: the workhorse
Batch prediction is straightforward. You run your model on a schedule (hourly, daily, weekly), score all relevant entities, and store the results in a database or cache. When the application needs a prediction, it reads the pre-computed result.
The architecture is simple: a scheduled job (Airflow, dbt, cron) triggers a feature computation pipeline, feeds the features to the model, writes the predictions to a table, and the application reads from that table. Each component is well-understood, easy to monitor, and easy to debug.
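The whole batch pattern fits in a few dozen lines. Here is a minimal sketch, assuming a SQLite table standing in for the warehouse and a placeholder `score()` function standing in for a trained model; in production a scheduler (Airflow, cron) would invoke `run_batch_job` on the cadence you choose.

```python
import sqlite3
from datetime import date

def score(features):
    # Placeholder model: churn risk grows with days since last purchase.
    # A real job would load a trained model artifact instead.
    return min(1.0, features["days_since_purchase"] / 365)

def run_batch_job(conn):
    # 1. Feature computation: one row of features per customer.
    rows = conn.execute(
        "SELECT customer_id, julianday('now') - julianday(last_purchase) "
        "FROM customers"
    ).fetchall()
    # 2. Score every entity, 3. write predictions to a table
    #    the application later reads from.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS churn_scores "
        "(customer_id TEXT, score REAL, scored_on TEXT)"
    )
    conn.executemany(
        "INSERT INTO churn_scores VALUES (?, ?, ?)",
        [(cid, score({"days_since_purchase": d}), date.today().isoformat())
         for cid, d in rows],
    )
    conn.commit()
```

The application side is then a plain database read: `SELECT score FROM churn_scores WHERE customer_id = ?`.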
Where batch works well: churn prediction (scored daily or weekly, actioned through marketing campaigns), demand forecasting (scored daily, used for inventory planning), lead scoring (scored when new leads enter the CRM), credit portfolio risk (scored monthly for regulatory reporting), and customer segmentation (scored weekly, used for targeted offers).
In every case, the prediction is consumed hours or days after computation. The data does not change meaningfully between scoring and action. Batch is the right architecture.
Batch advantages
Cost efficiency is the primary advantage. You run compute once and serve results from a database read. At scale, this means roughly $0.001 per prediction versus $0.01-0.10 per real-time prediction. For a retailer scoring 50 million products daily, that gap compounds into the difference between tens of thousands of dollars and millions of dollars per year in serving costs alone.
Operational simplicity is the second advantage. If a batch job fails, you re-run it. If a prediction is wrong, you debug it offline. There is no pager going off at 3am because real-time inference latency spiked during a traffic surge.
Debugging is the third advantage. You can inspect every feature, every model input, and every prediction in a batch run. The entire pipeline is reproducible. Real-time systems are notoriously hard to debug because the state that produced a given prediction is ephemeral.
Real-time predictions: when seconds matter
Real-time prediction means the model runs at request time, using the latest available data, and returns a result within milliseconds. The architecture is fundamentally different: an API receives a request, retrieves features from a feature store (pre-computed) and computes on-demand features from streaming data, feeds them to a model served on GPU or CPU infrastructure, and returns the result within a latency budget (typically 50-200ms end to end).
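The request path above can be sketched as a single handler. This is an illustrative skeleton, not a production server: `lookup_features` stands in for a feature-store read, `compute_streaming_features` for an on-demand aggregation, and `model` for the served model; all three names and values are assumptions.

```python
import time

LATENCY_BUDGET_MS = 200

def lookup_features(entity_id):
    # Stand-in for a pre-computed feature-store read (e.g. a Redis GET).
    return {"avg_amount_1h": 84.2}

def compute_streaming_features(entity_id):
    # Stand-in for an on-demand aggregation over recent events.
    return {"txn_count_15m": 3}

def model(features):
    # Placeholder model.
    return 0.1 if features["txn_count_15m"] < 5 else 0.9

def handle_request(entity_id):
    start = time.monotonic()
    # Merge pre-computed and on-demand features, then score.
    features = {**lookup_features(entity_id),
                **compute_streaming_features(entity_id)}
    prediction = model(features)
    elapsed_ms = (time.monotonic() - start) * 1000
    # In production, a budget breach here would trip alerts or a fallback.
    return {"score": prediction,
            "latency_ms": elapsed_ms,
            "within_budget": elapsed_ms <= LATENCY_BUDGET_MS}
```

Every step in this handler is on the critical path of the user request, which is exactly why the latency budget dominates the design.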
Where real-time is necessary: fraud detection (transaction must be scored before the 150ms authorization window closes), dynamic pricing (hotel room price must reflect current demand when the user loads the page), content recommendations (user intent changes with every click), ad bidding (bid must be computed within the 100ms auction window), and credit decisioning at point of sale (approval within seconds).
The feature store problem
The hardest part of real-time ML is not model serving. It is feature serving. A fraud detection model might need features like "number of transactions by this card in the last 15 minutes," "average transaction amount for this merchant in the last hour," and "number of distinct merchants this card has used today."
These features require real-time aggregation over streaming data. You need a system that maintains sliding-window aggregates with sub-second freshness. Tools like Tecton, Feast, and Redis handle this, but they cost $100K-500K/year and require dedicated engineering teams to maintain.
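The core data structure behind such a feature is small. Here is a minimal sliding-window counter of the "transactions in the last 15 minutes" kind, using a deque with eviction on read; timestamps are epoch seconds and the class name is illustrative. Real feature stores add persistence, fan-out, and sub-second freshness guarantees on top of this idea.

```python
from collections import deque

WINDOW_SECONDS = 15 * 60

class SlidingCount:
    """Count of events inside a trailing time window."""

    def __init__(self, window=WINDOW_SECONDS):
        self.window = window
        self.events = deque()  # timestamps, appended in arrival order

    def add(self, ts):
        self.events.append(ts)

    def count(self, now):
        # Evict events that have aged out of the window.
        while self.events and self.events[0] < now - self.window:
            self.events.popleft()
        return len(self.events)
```

The hard part in production is not this logic but running it per card, per merchant, and per window size, continuously, with the results readable in a few milliseconds.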
Worse, you need to ensure that the features computed in real-time match exactly what the model saw during training. This is the training-serving skew problem, and it is the number-one source of silent prediction errors in production ML systems.
Here is a concrete example of how training-serving skew causes silent failures:
training_time_features — computed in batch SQL
| customer_id | txn_count_7d | avg_amount_7d | distinct_merchants_7d |
|---|---|---|---|
| C-5001 | 12 | $84.20 | 6 |
| C-5002 | 3 | $210.50 | 2 |
| C-5003 | 8 | $45.00 | 5 |
Training features computed via SQL window functions on the data warehouse. The 7-day window is calendar-precise: exactly 168 hours of data.
serving_time_features — computed by feature store in real time
| customer_id | txn_count_7d | avg_amount_7d | distinct_merchants_7d |
|---|---|---|---|
| C-5001 | 11 | $86.40 | 6 |
| C-5002 | 3 | $210.50 | 2 |
| C-5003 | 7 | $47.10 | 4 |
Serving features computed by streaming aggregation with 5-minute event lag. C-5001 is missing 1 recent transaction. C-5003 has a different count and average because a late-arriving event was included at training time but not at serving time.
prediction_impact — skew effect on fraud scores
| customer_id | training_score | serving_score | drift | consequence |
|---|---|---|---|---|
| C-5001 | 0.12 (safe) | 0.09 (safe) | -0.03 | Minor: still safe |
| C-5002 | 0.87 (fraud) | 0.87 (fraud) | 0.00 | None: exact match |
| C-5003 | 0.71 (flag) | 0.58 (borderline) | -0.13 | Missed: fraud not flagged |
C-5003's serving features differ enough to drop the fraud score below the alert threshold. The model learned that 8 transactions at 5 merchants in 7 days is suspicious, but serving-time data shows 7 at 4. A real fraud case slips through.
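The mechanism behind the tables above is easy to reproduce in code. The sketch below computes the same 7-day transaction count two ways: a batch version that sees a calendar-precise window, and a streaming version whose pipeline ingests events 5 minutes late. Timestamps, the event list, and the lag value are all hypothetical.

```python
DAY = 86_400          # seconds
WINDOW = 7 * DAY      # calendar-precise 168 hours
LAG = 5 * 60          # streaming pipeline sees events 5 minutes late

def txn_count_7d_batch(events, now):
    # Warehouse SQL sees every event in exactly the last 7 days.
    return sum(1 for ts in events if now - WINDOW < ts <= now)

def txn_count_7d_streaming(events, now):
    # The streaming aggregate has not yet ingested events newer than
    # `now - LAG`, so very recent transactions are invisible to it.
    return sum(1 for ts in events if now - WINDOW < ts <= now - LAG)

now = 100 * DAY
events = [now - 3 * DAY, now - DAY, now - 60]  # last transaction 60s ago
batch = txn_count_7d_batch(events, now)
streaming = txn_count_7d_streaming(events, now)
```

Here `batch` is 3 while `streaming` is 2: the model was trained against the first definition and is served against the second, which is precisely the skew that drops C-5003 below the alert threshold.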
Infrastructure complexity
Real-time serving requires: load balancers to distribute requests, auto-scaling to handle traffic spikes (Black Friday, flash sales, market events), redundancy across availability zones, circuit breakers and fallback logic for when the model is slow or unavailable, and request queuing to handle burst traffic. A batch system needs none of this.
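Of the components listed, the circuit breaker is the easiest to show concretely. This is a deliberately minimal sketch of the pattern: after repeated model failures, stop calling the model and serve a conservative default. The thresholds and fallback score are assumptions, and a production breaker would also reset after a cooldown period.

```python
class CircuitBreaker:
    def __init__(self, max_failures=3, fallback_score=0.5):
        self.max_failures = max_failures
        self.failures = 0
        self.fallback_score = fallback_score

    def score(self, model_call, features):
        if self.failures >= self.max_failures:
            # Breaker open: skip the model entirely and serve the default.
            # (A production breaker would retry after a cooldown.)
            return self.fallback_score
        try:
            result = model_call(features)
            self.failures = 0  # a healthy call closes the breaker
            return result
        except Exception:
            self.failures += 1
            return self.fallback_score
```

The point of the pattern is that a slow or failing model degrades gracefully to a known-safe default instead of stalling every request behind it.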
The cost difference is substantial. Stripe has reported spending over $100M annually on ML infrastructure, with a large portion dedicated to real-time fraud scoring. Netflix spends an estimated $50M/year on recommendation serving infrastructure. For most enterprises, real-time ML infrastructure costs 5-10x what batch infrastructure costs for the same number of predictions.
Batch predictions
- Run on schedule (hourly/daily/weekly)
- $0.001 per prediction at scale
- Simple pipeline: compute, store, read
- Easy to debug and reproduce
- Stale data (hours to days old)
Real-time predictions
- Run on demand (50-200ms latency)
- $0.01-0.10 per prediction at scale
- Complex: feature store, model server, load balancer
- Hard to debug ephemeral state
- Fresh data (seconds to minutes old)
The hybrid architecture
Most production systems use both. Netflix runs batch recommendations for the homepage (pre-computed nightly for all users) and real-time re-ranking when you start browsing (adjusting based on your current session). Uber runs batch demand forecasts for capacity planning and real-time surge pricing for current rides.
The pattern is consistent: use batch as the baseline and layer real-time on top for the moments where freshness creates measurable value. This hybrid approach captures 80-90% of the value of full real-time at 30-40% of the infrastructure cost.
The challenge with hybrid architectures is maintaining two separate serving paths. The batch path has its own feature pipeline, its own scheduling, and its own storage. The real-time path has its own feature store, its own serving infrastructure, and its own monitoring. Both need to produce consistent predictions, which means features must be computed identically in both paths.
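One common mitigation, sketched here under illustrative names, is to define each feature exactly once as a plain function over raw events and call that same function from both paths, so the batch job and the real-time endpoint cannot drift apart in feature logic.

```python
DAY = 86_400  # seconds

def purchases_last_90d(events, now):
    # Single feature definition shared by both serving paths.
    return sum(1 for e in events if now - 90 * DAY < e["ts"] <= now)

def batch_features(all_events_by_customer, now):
    # Batch path: called nightly over every customer.
    return {cid: {"purchases_90d": purchases_last_90d(evts, now)}
            for cid, evts in all_events_by_customer.items()}

def realtime_features(events_for_customer, now):
    # Real-time path: called per request for one customer.
    return {"purchases_90d": purchases_last_90d(events_for_customer, now)}
```

This only removes logical skew; the two paths can still disagree on data freshness, as the fraud example above shows.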
PQL Query
```sql
-- Batch mode: score all customers nightly
PREDICT churn_30d
FOR EACH customers.customer_id

-- Real-time mode: score one customer at request time
PREDICT fraud_probability
FOR transactions.txn_id = 'T-90042'
```
The same model and the same query language serve both batch and real-time modes. No separate feature pipelines, no training-serving skew, no dual architecture to maintain.
Output
| mode | entity | prediction | latency | feature_pipeline |
|---|---|---|---|---|
| Batch | 120K customers | churn_30d scores | 45 seconds total | None |
| Real-time | 1 transaction | fraud_probability: 0.87 | 120ms | None |
How foundation models simplify both modes
The complexity in both batch and real-time serving comes primarily from the feature layer: feature engineering, feature stores, feature pipelines, and training-serving consistency. Remove the feature layer, and the architecture simplifies dramatically.
A relational foundation model like KumoRFM reads raw relational data directly. There is no feature engineering step. The model connects to your data warehouse or database, understands the schema through foreign keys, and generates predictions from the raw table structure using a graph transformer architecture.
For batch: point the model at your database, specify the prediction target, and score all entities. No feature pipeline to build, no feature store to maintain, no Airflow DAGs to debug.
For real-time: the model can serve predictions on individual entities as requests arrive. Because it reads directly from the data, there is no training-serving skew. The model sees the same data structure at training time and serving time.
The same model handles both modes. You are not maintaining two parallel architectures. One model, one data connection, two serving patterns.
Making the decision
Four questions determine whether a use case needs real-time predictions:
1. What is the decision window? If the decision happens within seconds of the triggering event (a transaction, a page load, an API call), you need real-time. If the decision happens hours or days later (a marketing campaign, a staffing plan, a risk report), batch is sufficient.
2. How fast does the input data change? If the features that drive the prediction change between scoring and action, you need real-time. A fraud model using "transactions in the last 5 minutes" is stale within minutes. A churn model using "purchases in the last 90 days" is fresh for days.
3. What is the value of freshness? Even when real-time is technically possible, the incremental accuracy gain may not justify the cost. If your batch churn model is 92% accurate and a real-time version would be 93%, the 1% improvement may not justify 5-10x the infrastructure spend.
4. Can you afford the operational complexity? Real-time ML requires 24/7 monitoring, on-call rotations, and rapid incident response. If your team is already stretched maintaining batch pipelines, adding real-time is likely to create more problems than it solves.
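The four questions above can be folded into a single illustrative helper. The inputs, thresholds, and ratios here are assumptions for the sake of the sketch, not a formal framework.

```python
def needs_realtime(decision_window_s, feature_halflife_s,
                   accuracy_gain_pct, cost_multiplier, team_has_oncall):
    # Q1: only seconds-scale decision windows force real-time.
    seconds_scale = decision_window_s < 60
    # Q2: features must go stale quickly for freshness to matter
    #     (illustrative threshold: meaningful change within an hour).
    data_changes_fast = feature_halflife_s < 3600
    # Q3: the accuracy gain must justify the cost multiplier
    #     (illustrative ratio: >0.1 points of accuracy per x of cost).
    worth_it = accuracy_gain_pct / cost_multiplier > 0.1
    # Q4: no on-call capacity, no real-time.
    return (seconds_scale and data_changes_fast
            and worth_it and team_has_oncall)
```

A fraud use case (sub-second window, minutes-scale feature freshness, large accuracy uplift) passes all four gates; a churn model (days-scale window, 90-day features) fails at the first.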
Start with batch. Prove the model delivers value. Then identify the specific use cases where freshness creates measurable uplift, and invest in real-time serving only for those. This sequence reduces risk, controls cost, and ensures that your real-time infrastructure investment is justified by demonstrated business impact.