XGBoost is the default. When an enterprise ML team gets a new prediction task - churn, fraud, lead scoring, demand forecasting - the playbook is almost always the same: pull data from the warehouse, join tables, engineer features, train XGBoost or LightGBM, tune hyperparameters, build a serving pipeline, deploy, monitor. It works. It has worked for a decade. And it is ruinously expensive.
The cost is not in the algorithm. XGBoost itself is open-source, fast, and well-understood. The cost is in everything around it: the 3-5 data scientists who spend 6-12 weeks per model manually engineering features, the serving infrastructure that must be built and maintained, the feature pipelines that break when upstream schemas change, and the 30% annual maintenance cost per model that quietly consumes your team's capacity to build anything new.
The result is predictable. Gartner and IDC estimate that 53-88% of ML models never reach production. Not because the models are inaccurate - but because the pipeline surrounding them is too expensive, too slow, and too fragile to sustain at scale. Most enterprise teams ship only 3-5 models per year. The backlog of prediction tasks that would deliver value sits untouched.
Kumo works differently. Instead of building a pipeline around a gradient boosted model, KumoRFM reads raw relational tables directly and discovers predictive patterns across the full relational structure. No feature engineering, no model training, no serving infrastructure. The entire pipeline that consumes 80% of your team's time and budget does not exist.
The headline result: SAP SALT benchmark
Before diving into detailed comparisons, here is the result that matters most. The SAP SALT benchmark is an enterprise-grade evaluation where real business analysts and data scientists attempt prediction tasks on SAP enterprise data. It measures how accurately different approaches predict real business outcomes (customer behavior, demand patterns, operational metrics) on production-quality enterprise databases with multiple related tables.
sap_salt_enterprise_benchmark
| approach | accuracy | what_it_means |
|---|---|---|
| LLM + AutoML | 63% | Language model generates features, AutoML selects model |
| PhD Data Scientist + XGBoost | 75% | Expert spends weeks hand-crafting features, tunes XGBoost |
| KumoRFM (zero-shot) | 91% | No feature engineering, no training, reads relational tables directly |
SAP SALT benchmark: KumoRFM outperforms expert data scientists by 16 percentage points and LLM+AutoML by 28 percentage points. Zero feature engineering. Zero training. The model reads raw enterprise tables and predicts.
This is not a marginal improvement. KumoRFM scores 91% where PhD-level data scientists with weeks of feature engineering and hand-tuned XGBoost score 75%. The 16 percentage point gap is the value of reading relational data natively instead of flattening it into a single table.
kumo_vs_xgboost_in_house_comparison
| dimension | In-House XGBoost/LightGBM | Kumo (KumoRFM) |
|---|---|---|
| Team required | 3-5 data scientists + ML engineers | 1 ML engineer or analyst |
| Time to first model | 6-12 weeks per use case | ~1 second (zero-shot) to minutes (fine-tuned) |
| Cost per use case | $50K-$1M (labor + infrastructure) | Marginal cost near zero per additional task |
| Feature engineering | Manual - 12.3 hours and 878 lines of code per task (RelBench) | Automatic - model discovers features from relational structure |
| Multi-table handling | Manual SQL joins, flatten to one table, lose multi-hop signals | Native - reads multiple tables, preserves cross-table patterns |
| Accuracy on relational data | 62.44 AUROC (LightGBM + manual features, RelBench) | 76.71 AUROC zero-shot, 81.14 fine-tuned (RelBench) |
| Explainability | SHAP/feature importance on hand-crafted features only | Feature importance across all discovered relational patterns |
| Annual maintenance per model | 30% of build cost (Dimension Research) | Continuous retraining, no pipeline maintenance |
| Scale (models per year) | 3-5 models/year (pipeline-bottlenecked) | 50+ prediction tasks per quarter |
| Production success rate | 12-47% reach production (Gartner/IDC) | 100% - no pipeline to fail, predictions in seconds |
Head-to-head comparison across 10 dimensions. XGBoost is a strong algorithm. The problem is everything around it: the team, the timeline, the maintenance, and the features it never sees.
The PhD ceiling: why even great data scientists miss 83-96% of the feature space
The conventional wisdom is that better data scientists build better features. This is true - but it understates the problem. The feature space for a relational database with 5+ tables is combinatorially vast. Every table join, every aggregation window, every interaction between columns creates potential features. A database with 5 tables, 50 columns, and 3 time windows has millions of possible engineered features.
In practice, even PhD-level data scientists explore only 4-17% of this space. They test hypotheses they think of: "customers who contacted support recently might churn," "high-value orders predict retention." These are reasonable hypotheses, and many of them work. But they are limited to what a human brain can conceive. The unknown unknowns - the three-hop relational patterns, the subtle temporal correlations, the interaction effects across table boundaries - remain undiscovered.
This is not a criticism of data scientists. It is a mathematical reality. A team of 5 data scientists spending 8 weeks on feature engineering will produce 50-200 features. The relational structure contains millions of potential signals. No amount of hiring closes this gap - it is a limitation of the hypothesis-driven approach itself.
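The combinatorics behind that claim can be sketched with a back-of-the-envelope count. Every number below (join paths, aggregation functions, filter choices) is an illustrative assumption, not a measurement:

```python
from math import comb

# Back-of-the-envelope count of candidate engineered features for a
# 5-table, 50-column schema. All counts here are illustrative assumptions.
join_paths = 10      # distinct join paths through a 5-table schema (assumed)
columns = 50         # columns across the schema
agg_funcs = 6        # count, sum, mean, min, max, std
time_windows = 3     # e.g., 7 / 30 / 90 days
filters = 51         # optionally condition on one other column (or none)

base_features = join_paths * columns * agg_funcs * time_windows * filters
pairwise_interactions = comb(base_features, 2)  # ratios/differences of pairs

print(f"{base_features:,} base aggregate features")        # 459,000
print(f"{pairwise_interactions:,} pairwise interactions")

# A team hand-crafting 50-200 features explores a vanishing
# fraction of this space.
```

Even under these modest assumptions, the base aggregates alone number in the hundreds of thousands, and pairwise interactions push the space past anything a hypothesis-driven search can cover.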
The flattening loss: what disappears when you join 5 tables into 1
XGBoost requires a flat feature table: one row per entity, one column per feature. Enterprise data lives in relational databases: customers, orders, products, interactions, support tickets, payments - connected by foreign keys. To use XGBoost, someone must flatten this relational structure into a single table. This flattening is not just tedious. It is lossy.
When you join 5 tables into 1, multi-hop signals disappear. A customer's churn risk may depend on the satisfaction scores of other customers who bought the same products (customer → orders → products → other customers' reviews). This three-hop relationship cannot be captured in a flat table without explicitly engineering it - and you cannot engineer what you have not hypothesized.
Temporal sequences collapse into aggregates. A customer who made 5 purchases becomes purchase_count = 5. Whether those purchases were accelerating (1 per month, then 3 in a week) or declining (3 in the first month, then 2 over 5 months) disappears. Behavioral trajectories - the patterns that most strongly predict future behavior - are flattened into static snapshots.
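A minimal pandas sketch of this collapse, using hypothetical data: two customers with the same purchase count but opposite trajectories become indistinguishable once the timeline is aggregated:

```python
import pandas as pd

# Hypothetical event log: two customers, both with 5 purchases,
# but opposite trajectories (accelerating vs declining).
events = pd.DataFrame({
    "customer_id": ["A"] * 5 + ["B"] * 5,
    "purchase_date": pd.to_datetime(
        # A: accelerating - 1 early purchase, then 4 in the final week
        ["2024-01-05", "2024-05-01", "2024-05-03", "2024-05-05", "2024-05-07",
         # B: declining - 4 early purchases, then 1 months later
         "2024-01-02", "2024-01-09", "2024-01-16", "2024-01-23", "2024-05-07"]),
})

# Flattening for a tree model: collapse the timeline into aggregates.
flat = events.groupby("customer_id").agg(
    purchase_count=("purchase_date", "count"),
    first_purchase=("purchase_date", "min"),
    last_purchase=("purchase_date", "max"),
)
print(flat)
# Both rows show purchase_count = 5 with near-identical date ranges:
# the accelerating-vs-declining signal is gone.
```

Any fixed set of aggregates (counts, sums, first/last dates) has this property: once the rows are collapsed, no downstream model can recover the ordering of events.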
flattening_loss_quantified
| signal_type | in relational structure | after flattening for XGBoost | information_lost |
|---|---|---|---|
| Multi-hop relationships | Customer → orders → products → similar customers' outcomes | Not captured unless manually engineered | High - 3+ hop patterns almost never engineered |
| Temporal sequences | Full event timeline with timestamps | Aggregates: count, sum, avg over fixed windows | High - acceleration, deceleration, periodicity lost |
| Entity interactions | Which products bought together, which agents handled which tickets | Collapsed to per-entity aggregates | Medium - co-occurrence patterns lost |
| Graph neighborhood | What other entities share connections with this entity | Not represented | High - community and similarity signals lost |
| Conditional patterns | Behavior changes after specific events (price increase, support ticket) | Pre/post split only if manually engineered | Medium - event-conditional dynamics lost |
The flattening loss is not theoretical. On RelBench, the gap between LightGBM on flattened data (62.44 AUROC) and KumoRFM on raw relational data (76.71 AUROC) is 14+ points. That gap is the information destroyed by flattening.
In-house XGBoost workflow
- Hire 3-5 data scientists ($150K-$250K each)
- Data scientist writes SQL to join 5+ tables (1-2 weeks)
- Data scientist engineers 50-200 features (2-4 weeks, 878 lines of code)
- Train XGBoost, tune hyperparameters (1-2 weeks)
- Build serving infrastructure (2-4 weeks)
- Deploy, monitor, maintain (30% annual cost per model)
- Repeat for next use case - same team, same timeline
Kumo workflow
- Connect Kumo to your data warehouse (one-time setup)
- Write a PQL query defining what you want to predict
- KumoRFM reads raw tables, discovers features, returns predictions
- Zero feature engineering, zero model training, zero serving infrastructure
- Time to first prediction: ~1 second (zero-shot)
- No feature pipeline to maintain - ever
- Next use case: write another PQL query (minutes, not months)
The maintenance trap: 30% annual cost that compounds silently
Building the first model is expensive. Maintaining it is what kills your budget. Dimension Research found that each production ML model accumulates approximately 30% of its original build cost annually in maintenance: retraining on new data, fixing feature pipeline breaks, updating for schema changes, monitoring for drift, and revalidating after upstream data changes.
For a single model built at $100K, that is $30K per year in maintenance. Manageable. But enterprises do not have one model. With 10 models, maintenance alone is $300K-$600K per year. With 20 models, it is $600K-$1.2M per year - before building anything new. Your data science team becomes a maintenance team, spending 60-80% of their time keeping existing models alive rather than building new ones.
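The arithmetic above is simple enough to write down. A minimal cost-model sketch, assuming the Dimension Research figure of ~30% of build cost per model per year (the build costs themselves are the document's illustrative range):

```python
# Illustrative maintenance-cost model. The 30% rate follows the
# Dimension Research figure cited above; build costs are assumptions.
MAINTENANCE_RATE = 0.30

def annual_maintenance(n_models: int, build_cost_per_model: float) -> float:
    """Annual maintenance burden across a portfolio of production models."""
    return n_models * build_cost_per_model * MAINTENANCE_RATE

# 10 models built at $100K-$200K each:
low = annual_maintenance(10, 100_000)    # 300000.0
high = annual_maintenance(10, 200_000)   # 600000.0
print(f"${low:,.0f} - ${high:,.0f} per year")  # $300,000 - $600,000 per year
```

The linear form is the point: every new model permanently raises the annual floor, which is why portfolios plateau once maintenance absorbs the team.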
This is why most enterprise ML teams plateau at 3-5 production models. Not because they lack ideas for new models, but because the maintenance burden on existing models consumes all available capacity. The backlog of prediction tasks that would deliver business value grows every quarter, untouched.
maintenance_cost_with_model_count
| number_of_models | annual_maintenance_cost_in_house | team_capacity_consumed_by_maintenance | annual_maintenance_cost_kumo |
|---|---|---|---|
| 5 models | $75K-$150K/year | 30-40% of team | ~$10K/year |
| 10 models | $150K-$300K/year | 50-60% of team | ~$20K/year |
| 20 models | $300K-$600K/year | 70-80% of team | ~$40K/year |
| 50 models | $750K-$1.5M/year | 100%+ (need more headcount) | ~$100K/year |
Highlighted: at 20 models, the in-house approach consumes 70-80% of your team's capacity in maintenance alone. With Kumo, there are no feature pipelines to maintain - the maintenance cost is platform monitoring only.
Benchmark results: RelBench
The RelBench benchmark provides an apples-to-apples comparison across 7 databases, 30 prediction tasks, and 103 million rows. These are real relational datasets from production-like schemas - not pre-flattened Kaggle tables - which is why the gap between approaches is so stark.
AUROC (Area Under the Receiver Operating Characteristic curve) measures how well a model distinguishes between positive and negative outcomes. An AUROC of 50 means random guessing. An AUROC of 100 means perfect prediction. In practice, moving from 65 to 77 AUROC is a significant improvement - it means the model correctly ranks a true positive above a true negative 77% of the time instead of 65%. For fraud detection, that difference can mean catching 40% more fraud with the same false positive rate. For churn prediction, it means identifying at-risk customers weeks earlier.
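To make the pairwise-ranking interpretation concrete, here is a small illustrative check with synthetic scores (assuming scikit-learn is available):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Tiny synthetic example: 1 = churned, 0 = retained.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_score = np.array([0.9, 0.7, 0.4, 0.8, 0.3, 0.2, 0.1, 0.05])

auc = roc_auc_score(y_true, y_score)

# AUROC is the probability that a randomly chosen positive is scored
# above a randomly chosen negative - verify by brute force over pairs:
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]

print(auc, np.mean(pairs))  # the two numbers agree
```

Here the model mis-ranks 2 of the 15 positive/negative pairs, giving 13/15 ≈ 0.87 - the same pair-counting that makes a move from 65 to 77 AUROC meaningful at scale.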
relbench_benchmark_results
| approach | AUROC | feature_engineering_time | lines_of_code | production_risk |
|---|---|---|---|---|
| LightGBM + manual features | 62.44 | 12.3 hours per task | 878 | High - 53-88% never reach production |
| XGBoost + manual features | ~63-65 | 12.3 hours per task | 878 | High - same pipeline complexity |
| KumoRFM zero-shot | 76.71 | ~1 second | 0 | None - no pipeline to fail |
| KumoRFM fine-tuned | 81.14 | Minutes | 0 | None - continuous retraining built-in |
Highlighted: KumoRFM zero-shot outperforms in-house XGBoost/LightGBM by 14+ AUROC points with zero feature engineering and zero lines of code. The gap comes from patterns in the relational structure that a flat feature table destroys.
The 14+ AUROC point gap is not about XGBoost being a weak algorithm. XGBoost is excellent at what it does: building gradient boosted trees on tabular data. The gap is about what data XGBoost receives. After 5 tables are joined into 1, after temporal sequences are collapsed into aggregates, after multi-hop relationships are discarded - the remaining flat table simply does not contain the signals that predict the outcome. No amount of hyperparameter tuning or ensemble stacking recovers information that was destroyed before training began.
PQL Query
PREDICT churn_90d FOR EACH customers.customer_id WHERE customers.contract_value > 50000
One PQL query replaces the entire in-house pipeline: the SQL joins, the feature engineering code (878 lines), the XGBoost training, the hyperparameter tuning, the serving infrastructure, and the monitoring setup. KumoRFM reads the raw customers, orders, products, support_tickets, and payments tables directly.
Output
| customer_id | churn_prob_kumo | churn_prob_xgboost | why_kumo_differs |
|---|---|---|---|
| C-7201 | 0.91 | 0.64 | Kumo detects 3-hop pattern: similar accounts churned after same support escalation sequence |
| C-7202 | 0.14 | 0.38 | Kumo correctly lower: multi-department adoption increasing across 4 product lines |
| C-7203 | 0.88 | 0.52 | Kumo detects declining purchase frequency + negative sentiment in support interactions |
| C-7204 | 0.06 | 0.09 | Both correctly low: healthy, expanding account |
The cost comparison at scale
The accuracy gap matters. But for most enterprises, the cost gap is what changes the decision. Building and maintaining an in-house ML pipeline is not a one-time investment - it is a compounding annual commitment that grows with every model you deploy.
total_cost_of_ownership_year_1 (10 prediction tasks)
| cost_dimension | In-House XGBoost | Kumo | savings |
|---|---|---|---|
| Data science team | 4 FTEs ($600K-$1M) | 0.5 FTE ($75K) | $525K-$925K |
| Feature engineering labor | 123 hours per task × 10 tasks ($307K) | 0 hours | $307K |
| Infrastructure (training + serving) | $100K-$200K | Included in platform | $100K-$200K |
| Kumo platform license | N/A | $80K-$120K | N/A |
| Pipeline maintenance (30% per model) | $150K-$300K | $20K | $130K-$280K |
| Models reaching production | 5-7 of 10 (53-88% failure rate) | 10 of 10 | 3-5 additional models in production |
| Total Year 1 cost | $1.2M-$1.8M | $160K-$240K | ~85% savings |
Highlighted: Year 1 total cost comparison for 10 prediction tasks. The in-house approach costs 5-7x more - and delivers fewer models to production.
cumulative_cost_comparison_over_5_years (10 models)
| time_horizon | In-House XGBoost (cumulative) | Kumo (cumulative) | cumulative_savings |
|---|---|---|---|
| Year 1 | $1.2M-$1.8M | $160K-$240K | $1.0M-$1.6M |
| Year 3 | $3.6M-$6.0M | $480K-$720K | $3.1M-$5.3M |
| Year 5 | $6.5M-$11.5M | $800K-$1.2M | $5.7M-$10.3M |
Highlighted: over 5 years, maintenance costs compound dramatically for the in-house approach. Each model adds 30% annual maintenance cost. The gap widens every year - by Year 5, the in-house approach costs 8-10x more than Kumo.
When to build in-house with XGBoost
XGBoost is a powerful, well-understood algorithm. There are genuine scenarios where building in-house makes sense:
- Your data is already flat and well-understood. If your prediction task uses a single table with known, validated features, XGBoost will perform well. No relational joins, no flattening loss. The algorithm itself is fast and accurate on clean tabular data.
- You need custom model architectures. If your problem requires specialized loss functions, custom constraints, or domain-specific model modifications that only a hand-built pipeline can provide, in-house development gives you full control.
- It is a one-off analysis, not a production system. For research, exploration, or one-time analyses where production deployment is not needed, a Jupyter notebook with XGBoost is fast and efficient. The pipeline costs only matter for production systems.
- Your team has deep domain expertise with known features. If your data scientists have spent years building domain-specific features and know exactly which signals matter, their expertise is valuable. XGBoost on expert-crafted features can be competitive for narrow, well-understood problems.
When to choose Kumo
Kumo solves the pipeline problem, not the algorithm problem. Choose Kumo when:
- Your data lives in multiple relational tables. Customers, orders, products, interactions, support tickets - if your predictive signals span table boundaries, Kumo discovers them automatically. The in-house approach requires manually flattening them, losing multi-hop signals in the process.
- You cannot afford 6-12 weeks per model. When business conditions change quarterly, a 12-week development cycle means your model is outdated before it ships. KumoRFM delivers predictions in seconds, not months.
- You want to scale beyond 3-5 models. The in-house approach plateaus because maintenance consumes team capacity. Kumo has no feature pipelines to maintain - going from 5 to 50 prediction tasks is writing 45 more PQL queries, not hiring 10 more data scientists.
- You are tired of the production failure rate. If more than half your models die before reaching production, the problem is not the algorithm - it is the pipeline. Eliminating the pipeline eliminates the primary failure mode.
- You need maximum accuracy on relational data. The 14+ AUROC point gap between XGBoost on flattened data and KumoRFM on raw relational data translates directly to business outcomes: more fraud caught, fewer false positives, better-targeted campaigns, lower churn. The accuracy comes from signals that flattening destroys.