XGBoost is the default. When an enterprise ML team gets a new prediction task - churn, fraud, lead scoring, demand forecasting - the playbook is almost always the same: pull data from the warehouse, join tables, engineer features, train XGBoost or LightGBM, tune hyperparameters, build a serving pipeline, deploy, monitor. It works. It has worked for a decade. And it is ruinously expensive.
The cost is not in the algorithm. XGBoost itself is open-source, fast, and well-understood. The cost is in everything around it: the 3-5 data scientists who spend 6-12 weeks per model manually engineering features, the serving infrastructure that must be built and maintained, the feature pipelines that break when upstream schemas change, and the 30% annual maintenance cost per model that quietly consumes your team's capacity to build anything new.
The result is predictable. Gartner and IDC estimate that 53-88% of ML models never reach production. Not because the models are inaccurate - but because the pipeline surrounding them is too expensive, too slow, and too fragile to sustain at scale. Most enterprise teams ship only 3-5 models per year. The backlog of prediction tasks that would deliver value sits untouched.
Kumo works differently. Instead of building a pipeline around a gradient boosted model, KumoRFM reads raw relational tables directly and discovers predictive patterns across the full relational structure. No feature engineering, no model training, no serving infrastructure. The entire pipeline that consumes 80% of your team's time and budget does not exist.
The headline result: SAP SALT benchmark
Before diving into detailed comparisons, here is the result that matters most. The SAP SALT benchmark is an enterprise-grade evaluation where real business analysts and data scientists attempt prediction tasks on SAP enterprise data. It measures how accurately different approaches predict real business outcomes (customer behavior, demand patterns, operational metrics) on production-quality enterprise databases with multiple related tables.
sap_salt_enterprise_benchmark
| approach | accuracy | what_it_means |
|---|---|---|
| LLM + AutoML | 63% | Language model generates features, AutoML selects model |
| PhD Data Scientist + XGBoost | 75% | Expert spends weeks hand-crafting features, tunes XGBoost |
| KumoRFM (zero-shot) | 91% | No feature engineering, no training, reads relational tables directly |
SAP SALT benchmark: KumoRFM outperforms expert data scientists by 16 percentage points and LLM+AutoML by 28 percentage points. Zero feature engineering. Zero training. The model reads raw enterprise tables and predicts.
This is not a marginal improvement. KumoRFM scores 91% where PhD-level data scientists with weeks of feature engineering and hand-tuned XGBoost score 75%. The 16 percentage point gap is the value of reading relational data natively instead of flattening it into a single table.
kumo_vs_xgboost_in_house_comparison
| dimension | In-House XGBoost/LightGBM | Kumo (KumoRFM) |
|---|---|---|
| Team required | 3-5 data scientists + ML engineers | 1 ML engineer or analyst |
| Time to first model | 6-12 weeks per use case | ~1 second (zero-shot) to minutes (fine-tuned) |
| Cost per use case | $50K-$1M (labor + infrastructure) | Marginal cost near zero per additional task |
| Feature engineering | Manual - 12.3 hours and 878 lines of code per task (RelBench) | Automatic - model discovers features from relational structure |
| Multi-table handling | Manual SQL joins, flatten to one table, lose multi-hop signals | Native - reads multiple tables, preserves cross-table patterns |
| Accuracy on relational data | 62.44 AUROC (LightGBM + manual features, RelBench) | 76.71 AUROC zero-shot, 81.14 fine-tuned (RelBench) |
| Explainability | SHAP/feature importance on hand-crafted features only | Feature importance across all discovered relational patterns |
| Annual maintenance per model | 30% of build cost (Dimension Research) | Continuous retraining, no pipeline maintenance |
| Scale (models per year) | 3-5 models/year (pipeline-bottlenecked) | 50+ prediction tasks per quarter |
| Production success rate | 12-47% reach production (Gartner/IDC) | 100% - no pipeline to fail, predictions in seconds |
Head-to-head comparison across 10 dimensions. XGBoost is a strong algorithm. The problem is everything around it: the team, the timeline, the maintenance, and the features it never sees.
The PhD ceiling: why even great data scientists miss 83-96% of the feature space
The conventional wisdom is that better data scientists build better features. This is true - but it understates the problem. The feature space for a relational database with 5+ tables is combinatorially vast. Every table join, every aggregation window, every interaction between columns creates potential features. A database with 5 tables, 50 columns, and 3 time windows has millions of possible engineered features.
In practice, even PhD-level data scientists explore only 4-17% of this space. They test hypotheses they think of: "customers who contacted support recently might churn," "high-value orders predict retention." These are reasonable hypotheses, and many of them work. But they are limited to what a human brain can conceive. The unknown unknowns - the three-hop relational patterns, the subtle temporal correlations, the interaction effects across table boundaries - remain undiscovered.
This is not a criticism of data scientists. It is a mathematical reality. A team of 5 data scientists spending 8 weeks on feature engineering will produce 50-200 features. The relational structure contains millions of potential signals. No amount of hiring closes this gap - it is a limitation of the hypothesis-driven approach itself.
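The combinatorics behind that claim can be sketched with a back-of-the-envelope count. Every number below (join paths, aggregation functions, filter choices) is an illustrative assumption, not a measurement:

```python
from math import comb

# Back-of-the-envelope count of candidate engineered features for a
# 5-table, 50-column schema. All counts here are illustrative assumptions.
join_paths = 10      # distinct join paths through a 5-table schema (assumed)
columns = 50         # columns across the schema
agg_funcs = 6        # count, sum, mean, min, max, std
time_windows = 3     # e.g., 7 / 30 / 90 days
filters = 51         # optionally condition on one other column (or none)

base_features = join_paths * columns * agg_funcs * time_windows * filters
pairwise_interactions = comb(base_features, 2)  # ratios/differences of pairs

print(f"{base_features:,} base aggregate features")        # 459,000
print(f"{pairwise_interactions:,} pairwise interactions")

# A team hand-crafting 50-200 features explores a vanishing
# fraction of this space.
```

Even under these modest assumptions, the base aggregates alone number in the hundreds of thousands, and pairwise interactions push the space past anything a hypothesis-driven search can cover.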
The flattening loss: what disappears when you join 5 tables into 1
XGBoost requires a flat feature table: one row per entity, one column per feature. Enterprise data lives in relational databases: customers, orders, products, interactions, support tickets, payments - connected by foreign keys. To use XGBoost, someone must flatten this relational structure into a single table. This flattening is not just tedious. It is lossy.
When you join 5 tables into 1, multi-hop signals disappear. A customer's churn risk may depend on the satisfaction scores of other customers who bought the same products (customer → orders → products → other customers' reviews). This three-hop relationship cannot be captured in a flat table without explicitly engineering it - and you cannot engineer what you have not hypothesized.
Temporal sequences collapse into aggregates. A customer who made 5 purchases becomes purchase_count = 5. Whether those purchases were accelerating (1 per month, then 3 in a week) or declining (3 in the first month, then 2 over 5 months) disappears. Behavioral trajectories - the patterns that most strongly predict future behavior - are flattened into static snapshots.
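A minimal pandas sketch of this collapse, using hypothetical data: two customers with the same purchase count but opposite trajectories become indistinguishable once the timeline is aggregated:

```python
import pandas as pd

# Hypothetical event log: two customers, both with 5 purchases,
# but opposite trajectories (accelerating vs declining).
events = pd.DataFrame({
    "customer_id": ["A"] * 5 + ["B"] * 5,
    "purchase_date": pd.to_datetime(
        # A: accelerating - 1 early purchase, then 4 in the final week
        ["2024-01-05", "2024-05-01", "2024-05-03", "2024-05-05", "2024-05-07",
         # B: declining - 4 early purchases, then 1 months later
         "2024-01-02", "2024-01-09", "2024-01-16", "2024-01-23", "2024-05-07"]),
})

# Flattening for a tree model: collapse the timeline into aggregates.
flat = events.groupby("customer_id").agg(
    purchase_count=("purchase_date", "count"),
    first_purchase=("purchase_date", "min"),
    last_purchase=("purchase_date", "max"),
)
print(flat)
# Both rows show purchase_count = 5 with near-identical date ranges:
# the accelerating-vs-declining signal is gone.
```

Any fixed set of aggregates (counts, sums, first/last dates) has this property: once the rows are collapsed, no downstream model can recover the ordering of events.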
flattening_loss_quantified
| signal_type | in relational structure | after flattening for XGBoost | information_lost |
|---|---|---|---|
| Multi-hop relationships | Customer → orders → products → similar customers' outcomes | Not captured unless manually engineered | High - 3+ hop patterns almost never engineered |
| Temporal sequences | Full event timeline with timestamps | Aggregates: count, sum, avg over fixed windows | High - acceleration, deceleration, periodicity lost |
| Entity interactions | Which products bought together, which agents handled which tickets | Collapsed to per-entity aggregates | Medium - co-occurrence patterns lost |
| Graph neighborhood | What other entities share connections with this entity | Not represented | High - community and similarity signals lost |
| Conditional patterns | Behavior changes after specific events (price increase, support ticket) | Pre/post split only if manually engineered | Medium - event-conditional dynamics lost |
The flattening loss is not theoretical. On RelBench, the gap between LightGBM on flattened data (62.44 AUROC) and KumoRFM on raw relational data (76.71 AUROC) is 14+ points. That gap is the information destroyed by flattening.
In-house XGBoost workflow
- Hire 3-5 data scientists ($150K-$250K each)
- Data scientist writes SQL to join 5+ tables (1-2 weeks)
- Data scientist engineers 50-200 features (2-4 weeks, 878 lines of code)
- Train XGBoost, tune hyperparameters (1-2 weeks)
- Build serving infrastructure (2-4 weeks)
- Deploy, monitor, maintain (30% annual cost per model)
- Repeat for next use case - same team, same timeline
Kumo workflow
- Connect Kumo to your data warehouse (one-time setup)
- Write a PQL query defining what you want to predict
- KumoRFM reads raw tables, discovers features, returns predictions
- Zero feature engineering, zero model training, zero serving infrastructure
- Time to first prediction: ~1 second (zero-shot)
- No feature pipeline to maintain - ever
- Next use case: write another PQL query (minutes, not months)
The maintenance trap: 30% annual cost that compounds silently
Building the first model is expensive. Maintaining it is what kills your budget. Dimension Research found that each production ML model accumulates approximately 30% of its original build cost annually in maintenance: retraining on new data, fixing feature pipeline breaks, updating for schema changes, monitoring for drift, and revalidating after upstream data changes.
For a single model built at $100K, that is $30K per year in maintenance. Manageable. But enterprises do not have one model. With 10 models, maintenance alone is $300K-$600K per year. With 20 models, it is $600K-$1.2M per year - before building anything new. Your data science team becomes a maintenance team, spending 60-80% of their time keeping existing models alive rather than building new ones.
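The arithmetic above is simple enough to write down. A minimal cost-model sketch, assuming the Dimension Research figure of ~30% of build cost per model per year (the build costs themselves are the document's illustrative range):

```python
# Illustrative maintenance-cost model. The 30% rate follows the
# Dimension Research figure cited above; build costs are assumptions.
MAINTENANCE_RATE = 0.30

def annual_maintenance(n_models: int, build_cost_per_model: float) -> float:
    """Annual maintenance burden across a portfolio of production models."""
    return n_models * build_cost_per_model * MAINTENANCE_RATE

# 10 models built at $100K-$200K each:
low = annual_maintenance(10, 100_000)    # 300000.0
high = annual_maintenance(10, 200_000)   # 600000.0
print(f"${low:,.0f} - ${high:,.0f} per year")  # $300,000 - $600,000 per year
```

The linear form is the point: every new model permanently raises the annual floor, which is why portfolios plateau once maintenance absorbs the team.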
This is why most enterprise ML teams plateau at 3-5 production models. Not because they lack ideas for new models, but because the maintenance burden on existing models consumes all available capacity. The backlog of prediction tasks that would deliver business value grows every quarter, untouched.
maintenance_cost_with_model_count
| number_of_models | annual_maintenance_cost_in_house | team_capacity_consumed_by_maintenance | annual_maintenance_cost_kumo |
|---|---|---|---|
| 5 models | $75K-$150K/year | 30-40% of team | ~$10K/year |
| 10 models | $150K-$300K/year | 50-60% of team | ~$20K/year |
| 20 models | $300K-$600K/year | 70-80% of team | ~$40K/year |
| 50 models | $750K-$1.5M/year | 100%+ (need more headcount) | ~$100K/year |
Highlighted: at 20 models, the in-house approach consumes 70-80% of your team's capacity in maintenance alone. With Kumo, there are no feature pipelines to maintain - the maintenance cost is platform monitoring only.
Benchmark results: RelBench
The RelBench benchmark provides an apples-to-apples comparison across 7 databases, 30 prediction tasks, and 103 million rows. These are real relational datasets from production-like schemas - not pre-flattened Kaggle tables - which is why the gap between approaches is so stark.
AUROC (Area Under the Receiver Operating Characteristic curve) measures how well a model distinguishes between positive and negative outcomes. An AUROC of 50 means random guessing. An AUROC of 100 means perfect prediction. In practice, moving from 65 to 77 AUROC is a significant improvement - it means the model correctly ranks a true positive above a true negative 77% of the time instead of 65%. For fraud detection, that difference can mean catching 40% more fraud with the same false positive rate. For churn prediction, it means identifying at-risk customers weeks earlier.
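To make the pairwise-ranking interpretation concrete, here is a small illustrative check with synthetic scores (assuming scikit-learn is available):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Tiny synthetic example: 1 = churned, 0 = retained.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_score = np.array([0.9, 0.7, 0.4, 0.8, 0.3, 0.2, 0.1, 0.05])

auc = roc_auc_score(y_true, y_score)

# AUROC is the probability that a randomly chosen positive is scored
# above a randomly chosen negative - verify by brute force over pairs:
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]

print(auc, np.mean(pairs))  # the two numbers agree
```

Here the model mis-ranks 2 of the 15 positive/negative pairs, giving 13/15 ≈ 0.87 - the same pair-counting that makes a move from 65 to 77 AUROC meaningful at scale.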
relbench_benchmark_results
| approach | AUROC | feature_engineering_time | lines_of_code | production_risk |
|---|---|---|---|---|
| LightGBM + manual features | 62.44 | 12.3 hours per task | 878 | High - 53-88% never reach production |
| XGBoost + manual features | ~63-65 | 12.3 hours per task | 878 | High - same pipeline complexity |
| KumoRFM zero-shot | 76.71 | ~1 second | 0 | None - no pipeline to fail |
| KumoRFM fine-tuned | 81.14 | Minutes | 0 | None - continuous retraining built-in |
Highlighted: KumoRFM zero-shot outperforms in-house XGBoost/LightGBM by 14+ AUROC points with zero feature engineering and zero lines of code. The gap comes from patterns in the relational structure that a flat feature table destroys.
The 14+ AUROC point gap is not about XGBoost being a weak algorithm. XGBoost is excellent at what it does: building gradient boosted trees on tabular data. The gap is about what data XGBoost receives. After 5 tables are joined into 1, after temporal sequences are collapsed into aggregates, after multi-hop relationships are discarded - the remaining flat table simply does not contain the signals that predict the outcome. No amount of hyperparameter tuning or ensemble stacking recovers information that was destroyed before training began.
PQL Query
PREDICT churn_90d FOR EACH customers.customer_id WHERE customers.contract_value > 50000
One PQL query replaces the entire in-house pipeline: the SQL joins, the feature engineering code (878 lines), the XGBoost training, the hyperparameter tuning, the serving infrastructure, and the monitoring setup. KumoRFM reads the raw customers, orders, products, support_tickets, and payments tables directly.
Output
| customer_id | churn_prob_kumo | churn_prob_xgboost | why_kumo_differs |
|---|---|---|---|
| C-7201 | 0.91 | 0.64 | Kumo detects 3-hop pattern: similar accounts churned after same support escalation sequence |
| C-7202 | 0.14 | 0.38 | Kumo correctly lower: multi-department adoption increasing across 4 product lines |
| C-7203 | 0.88 | 0.52 | Kumo detects declining purchase frequency + negative sentiment in support interactions |
| C-7204 | 0.06 | 0.09 | Both correctly low: healthy, expanding account |
The cost comparison at scale
The accuracy gap matters. But for most enterprises, the cost gap is what changes the decision. Building and maintaining an in-house ML pipeline is not a one-time investment - it is a compounding annual commitment that grows with every model you deploy.
total_cost_of_ownership_year_1 (10 prediction tasks)
| cost_dimension | In-House XGBoost | Kumo | savings |
|---|---|---|---|
| Data science team | 4 FTEs ($600K-$1M) | 0.5 FTE ($75K) | $525K-$925K |
| Feature engineering labor | 123 hours per task × 10 tasks ($307K) | 0 hours | $307K |
| Infrastructure (training + serving) | $100K-$200K | Included in platform | $100K-$200K |
| Kumo platform license | N/A | $80K-$120K | N/A |
| Pipeline maintenance (30% per model) | $150K-$300K | $20K | $130K-$280K |
| Models reaching production | 5-7 of 10 (53-88% failure rate) | 10 of 10 | 3-5 additional models in production |
| Total Year 1 cost | $1.2M-$1.8M | $160K-$240K | ~85% savings |
Highlighted: Year 1 total cost comparison for 10 prediction tasks. The in-house approach costs 5-7x more - and delivers fewer models to production.
cumulative_cost_comparison_over_5_years (10 models)
| time_horizon | In-House XGBoost (cumulative) | Kumo (cumulative) | cumulative_savings |
|---|---|---|---|
| Year 1 | $1.2M-$1.8M | $160K-$240K | $1.0M-$1.6M |
| Year 3 | $3.6M-$6.0M | $480K-$720K | $3.1M-$5.3M |
| Year 5 | $6.5M-$11.5M | $800K-$1.2M | $5.7M-$10.3M |
Highlighted: over 5 years, maintenance costs compound dramatically for the in-house approach. Each model adds 30% annual maintenance cost. The gap widens every year - by Year 5, the in-house approach costs 8-10x more than Kumo.
When to build in-house with XGBoost
XGBoost is a powerful, well-understood algorithm. There are genuine scenarios where building in-house makes sense:
- Your data is already flat and well-understood. If your prediction task uses a single table with known, validated features, XGBoost will perform well. No relational joins, no flattening loss. The algorithm itself is fast and accurate on clean tabular data.
- You need custom model architectures. If your problem requires specialized loss functions, custom constraints, or domain-specific model modifications that only a hand-built pipeline can provide, in-house development gives you full control.
- It is a one-off analysis, not a production system. For research, exploration, or one-time analyses where production deployment is not needed, a Jupyter notebook with XGBoost is fast and efficient. The pipeline costs only matter for production systems.
- Your team has deep domain expertise with known features. If your data scientists have spent years building domain-specific features and know exactly which signals matter, their expertise is valuable. XGBoost on expert-crafted features can be competitive for narrow, well-understood problems.
When to choose Kumo
Kumo solves the pipeline problem, not the algorithm problem. Choose Kumo when:
- Your data lives in multiple relational tables. Customers, orders, products, interactions, support tickets - if your predictive signals span table boundaries, Kumo discovers them automatically. The in-house approach requires manually flattening them, losing multi-hop signals in the process.
- You cannot afford 6-12 weeks per model. When business conditions change quarterly, a 12-week development cycle means your model is outdated before it ships. KumoRFM delivers predictions in seconds, not months.
- You want to scale beyond 3-5 models. The in-house approach plateaus because maintenance consumes team capacity. Kumo has no feature pipelines to maintain - going from 5 to 50 prediction tasks is writing 45 more PQL queries, not hiring 10 more data scientists.
- You are tired of the production failure rate. If more than half your models die before reaching production, the problem is not the algorithm - it is the pipeline. Eliminating the pipeline eliminates the primary failure mode.
- You need maximum accuracy on relational data. The 14+ AUROC point gap between XGBoost on flattened data and KumoRFM on raw relational data translates directly to business outcomes: more fraud caught, fewer false positives, better-targeted campaigns, lower churn. The accuracy comes from signals that flattening destroys.