
Kumo vs Building In-House with XGBoost

Most enterprise ML teams default to XGBoost or LightGBM on hand-crafted features. This approach requires 3-5 data scientists, 6-12 weeks per model, $50K-$1M per use case - and 53-88% of models never reach production. Kumo eliminates the entire pipeline.

TL;DR

  • On the SAP SALT enterprise benchmark, KumoRFM scores 91% accuracy vs 75% for PhD data scientists with XGBoost and 63% for LLM+AutoML - with zero feature engineering and zero training time.
  • The in-house XGBoost approach requires 3-5 data scientists, 6-12 weeks per model, and $50K-$1M per use case. Gartner and IDC estimate that 53-88% of these models never reach production - not because XGBoost is a bad algorithm, but because the surrounding pipeline is too expensive and fragile to sustain.
  • The 'flattening loss' is real and measurable: when 5+ tables are joined into 1 flat table for XGBoost, multi-hop relational signals disappear. On RelBench, LightGBM with manual features scores 62.44 AUROC vs KumoRFM zero-shot at 76.71 - a 14+ point gap.
  • Each in-house model accumulates 30% annual maintenance cost (Dimension Research). With 20 models, that is $300K-$600K/year in maintenance alone - before building anything new.
  • Over 5 years with 10 models, the in-house approach costs $6.5M-$11.5M. Kumo costs $800K-$1.2M. The gap widens every year as maintenance compounds.

XGBoost is the default. When an enterprise ML team gets a new prediction task - churn, fraud, lead scoring, demand forecasting - the playbook is almost always the same: pull data from the warehouse, join tables, engineer features, train XGBoost or LightGBM, tune hyperparameters, build a serving pipeline, deploy, monitor. It works. It has worked for a decade. And it is ruinously expensive.

The cost is not in the algorithm. XGBoost itself is open-source, fast, and well-understood. The cost is in everything around it: the 3-5 data scientists who spend 6-12 weeks per model manually engineering features, the serving infrastructure that must be built and maintained, the feature pipelines that break when upstream schemas change, and the 30% annual maintenance cost per model that quietly consumes your team's capacity to build anything new.

The result is predictable. Gartner and IDC estimate that 53-88% of ML models never reach production. Not because the models are inaccurate - but because the pipeline surrounding them is too expensive, too slow, and too fragile to sustain at scale. Most enterprise teams manage 3-5 models per year. The backlog of prediction tasks that would deliver value sits untouched.

Kumo works differently. Instead of building a pipeline around a gradient boosted model, KumoRFM reads raw relational tables directly and discovers predictive patterns across the full relational structure. No feature engineering, no model training, no serving infrastructure. The entire pipeline that consumes 80% of your team's time and budget does not exist.

The headline result: SAP SALT benchmark

Before diving into detailed comparisons, here is the result that matters most. The SAP SALT benchmark is an enterprise-grade evaluation where real business analysts and data scientists attempt prediction tasks on SAP enterprise data. It measures how accurately different approaches predict real business outcomes (customer behavior, demand patterns, operational metrics) on production-quality enterprise databases with multiple related tables.

| Approach | Accuracy | What it means |
| --- | --- | --- |
| LLM + AutoML | 63% | Language model generates features, AutoML selects model |
| PhD Data Scientist + XGBoost | 75% | Expert spends weeks hand-crafting features, tunes XGBoost |
| KumoRFM (zero-shot) | 91% | No feature engineering, no training, reads relational tables directly |

SAP SALT benchmark: KumoRFM outperforms expert data scientists by 16 percentage points and LLM+AutoML by 28 percentage points. Zero feature engineering. Zero training. The model reads raw enterprise tables and predicts.

This is not a marginal improvement. KumoRFM scores 91% where PhD-level data scientists with weeks of feature engineering and hand-tuned XGBoost score 75%. The 16 percentage point gap is the value of reading relational data natively instead of flattening it into a single table.

| Dimension | In-House XGBoost/LightGBM | Kumo (KumoRFM) |
| --- | --- | --- |
| Team required | 3-5 data scientists + ML engineers | 1 ML engineer or analyst |
| Time to first model | 6-12 weeks per use case | ~1 second (zero-shot) to minutes (fine-tuned) |
| Cost per use case | $50K-$1M (labor + infrastructure) | Marginal cost near zero per additional task |
| Feature engineering | Manual - 12.3 hours and 878 lines of code per task (RelBench) | Automatic - model discovers features from relational structure |
| Multi-table handling | Manual SQL joins, flatten to one table, lose multi-hop signals | Native - reads multiple tables, preserves cross-table patterns |
| Accuracy on relational data | 62.44 AUROC (LightGBM + manual features, RelBench) | 76.71 AUROC zero-shot, 81.14 fine-tuned (RelBench) |
| Explainability | SHAP/feature importance on hand-crafted features only | Feature importance across all discovered relational patterns |
| Annual maintenance per model | 30% of build cost (Dimension Research) | Continuous retraining, no pipeline maintenance |
| Scale (models per year) | 3-5 models/year (pipeline-bottlenecked) | 50+ prediction tasks per quarter |
| Production success rate | 12-47% reach production (Gartner/IDC) | 100% - no pipeline to fail, predictions in seconds |

Head-to-head comparison across 10 dimensions. XGBoost is a strong algorithm. The problem is everything around it: the team, the timeline, the maintenance, and the features it never sees.

The PhD ceiling: why even great data scientists miss 83-96% of the feature space

The conventional wisdom is that better data scientists build better features. This is true - but it understates the problem. The feature space for a relational database with 5+ tables is combinatorially vast. Every table join, every aggregation window, every interaction between columns creates potential features. A database with 5 tables, 50 columns, and 3 time windows has millions of possible engineered features.

In practice, even PhD-level data scientists explore only 4-17% of this space. They test hypotheses they think of: "customers who contacted support recently might churn," "high-value orders predict retention." These are reasonable hypotheses, and many of them work. But they are limited to what a human brain can conceive. The unknown unknowns - the three-hop relational patterns, the subtle temporal correlations, the interaction effects across table boundaries - remain undiscovered.

This is not a criticism of data scientists. It is a mathematical reality. A team of 5 data scientists spending 8 weeks on feature engineering will produce 50-200 features. The relational structure contains millions of potential signals. No amount of hiring closes this gap - it is a limitation of the hypothesis-driven approach itself.
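The combinatorics can be made concrete with back-of-envelope arithmetic. The 50 columns and 3 time windows come from the example above; the set of aggregation functions is an assumed illustrative list, not an exhaustive one:

```python
# Back-of-envelope count of the engineered-feature space described above.
# 50 columns and 3 time windows are from the article; the aggregation
# list is an assumption for illustration.
import math

n_columns, n_windows = 50, 3
aggregations = ["count", "sum", "mean", "min", "max", "std"]

# Single-column aggregate features: column x aggregation x time window.
single = n_columns * len(aggregations) * n_windows

# Pairwise interactions (ratios, differences) between those aggregates.
pairs = math.comb(single, 2)

print(single)  # 900 base aggregate features
print(pairs)   # 404550 two-feature interactions, before any multi-hop joins
```

Even this simplified count yields hundreds of thousands of candidates from one table's columns alone; multiplying across join paths between 5+ tables pushes the space into the millions, of which a 50-200 feature set covers a vanishing fraction.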

The flattening loss: what disappears when you join 5 tables into 1

XGBoost requires a flat feature table: one row per entity, one column per feature. Enterprise data lives in relational databases: customers, orders, products, interactions, support tickets, payments - connected by foreign keys. To use XGBoost, someone must flatten this relational structure into a single table. This flattening is not just tedious. It is lossy.

When you join 5 tables into 1, multi-hop signals disappear. A customer's churn risk may depend on the satisfaction scores of other customers who bought the same products (customer → orders → products → other customers' reviews). This three-hop relationship cannot be captured in a flat table without explicitly engineering it - and you cannot engineer what you have not hypothesized.
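To see why such signals rarely survive, consider what manually engineering just one three-hop feature entails. The sketch below (table and column names invented for illustration) builds "mean review score from other customers who bought the same products" with pandas:

```python
# Sketch of hand-engineering ONE three-hop feature like the example above:
# "mean review score of OTHER customers who bought the same products".
# Table and column names are invented for illustration.
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "product_id":  [10, 11, 10, 12],
})
reviews = pd.DataFrame({
    "customer_id": [2, 3, 3],
    "product_id":  [10, 12, 10],
    "score":       [2.0, 5.0, 1.0],
})

# Hops 1-2: customer -> their products -> reviews of those products.
hops = orders.merge(reviews, on="product_id", suffixes=("", "_reviewer"))
# Keep only OTHER customers' reviews, not the customer's own.
hops = hops[hops["customer_id"] != hops["customer_id_reviewer"]]

# Hop 3: collapse back to one number per customer - a single flat feature.
feat = hops.groupby("customer_id")["score"].mean().rename("peer_review_score")
print(feat)
```

Every such feature must be hypothesized, coded, and validated individually - and this is one of the simpler three-hop paths in a realistic schema.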

Temporal sequences collapse into aggregates. A customer who made 5 purchases becomes purchase_count = 5. Whether those purchases were accelerating (1 per month, then 3 in a week) or declining (3 in the first month, then 2 over 5 months) disappears. Behavioral trajectories - the patterns that most strongly predict future behavior - are flattened into static snapshots.
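The collapse is easy to demonstrate. In this sketch (purchase timestamps invented for illustration, in days since signup), two customers with opposite trajectories become indistinguishable after aggregation:

```python
# Illustration of the flattening loss described above: two customers with
# identical aggregates but opposite behavioral trajectories.
# Timestamps (days since signup) are invented for illustration.
accelerating = [0, 30, 60, 85, 88]   # ~1/month, then 3 purchases in one week
declining    = [5, 12, 20, 90, 170]  # 3 in the first month, then 2 over months

def flatten(purchase_days):
    """What a typical flat feature table keeps: a single count."""
    return {"purchase_count": len(purchase_days)}

# After flattening, the two customers look the same...
assert flatten(accelerating) == flatten(declining)

# ...even though the inter-purchase gaps tell opposite stories.
def gaps(days):
    return [b - a for a, b in zip(days, days[1:])]

print(gaps(accelerating))  # shrinking gaps: buying faster
print(gaps(declining))     # growing gaps: disengaging
```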

| Signal type | In relational structure | After flattening for XGBoost | Information lost |
| --- | --- | --- | --- |
| Multi-hop relationships | Customer → orders → products → similar customers' outcomes | Not captured unless manually engineered | High - 3+ hop patterns almost never engineered |
| Temporal sequences | Full event timeline with timestamps | Aggregates: count, sum, avg over fixed windows | High - acceleration, deceleration, periodicity lost |
| Entity interactions | Which products bought together, which agents handled which tickets | Collapsed to per-entity aggregates | Medium - co-occurrence patterns lost |
| Graph neighborhood | What other entities share connections with this entity | Not represented | High - community and similarity signals lost |
| Conditional patterns | Behavior changes after specific events (price increase, support ticket) | Pre/post split only if manually engineered | Medium - event-conditional dynamics lost |

The flattening loss is not theoretical. On RelBench, the gap between LightGBM on flattened data (62.44 AUROC) and KumoRFM on raw relational data (76.71 AUROC) is 14+ points. That gap is the information destroyed by flattening.

In-house XGBoost workflow

  • Hire 3-5 data scientists ($150K-$250K each)
  • Data scientist writes SQL to join 5+ tables (1-2 weeks)
  • Data scientist engineers 50-200 features (2-4 weeks, 878 lines of code)
  • Train XGBoost, tune hyperparameters (1-2 weeks)
  • Build serving infrastructure (2-4 weeks)
  • Deploy, monitor, maintain (30% annual cost per model)
  • Repeat for next use case - same team, same timeline

Kumo workflow

  • Connect Kumo to your data warehouse (one-time setup)
  • Write a PQL query defining what you want to predict
  • KumoRFM reads raw tables, discovers features, returns predictions
  • Zero feature engineering, zero model training, zero serving infrastructure
  • Time to first prediction: ~1 second (zero-shot)
  • No feature pipeline to maintain - ever
  • Next use case: write another PQL query (minutes, not months)

The maintenance trap: 30% annual cost that compounds silently

Building the first model is expensive. Maintaining it is what kills your budget. Dimension Research found that each production ML model accumulates approximately 30% of its original build cost annually in maintenance: retraining on new data, fixing feature pipeline breaks, updating for schema changes, monitoring for drift, and revalidating after upstream data changes.

For a single model built at $100K, that is $30K per year in maintenance. Manageable. But enterprises do not have one model. With 10 models, maintenance alone is $150K-$300K per year. With 20 models, it is $300K-$600K per year - before building anything new. Your data science team becomes a maintenance team, spending 60-80% of their time keeping existing models alive rather than building new ones.
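The compounding is plain arithmetic. A minimal sketch, assuming a $50K-$100K build cost per model (the low end of the article's $50K-$1M range) and the 30% annual maintenance rate cited above:

```python
# Arithmetic behind the maintenance trap: each model accrues ~30% of its
# build cost per year (Dimension Research). The $50K-$100K per-model build
# cost is an assumption from the low end of the article's range.
MAINTENANCE_RATE = 0.30

def annual_maintenance(n_models, build_low, build_high):
    """Return the (low, high) annual maintenance bill for a model fleet."""
    return (n_models * build_low * MAINTENANCE_RATE,
            n_models * build_high * MAINTENANCE_RATE)

for n in (5, 10, 20, 50):
    low, high = annual_maintenance(n, 50_000, 100_000)
    print(f"{n:>2} models: ${low:>11,.0f} - ${high:>11,.0f} per year")
```

Note there is no term in this formula that ever shrinks: every model added raises the recurring bill permanently, which is why fleet size, not model quality, caps most teams.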

This is why most enterprise ML teams plateau at 3-5 production models. Not because they lack ideas for new models, but because the maintenance burden on existing models consumes all available capacity. The backlog of prediction tasks that would deliver business value grows every quarter, untouched.

| Number of models | Annual maintenance cost (in-house) | Team capacity consumed by maintenance | Annual maintenance cost (Kumo) |
| --- | --- | --- | --- |
| 5 models | $75K-$150K/year | 30-40% of team | ~$10K/year |
| 10 models | $150K-$300K/year | 50-60% of team | ~$20K/year |
| 20 models | $300K-$600K/year | 70-80% of team | ~$40K/year |
| 50 models | $750K-$1.5M/year | 100%+ (need more headcount) | ~$100K/year |

Highlighted: at 20 models, the in-house approach consumes 70-80% of your team's capacity in maintenance alone. With Kumo, there are no feature pipelines to maintain - the maintenance cost is platform monitoring only.

Benchmark results: RelBench

The RelBench benchmark provides an apples-to-apples comparison across 7 databases, 30 prediction tasks, and 103 million rows. These are real relational datasets from production-like schemas - not pre-flattened Kaggle tables - which is why the gap between approaches is so stark.

AUROC (Area Under the Receiver Operating Characteristic curve) measures how well a model distinguishes between positive and negative outcomes. An AUROC of 50 means random guessing. An AUROC of 100 means perfect prediction. In practice, moving from 65 to 77 AUROC is a significant improvement - it means the model correctly ranks a true positive above a true negative 77% of the time instead of 65%. For fraud detection, that difference can mean catching 40% more fraud with the same false positive rate. For churn prediction, it means identifying at-risk customers weeks earlier.
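This pairwise-ranking interpretation of AUROC can be checked directly in a few lines of Python (labels and scores invented for illustration):

```python
# AUROC as defined above: the fraction of (positive, negative) pairs in
# which the positive example receives the higher score.
# Labels and scores are invented for illustration.
import itertools

y_true  = [1, 1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.7, 0.4, 0.8, 0.3, 0.2, 0.1]

pos = [s for s, y in zip(y_score, y_true) if y == 1]
neg = [s for s, y in zip(y_score, y_true) if y == 0]

# Count correctly ranked pairs (ties would count as half a pair).
correct = sum(p > n for p, n in itertools.product(pos, neg))
auroc = correct / (len(pos) * len(neg))
print(round(auroc, 3))  # 0.833 - this model ranks 10 of 12 pairs correctly
```

On real data the count is over millions of pairs, but the meaning is the same: moving from 65 to 77 AUROC means 12 more of every 100 positive/negative pairs are ranked correctly.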

| Approach | AUROC | Feature engineering time | Lines of code | Production risk |
| --- | --- | --- | --- | --- |
| LightGBM + manual features | 62.44 | 12.3 hours per task | 878 | High - 53-88% never reach production |
| XGBoost + manual features | ~63-65 | 12.3 hours per task | 878 | High - same pipeline complexity |
| KumoRFM zero-shot | 76.71 | ~1 second | 0 | None - no pipeline to fail |
| KumoRFM fine-tuned | 81.14 | Minutes | 0 | None - continuous retraining built-in |

Highlighted: KumoRFM zero-shot outperforms in-house XGBoost/LightGBM by 14+ AUROC points with zero feature engineering and zero lines of code. The gap comes from patterns in the relational structure that a flat feature table destroys.

The 14+ AUROC point gap is not about XGBoost being a weak algorithm. XGBoost is excellent at what it does: building gradient boosted trees on tabular data. The gap is about what data XGBoost receives. After 5 tables are joined into 1, after temporal sequences are collapsed into aggregates, after multi-hop relationships are discarded - the remaining flat table simply does not contain the signals that predict the outcome. No amount of hyperparameter tuning or ensemble stacking recovers information that was destroyed before training began.

PQL Query

```
PREDICT churn_90d
FOR EACH customers.customer_id
WHERE customers.contract_value > 50000
```

One PQL query replaces the entire in-house pipeline: the SQL joins, the feature engineering code (878 lines), the XGBoost training, the hyperparameter tuning, the serving infrastructure, and the monitoring setup. KumoRFM reads the raw customers, orders, products, support_tickets, and payments tables directly.

Output

| customer_id | Churn prob. (Kumo) | Churn prob. (XGBoost) | Why Kumo differs |
| --- | --- | --- | --- |
| C-7201 | 0.91 | 0.64 | Kumo detects 3-hop pattern: similar accounts churned after same support escalation sequence |
| C-7202 | 0.14 | 0.38 | Kumo correctly lower: multi-department adoption increasing across 4 product lines |
| C-7203 | 0.88 | 0.52 | Kumo detects declining purchase frequency + negative sentiment in support interactions |
| C-7204 | 0.06 | 0.09 | Both correctly low: healthy, expanding account |

The cost comparison at scale

The accuracy gap matters. But for most enterprises, the cost gap is what changes the decision. Building and maintaining an in-house ML pipeline is not a one-time investment - it is a compounding annual commitment that grows with every model you deploy.

Total cost of ownership, Year 1 (10 prediction tasks)

| Cost dimension | In-House XGBoost | Kumo | Savings |
| --- | --- | --- | --- |
| Data science team | 4 FTEs ($600K-$1M) | 0.5 FTE ($75K) | $525K-$925K |
| Feature engineering labor | 123 hours per task × 10 tasks ($307K) | 0 hours | $307K |
| Infrastructure (training + serving) | $100K-$200K | Included in platform | $100K-$200K |
| Kumo platform license | N/A | $80K-$120K | N/A |
| Pipeline maintenance (30% per model) | $150K-$300K | $20K | $130K-$280K |
| Models reaching production | 1-5 of 10 (53-88% failure rate) | 10 of 10 | 5-9 additional models in production |
| Total Year 1 cost | $1.2M-$1.8M | $160K-$240K | ~85% savings |

Highlighted: Year 1 total cost comparison for 10 prediction tasks. The in-house approach costs 5-7x more - and delivers fewer models to production.

Cumulative cost over 5 years (10 models)

| Time horizon | In-House XGBoost (cumulative) | Kumo (cumulative) | Cumulative savings |
| --- | --- | --- | --- |
| Year 1 | $1.2M-$1.8M | $160K-$240K | $1.0M-$1.6M |
| Year 3 | $3.6M-$6.0M | $480K-$720K | $3.1M-$5.3M |
| Year 5 | $6.5M-$11.5M | $800K-$1.2M | $5.7M-$10.3M |

Highlighted: over 5 years, maintenance costs compound dramatically for the in-house approach. Each model adds 30% annual maintenance cost. The gap widens every year - by Year 5, the in-house approach costs 8-10x more than Kumo.

When to build in-house with XGBoost

XGBoost is a powerful, well-understood algorithm. There are genuine scenarios where building in-house makes sense:

  • Your data is already flat and well-understood. If your prediction task uses a single table with known, validated features, XGBoost will perform well. No relational joins, no flattening loss. The algorithm itself is fast and accurate on clean tabular data.
  • You need custom model architectures. If your problem requires specialized loss functions, custom constraints, or domain-specific model modifications that only a hand-built pipeline can provide, in-house development gives you full control.
  • It is a one-off analysis, not a production system. For research, exploration, or one-time analyses where production deployment is not needed, a Jupyter notebook with XGBoost is fast and efficient. The pipeline costs only matter for production systems.
  • Your team has deep domain expertise with known features. If your data scientists have spent years building domain-specific features and know exactly which signals matter, their expertise is valuable. XGBoost on expert-crafted features can be competitive for narrow, well-understood problems.

When to choose Kumo

Kumo solves the pipeline problem, not the algorithm problem. Choose Kumo when:

  • Your data lives in multiple relational tables. Customers, orders, products, interactions, support tickets - if your predictive signals span table boundaries, Kumo discovers them automatically. The in-house approach requires manually flattening them, losing multi-hop signals in the process.
  • You cannot afford 6-12 weeks per model. When business conditions change quarterly, a 12-week development cycle means your model is outdated before it ships. KumoRFM delivers predictions in seconds, not months.
  • You want to scale beyond 3-5 models. The in-house approach plateaus because maintenance consumes team capacity. Kumo has no feature pipelines to maintain - going from 5 to 50 prediction tasks is writing 45 more PQL queries, not hiring 10 more data scientists.
  • You are tired of the production failure rate. If more than half your models die before reaching production, the problem is not the algorithm - it is the pipeline. Eliminating the pipeline eliminates the primary failure mode.
  • You need maximum accuracy on relational data. The 14+ AUROC point gap between XGBoost on flattened data and KumoRFM on raw relational data translates directly to business outcomes: more fraud caught, fewer false positives, better-targeted campaigns, lower churn. The accuracy comes from signals that flattening destroys.

Frequently asked questions

What is the main difference between Kumo and building in-house with XGBoost?

Building in-house with XGBoost requires a team of 3-5 data scientists to manually engineer features from relational data, train and tune gradient boosted models, build serving infrastructure, and maintain pipelines. The typical cost is $50K-$1M per use case with a 6-12 week timeline. Kumo uses a relational foundation model (KumoRFM) that reads raw relational tables directly, discovers predictive patterns across multiple tables, and returns predictions in seconds - with zero feature engineering, zero model training, and zero pipeline code.

Why do most in-house XGBoost models fail to reach production?

Gartner and IDC estimate that 53-88% of ML models never reach production. The primary reasons are not model quality but pipeline complexity: feature engineering takes 6-12 weeks, serving infrastructure requires additional engineering, feature drift demands constant monitoring, and each model accumulates 30% annual maintenance cost. By the time an in-house model is production-ready, business requirements have often changed. Kumo eliminates these failure modes by removing the pipeline entirely.

How does XGBoost accuracy compare to KumoRFM on relational data?

On the RelBench benchmark (7 databases, 30 tasks, 103M rows), LightGBM with manually engineered features achieves 62.44 AUROC. KumoRFM zero-shot achieves 76.71 AUROC - a 14+ point gap. The accuracy difference comes from the 'flattening loss': when data scientists join multiple tables into one flat table for XGBoost, multi-hop relational signals disappear. Even the best data scientists only explore 4-17% of the possible feature space, while KumoRFM discovers patterns across the full relational structure.

When should I build in-house with XGBoost instead of using Kumo?

XGBoost is a strong choice when your data is already in a single flat table, when your team has deep domain expertise that translates directly into known features, when you need custom loss functions or model architectures for very specialized problems, or when your prediction task has no relational structure (pure tabular data with no foreign keys). XGBoost also makes sense for one-off analyses where production deployment is not required.

How much does it cost to build and maintain ML models in-house vs using Kumo?

For 10 production ML models over 3 years, the in-house XGBoost approach costs approximately $3.6M-$6.0M including the data science team, infrastructure, and 30% annual maintenance per model (Dimension Research). Kumo costs approximately $480K-$720K over the same period - roughly 85% less. The difference is driven by eliminating feature engineering labor, model training cycles, serving infrastructure, and pipeline maintenance.

Can I migrate existing XGBoost models to Kumo?

Yes. Because Kumo reads raw relational tables directly, migration does not require rebuilding feature pipelines. You connect Kumo to your data warehouse (Snowflake, BigQuery, Databricks), define your prediction tasks in PQL (Predictive Query Language), and get predictions immediately. The feature engineering code, training pipelines, and serving infrastructure you maintained for XGBoost become unnecessary. Many organizations run Kumo in parallel with existing models to validate accuracy before fully migrating.

See it in action

KumoRFM delivers predictions on relational data in seconds. No feature engineering, no ML pipelines. Try it free.