Every few years, a new paradigm arrives and the discourse swings to extremes. LLMs will replace everything. Traditional ML is dead. The truth is always more specific.
Foundation models for structured data are real and they are good. KumoRFM zero-shot outperforms manually-engineered LightGBM on the RelBench benchmark by 14 AUROC points. But traditional ML is not dead. There are cases where a well-tuned XGBoost model on a carefully engineered feature table is the right answer.
The question is not which is better in the abstract. It is which is better for your specific data, team, and problem. This article walks through the actual differences with real numbers so you can make that call.
RelBench benchmark results
| approach | AUROC (classification) | time_per_task | feature_engineering |
|---|---|---|---|
| LightGBM + manual features | 62.44 | 12.3 hours | 878 lines of code |
| LLM (Llama 3.2 3B) | 68.06 | Minutes | None (text serialization) |
| Supervised GNN (RDL) | 75.83 | ~30 min training | None |
| KumoRFM zero-shot | 76.71 | <1 second | None |
| KumoRFM fine-tuned | 81.14 | Minutes fine-tuning | None |
KumoRFM zero-shot outperforms all approaches without any task-specific training. Fine-tuning adds another 4.4 AUROC points. The 14-point gap between LightGBM and KumoRFM is the cost of information loss during flattening.
When to use which
| scenario | best_approach | why |
|---|---|---|
| Single flat table, mature features | Traditional ML | No multi-table structure to exploit |
| Multi-table, one high-value task | RDL (train GNN) | Worth the training investment |
| Multi-table, many tasks | Foundation model | Zero-shot across all tasks |
| Regulatory interpretability required | Traditional ML | Linear models are easier to audit |
| Speed to first prediction critical | Foundation model | Seconds vs weeks |
| Rapid prototyping / exploration | Foundation model | Test 40 questions in a day |
The choice depends on your data structure, number of tasks, and time constraints. Foundation models have the largest advantage when data spans multiple tables and you need predictions fast.
What traditional ML does well
Traditional ML (gradient boosted trees, logistic regression, random forests) has earned its place. For the right problem, it is fast, interpretable, and well-understood.
Single-table problems
If your prediction task lives in a single table with clean, well-understood features, traditional ML is hard to beat on efficiency. A credit scoring model built on a flat table of 50 pre-computed features does not benefit from multi-table graph learning because there is no multi-table structure to exploit.
On single-table datasets, gradient boosted trees still win most Kaggle tabular competitions. This is not surprising: these models were designed specifically for flat tabular data, and they have had 20 years of optimization for that format.
Well-engineered feature sets
If your company has invested years building a mature feature store with hundreds of carefully curated features, those features encode valuable domain knowledge. A traditional model trained on these features benefits from both statistical learning and human insight.
The challenge is the upfront cost: that feature store took years to build, and it is expensive to maintain. But if it already exists, the marginal cost of training another model on it is low.
Regulatory interpretability
In regulated industries like banking and insurance, model interpretability is sometimes a legal requirement. A logistic regression or small decision tree with 20 features is easy to explain. Every feature has a known coefficient or split point. Regulators can audit it. Customers can receive explanations.
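A minimal sketch of what that auditability looks like, using synthetic data and illustrative feature names:

```python
# Interpretability sketch: every logistic-regression feature maps to one
# signed coefficient a regulator can inspect. Data and names are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))  # stand-ins for income, utilization, age
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.3, size=1000) > 0).astype(int)

model = LogisticRegression().fit(X, y)
for name, coef in zip(["income", "utilization", "age"], model.coef_[0]):
    print(f"{name}: {coef:+.2f}")  # signed, auditable effect per feature
```

The entire model is those three numbers plus an intercept, which is exactly what makes it easy to explain to an auditor or a customer.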
Foundation models are not black boxes (feature importance and attention weights can be extracted), but the interpretability story is less mature than for linear models.
What foundation models change
Foundation models for relational data change three things fundamentally: how data is consumed, how long predictions take, and how many tasks a single model can handle.
1. Multi-table data without flattening
Enterprise databases are relational. Customers connect to orders connect to products connect to reviews connect to other customers. Traditional ML requires flattening this into a single table, which means choosing which tables to join, which columns to aggregate, and which time windows to apply. A Stanford study measured this at 12.3 hours and 878 lines of code per task.
Relational tables (e-commerce database)
| table | rows | example_columns | foreign_keys |
|---|---|---|---|
| customers | 500K | customer_id, segment, signup_date | — |
| orders | 12M | order_id, customer_id, total, date | customer_id |
| products | 80K | product_id, category, brand, price | — |
| order_items | 35M | item_id, order_id, product_id, qty | order_id, product_id |
| reviews | 4M | review_id, customer_id, product_id, stars | customer_id, product_id |
| support | 1.2M | ticket_id, customer_id, category, resolved_hrs | customer_id |
Six tables, 52M+ rows, multiple join paths. To predict churn, traditional ML must flatten this into one row per customer. The foundation model reads all six tables directly.
Flat feature table (what traditional ML produces from 6 tables)
| customer_id | orders_30d | avg_order_val | categories_bought | avg_review_stars | tickets_30d | days_since_order |
|---|---|---|---|---|---|---|
| C-101 | 4 | $67.30 | 5 | 4.2 | 0 | 3 |
| C-102 | 0 | $0 | 0 | — | 3 | 74 |
| C-103 | 6 | $45.80 | 8 | 3.8 | 1 | 1 |
52 million rows across 6 tables compressed into 7 columns per customer. The multi-hop pattern 'C-102 gave 1-star reviews to products that other high-value customers also reviewed poorly' is gone. The temporal escalation of C-102's support tickets is gone. Only aggregates survive.
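A pandas sketch of the flattening step makes the loss concrete. The toy rows below are illustrative stand-ins, not the exact tables above:

```python
# Flattening sketch: every join path collapses into per-customer aggregates.
import pandas as pd

customers = pd.DataFrame({"customer_id": ["C-101", "C-102"]})
orders = pd.DataFrame({
    "customer_id": ["C-101", "C-101", "C-101", "C-101"],
    "total": [70.0, 64.6, 66.0, 68.6],
})
reviews = pd.DataFrame({
    "customer_id": ["C-101", "C-102", "C-102"],
    "product_id": ["P-9", "P-3", "P-7"],
    "stars": [5, 1, 1],
})

flat = (
    customers
    .merge(orders.groupby("customer_id")
                 .agg(orders_cnt=("total", "size"),
                      avg_order_val=("total", "mean"))
                 .reset_index(), how="left", on="customer_id")
    .merge(reviews.groupby("customer_id")
                  .agg(avg_review_stars=("stars", "mean"))
                  .reset_index(), how="left", on="customer_id")
)
# product_id never reaches the flat table: which products C-102 rated
# poorly, and who else reviewed those products, cannot be recovered here.
```

Once `groupby().agg()` runs, the multi-hop structure is unrecoverable: the model downstream sees `avg_review_stars = 1.0` and nothing else.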
Foundation models skip this step entirely. KumoRFM represents the database as a temporal heterogeneous graph and learns patterns directly from the raw relational structure. No flattening, no feature engineering, no information loss.
The impact on accuracy is significant. On RelBench classification tasks, LightGBM with manual features scores 62.44 AUROC. KumoRFM zero-shot scores 76.71. The gap is not due to a better model architecture. It is due to the model seeing the full relational structure instead of a lossy summary.
2. Zero-shot prediction
Traditional ML requires training a new model for every prediction task. Want to predict churn? Train a model. Upsell? Train another. Fraud? Another. Each model needs its own feature engineering, data pipeline, training run, and deployment.
A foundation model is pre-trained. It has learned universal patterns (recency, frequency, temporal dynamics, graph topology) from thousands of diverse databases. At inference time, you point it at your data and ask a question. No training required.
This changes the economics of ML. Instead of a 2-month project per prediction task, you get answers in seconds. A team that could build 4 models per quarter can now explore 40 prediction questions in a day.
3. Cross-task generalization
Traditional ML models are narrow. A churn model knows nothing about fraud. A recommendation model knows nothing about demand forecasting. Each model starts from scratch.
Foundation models transfer knowledge across tasks. The patterns that predict churn (declining engagement, increasing support load, behavioral shifts) are structurally similar to the patterns that predict other outcomes. A model that has seen these patterns across thousands of databases recognizes them in yours immediately.
Traditional ML
- Requires flat feature table as input
- 12.3 hours of feature engineering per task
- One model per prediction task
- Strong on single-table, well-featured data
- Mature interpretability tooling
Foundation model (KumoRFM)
- Reads raw relational tables directly
- Zero feature engineering required
- One model handles any prediction task
- 14+ AUROC points better on multi-table data
- Seconds to first prediction, not weeks
The benchmark evidence
RelBench is the standard benchmark for ML on relational data. It includes 7 databases, 30 prediction tasks, and over 103 million rows. The results tell a clear story:
| Approach | AUROC (classification) | Time per task |
|---|---|---|
| LightGBM + manual features | 62.44 | 12.3 hours |
| LLM (Llama 3.2 3B) | 68.06 | Minutes (but poor accuracy) |
| Supervised GNN (RDL) | 75.83 | ~30 minutes training |
| KumoRFM zero-shot | 76.71 | <1 second |
| KumoRFM fine-tuned | 81.14 | Minutes of fine-tuning |
Two things stand out. First, the gap between manual features (62.44) and the graph-based approaches (75-81) is large. This is the cost of information loss during flattening. Second, KumoRFM zero-shot (no task-specific training) already outperforms a supervised GNN that was trained specifically for each task.
PQL Query
PREDICT upsell_probability FOR EACH customers.customer_id WHERE customers.plan = 'basic'
With a foundation model, a new prediction task is a new query, not a new project. The same model that predicts churn also predicts upsell, fraud, demand, and any other relational question. No feature engineering, no retraining.
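As a sketch of that cost structure: the class and method names below are hypothetical stand-ins for a foundation-model client, not the actual KumoRFM SDK.

```python
# Hypothetical client sketch: the per-task artifact is a query string,
# not a pipeline. FoundationModelClient and its methods are illustrative.
class FoundationModelClient:
    def __init__(self, connection_string):
        # Points at the raw relational tables; no flattening step exists.
        self.connection_string = connection_string

    def predict(self, pql_query):
        # A real client would run zero-shot inference here; this stub only
        # shows that a new task adds a query, nothing more.
        return {"query": pql_query, "status": "submitted"}

client = FoundationModelClient("postgres://localhost/ecommerce")  # hypothetical DSN
churn = client.predict("PREDICT churn FOR EACH customers.customer_id")
upsell = client.predict(
    "PREDICT upsell_probability FOR EACH customers.customer_id "
    "WHERE customers.plan = 'basic'"
)
# Same model, same connection: only the question changed.
```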
Output
| customer_id | upsell_prob | recommended_plan | top_signal |
|---|---|---|---|
| C-8801 | 0.89 | Enterprise | Usage exceeds plan limits 3x/week |
| C-8802 | 0.72 | Pro | Team size grew from 3 to 8 in 60 days |
| C-8803 | 0.31 | Pro | Moderate usage, no growth signals |
| C-8804 | 0.14 | Stay Basic | Low engagement, price-sensitive segment |
What stays the same
Foundation models do not change everything. Several fundamentals of ML remain exactly the same.
Data quality still matters
A foundation model that reads garbage tables produces garbage predictions. Missing values, incorrect timestamps, broken foreign keys, duplicated rows, and stale data all degrade performance. The model is better at handling messy data than manual feature engineering (which often breaks on edge cases), but it is not magic.
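A few sanity checks catch the most common relational defects before any model, traditional or foundation, sees the data. The toy rows below are illustrative:

```python
# Relational sanity checks: broken foreign keys and missing timestamps
# degrade predictions regardless of which model family consumes the data.
import pandas as pd

customers = pd.DataFrame({"customer_id": ["C-101", "C-102", "C-103"]})
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": ["C-101", "C-999", "C-103"],  # C-999 is a broken FK
    "date": pd.to_datetime(["2024-05-01", None, "2024-05-03"]),
})

broken_fk = ~orders["customer_id"].isin(customers["customer_id"])
missing_ts = orders["date"].isna()

print(f"orders with broken foreign keys: {broken_fk.sum()}")
print(f"orders with missing timestamps:  {missing_ts.sum()}")
```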
Problem framing still matters
Choosing the right prediction target, the right entity, the right time horizon, and the right evaluation metric still requires human judgment. A foundation model can predict churn in 1 second, but deciding whether to predict 30-day churn or 90-day churn, and what to do about the predictions, is still a business decision.
Deployment still matters
Getting predictions into production systems, monitoring model performance, handling drift, and integrating with downstream workflows are engineering problems that exist regardless of how the model was built.
Evaluation still matters
You still need holdout sets, proper temporal splits, and meaningful metrics. A foundation model makes it faster to get predictions, but you still need to verify they are good before acting on them.
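A proper temporal split is a few lines regardless of which model family you evaluate. The toy events table below is illustrative:

```python
# Temporal split sketch: train on history before a cutoff, evaluate after.
# A random split would leak future behavior into training.
import pandas as pd

events = pd.DataFrame({
    "customer_id": ["C-101", "C-102", "C-103", "C-101", "C-102"],
    "date": pd.to_datetime(
        ["2024-01-05", "2024-02-10", "2024-03-01", "2024-04-12", "2024-05-20"]
    ),
    "label": [0, 0, 1, 1, 0],
})

cutoff = pd.Timestamp("2024-04-01")
train = events[events["date"] < cutoff]
test = events[events["date"] >= cutoff]
# Every test-set label lies strictly after everything seen in training.
```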
Decision framework
Here is a practical framework for choosing between the two approaches.
Use traditional ML when
- Your data lives in a single table with pre-computed features
- You have a mature feature store that already covers this use case
- Strict regulatory interpretability requirements apply (linear models only)
- The prediction task is simple, well-understood, and has been stable for years
Use a foundation model when
- Your data spans multiple relational tables
- You need predictions fast (days, not months)
- You have many prediction tasks and cannot afford to build custom pipelines for each one
- Your data science team spends most of their time on feature engineering instead of business problems
- You suspect there are predictive signals in cross-table relationships that your current features do not capture
Use both when
- You have existing high-value models in production that work well, and you want to use foundation models for new prediction tasks or rapid prototyping
- You want to benchmark your current approach against a foundation model to quantify the accuracy gap
- You are migrating incrementally, keeping proven models while building new ones with the foundation model approach
The real shift
The shift from traditional ML to foundation models for structured data is not about better algorithms. It is about removing a structural bottleneck.
For two decades, the ML pipeline has required converting relational data into flat tables. This conversion is lossy (multi-hop patterns disappear), slow (12.3 hours per task), and scales linearly with the number of prediction tasks. Every new question means another round of feature engineering.
Foundation models remove that conversion step. They read relational data directly. The result is faster predictions, higher accuracy on multi-table data, and a fundamentally different cost structure where the marginal cost of a new prediction task approaches zero.
Traditional ML is not dead. But the set of problems where it is the best answer is shrinking. If your data is relational and your bottleneck is feature engineering, the foundation model approach is not just faster. It is better.