Open any data science job posting and you will see the same stack: Python, SQL, XGBoost, maybe PyTorch. Open any production ML system at a Fortune 500 company and you will find the same architecture: a data pipeline that flattens relational tables into a feature table, a gradient boosting model trained on that table, and a serving layer that scores new records. This workflow has not meaningfully changed since 2015.
It works. XGBoost and LightGBM are genuinely excellent algorithms for flat tabular data. But the workflow has a cost that most teams accept as inevitable: 80% of project time is spent building the flat table, not building the model. A Stanford study measured this directly. Experienced data scientists spent an average of 12.3 hours and 878 lines of code per prediction task, and the vast majority of that effort went to feature engineering.
The model selection step, the part that XGBoost excels at, took a fraction of the total time. Teams are optimizing the fast part and ignoring the slow part.
What predictive modeling actually involves
Predictive modeling is the process of learning a function from historical data that maps inputs to a predicted output. Given a set of features about a customer (purchase history, account age, support interactions), predict whether they will churn in the next 30 days. Given features about a transaction (amount, merchant, time, device), predict whether it is fraudulent.
The full pipeline has five steps, and the time distribution is wildly uneven.
Step 1: Data collection and understanding (5-10% of time). Identify which tables contain relevant data, understand the schema, map foreign key relationships.
Step 2: Feature engineering (60-80% of time). Write SQL joins across multiple tables. Compute aggregations (count of orders in last 30 days, average order value, days since last purchase). Handle time windows, missing values, categorical encoding. Iterate on feature sets based on model performance.
Step 3: Model training (5-10% of time). Split data, train a model (usually XGBoost or LightGBM), evaluate on a holdout set. Often the fastest step.
Step 4: Hyperparameter tuning (5-10% of time). Grid search or Bayesian optimization over learning rate, tree depth, number of estimators.
Step 5: Deployment and monitoring (10-15% of time). Build a serving pipeline, monitor for data drift, retrain on a schedule.
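To make the Step 2 grind concrete, here is a minimal sketch of one aggregation pass in pandas. The table and column names are illustrative, not from any real schema; a production pipeline repeats this pattern hundreds of times across many tables.

```python
import pandas as pd

# Toy orders table; a real pipeline would pull this from a warehouse.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "order_date": pd.to_datetime(["2025-11-10", "2025-12-01", "2025-12-05"]),
    "amount": [40.0, 60.0, 25.0],
})

# Time-window logic: only orders in the 30 days before the cutoff count.
cutoff = pd.Timestamp("2025-12-15")
recent = orders[orders["order_date"] >= cutoff - pd.Timedelta(days=30)]

# One engineered feature per aggregation -- count and average order value.
features = recent.groupby("customer_id").agg(
    orders_last_30d=("order_date", "count"),
    avg_order_value=("amount", "mean"),
)
print(features)
```

Each new feature idea means another join, another window, another re-run, which is where the 60-80% of project time goes.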
Here is what this looks like with real data. A bank wants to predict loan defaults. The data lives across three tables.
applicants
| applicant_id | name | income | employment | credit_score |
|---|---|---|---|---|
| A201 | Rachel Torres | $78,000 | Salaried | 742 |
| A202 | Kevin Nguyen | $54,000 | Self-employed | 681 |
| A203 | Priya Sharma | $112,000 | Salaried | 789 |
| A204 | Marcus Hall | $61,000 | Contract | 654 |
loans
| loan_id | applicant_id | amount | term | rate | status |
|---|---|---|---|---|---|
| L301 | A201 | $24,000 | 36 mo | 6.2% | Current |
| L302 | A201 | $8,500 | 12 mo | 5.8% | Paid off |
| L303 | A202 | $45,000 | 60 mo | 9.1% | Current |
| L304 | A204 | $18,000 | 24 mo | 11.4% | Late 30d |
| L305 | A204 | $6,200 | 12 mo | 10.8% | Late 60d |
Note the pattern: Marcus has two loans, both delinquent. A flat feature table would show 'num_loans = 2, any_late = Yes' but would miss that both are deteriorating simultaneously.
payment_history
| payment_id | loan_id | due_date | paid_date | amount |
|---|---|---|---|---|
| PH01 | L304 | 2025-10-01 | 2025-10-28 | $812 |
| PH02 | L304 | 2025-11-01 | 2025-12-03 | $812 |
| PH03 | L304 | 2025-12-01 | --- | $812 |
| PH04 | L305 | 2025-11-15 | 2025-12-20 | $558 |
| PH05 | L305 | 2025-12-15 | --- | $558 |
Note the missed payments: the gap between due_date and paid_date is widening, a temporal pattern that aggregation destroys.
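That widening gap is easy to compute per payment but impossible to keep in a single aggregate. A small sketch, mirroring the payment_history table above (the missing paid_date values become NaT):

```python
import pandas as pd

# Same rows as the payment_history table; "---" (unpaid) becomes NaT.
payments = pd.DataFrame({
    "loan_id": ["L304", "L304", "L304", "L305", "L305"],
    "due_date": pd.to_datetime(
        ["2025-10-01", "2025-11-01", "2025-12-01", "2025-11-15", "2025-12-15"]),
    "paid_date": pd.to_datetime(
        ["2025-10-28", "2025-12-03", None, "2025-12-20", None]),
})
payments["days_late"] = (payments["paid_date"] - payments["due_date"]).dt.days

# Per-loan sequence: L304 goes 27 -> 32 -> unpaid. The trend is the
# signal; a single avg_days_late feature cannot represent it.
print(payments.groupby("loan_id")["days_late"].apply(list))
```

The sequence (27 days late, then 32, then unpaid) is exactly the kind of trajectory a summary statistic collapses.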
The evolution of predictive modeling techniques
Era 1: Statistical models (1950s-2000s)
Logistic regression, linear regression, ARIMA for time series. These models are interpretable, fast to train, and still useful for problems with clear linear relationships and small feature sets. A logistic regression for credit scoring with 20 hand-picked features remains a production workhorse at many banks, partly because regulators require explainability.
Limitation: they assume linear relationships between features and outcomes. Real-world patterns are rarely linear.
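For reference, Era 1 in miniature: a logistic regression credit-scoring sketch on synthetic data (feature names are illustrative). Its appeal is the interpretable per-feature coefficient, which is what regulators ask for.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: the label depends positively on the first feature
# and negatively on the second, plus noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = (0.8 * X[:, 0] - 0.5 * X[:, 1]
     + rng.normal(scale=0.5, size=300) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Sign and magnitude of each coefficient explain the feature's effect.
for name, coef in zip(["income", "utilization", "account_age"], model.coef_[0]):
    print(f"{name}: {coef:+.2f}")
```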
Era 2: Ensemble methods (2000s-2020s)
Random forests, gradient boosting, and their optimized variants (XGBoost, LightGBM, CatBoost). This is the current workhorse era. XGBoost was used in 17 of 29 winning Kaggle solutions in 2015 alone. LightGBM, open-sourced by Microsoft in 2016, added speed improvements that made it practical for datasets with millions of rows.
These models handle non-linear relationships, feature interactions, and missing values gracefully. On a well-engineered flat feature table, they are hard to beat. The key phrase is "well-engineered." Their performance is bounded by the quality of the input features, and those features still come from manual engineering.
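The Era 2 training step itself is short. A minimal sketch on a synthetic flat feature table, using scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost/LightGBM (the API shape is nearly identical):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for a well-engineered flat feature table.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                   # 5 engineered features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # simple synthetic target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Train on the flat table, evaluate on a holdout set.
model = GradientBoostingClassifier().fit(X_train, y_train)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"holdout AUROC: {auc:.3f}")
```

A dozen lines; the other 800-plus lines of a real project live upstream, in building X.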
Era 3: Deep learning for tabular data (2020s)
TabNet, FT-Transformer, and other architectures attempted to bring deep learning's success on images and text to tabular data. Results were mixed. A 2022 paper by Grinsztajn, Oyallon, and Varoquaux at NeurIPS found that tree-based models still outperform deep learning on most tabular benchmarks when given the same flat features. Deep learning's advantage on images comes from learning features directly from raw pixels. On flat tables, the features are already engineered, so there is less for deep learning to discover.
Era 4: Foundation models for relational data (2024-present)
This is the shift that changes the equation. Instead of building a model that operates on a flat table, build a model that operates on the relational database directly. Represent the database as a temporal heterogeneous graph (rows are nodes, foreign keys are edges, timestamps create temporal ordering) and use graph transformers to learn patterns across the full structure.
The critical insight: deep learning underperformed on flat tables because the hard work (feature engineering) was already done. When you give deep learning access to the raw relational structure, it has something to learn that flat-table models cannot access.
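A toy sketch of the Era 4 representation, using the loan tables above: rows become nodes, foreign keys become edges, and timestamps order events. Real systems use graph learning libraries; this plain-Python version just makes the structure concrete.

```python
# Rows as nodes, keyed by table and primary key.
applicants = {"A204": {"income": 61_000, "credit_score": 654}}
loans = {
    "L304": {"applicant_id": "A204", "amount": 18_000},
    "L305": {"applicant_id": "A204", "amount": 6_200},
}
payments = [
    {"loan_id": "L304", "due": "2025-10-01", "paid": "2025-10-28"},
    {"loan_id": "L305", "due": "2025-11-15", "paid": "2025-12-20"},
]

# Edges follow foreign keys: payment -> loan -> applicant. A graph model
# can walk these hops directly instead of relying on pre-aggregated features.
edges = [("loan", lid, "applicant", row["applicant_id"])
         for lid, row in loans.items()]
edges += [("payment", i, "loan", p["loan_id"])
          for i, p in enumerate(payments)]
print(edges)
```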
The XGBoost ceiling
To understand why the field is moving beyond gradient boosting, look at the RelBench benchmark. RelBench, published at NeurIPS 2024, tests ML methods on 7 real-world relational databases across 30 prediction tasks with a total of 103 million rows.
A Stanford-trained data scientist manually engineered features for each task, then trained LightGBM. The result: 62.44 average AUROC across classification tasks. That is the ceiling for the traditional approach: a skilled human spending hours per task, feeding the best flat-table algorithm available.
A graph neural network trained directly on the relational structure, with no manual feature engineering, achieved 75.83 AUROC. Same data, same tasks, 13 points higher. The GNN found patterns that the data scientist could not enumerate: multi-hop relationships spanning 3-4 tables, temporal sequences within aggregation windows, and cross-entity propagation effects.
KumoRFM, a foundation model pre-trained across thousands of diverse relational databases, achieved 76.71 AUROC zero-shot, meaning it had never seen the test databases during training. With fine-tuning, it reached 81.14. Zero effort from the data scientist. Higher accuracy than the human-engineered approach.
Traditional workflow (2015-present)
- Flatten relational tables into one wide table
- Engineer 100-500 features per task manually
- Train XGBoost/LightGBM on the flat table
- 878 lines of code, 12.3 hours per task
- LightGBM: 62.44 AUROC on RelBench
Foundation model workflow
- Point model at raw relational database
- Describe prediction task in 1 line of PQL
- Model reads multi-table structure directly
- 1 line of code, seconds per task
- KumoRFM: 76.71 AUROC zero-shot on RelBench
Why the flat table assumption is the bottleneck
Every predictive modeling technique from Era 1 through Era 3 shares one assumption: the input is a flat table with one row per entity and one column per feature. This means someone (or some tool) must convert the relational database into that format before modeling can begin.
This conversion is lossy. Consider three types of signal that are destroyed by flattening.
Temporal sequences destroyed by aggregation
Consider two borrowers, both with 4 payments in their history. Same count. Completely different trajectories.
payment_history: Borrower X (recovering)
| payment_id | loan_id | due_date | paid_date | days_late |
|---|---|---|---|---|
| PH-10 | L401 | 2025-07-01 | 2025-07-28 | 27 |
| PH-11 | L401 | 2025-08-01 | 2025-08-18 | 17 |
| PH-12 | L401 | 2025-09-01 | 2025-09-08 | 7 |
| PH-13 | L401 | 2025-10-01 | 2025-10-02 | 1 |
Borrower X was 27 days late on the first payment, then 17, then 7, then 1. Clear recovery trajectory. This borrower is getting back on track.
payment_history: Borrower Y (deteriorating)
| payment_id | loan_id | due_date | paid_date | days_late |
|---|---|---|---|---|
| PH-20 | L402 | 2025-07-01 | 2025-07-03 | 2 |
| PH-21 | L402 | 2025-08-01 | 2025-08-10 | 9 |
| PH-22 | L402 | 2025-09-01 | 2025-09-19 | 18 |
| PH-23 | L402 | 2025-10-01 | 2025-10-31 | 30 |
Borrower Y started near-on-time, then slipped: 2, 9, 18, 30 days late. Accelerating delinquency. This borrower is heading toward default.
flat_feature_table (what XGBoost sees)
| borrower | payment_count | avg_days_late | any_late_30d | reality |
|---|---|---|---|---|
| Borrower X | 4 | 13.0 | No | Recovering (low risk) |
| Borrower Y | 4 | 14.8 | Yes | Deteriorating (high risk) |
Both borrowers have 4 payments and similar average lateness. The flat table shows almost identical profiles. The recovery vs. deterioration trajectory, the strongest predictor of default, is invisible.
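The collapse in the flat table above can be reproduced in a few lines. Both sequences aggregate to nearly identical summary features:

```python
import statistics

x_days_late = [27, 17, 7, 1]   # Borrower X: recovering
y_days_late = [2, 9, 18, 30]   # Borrower Y: deteriorating

# The same aggregates as the flat_feature_table columns.
for name, seq in [("X", x_days_late), ("Y", y_days_late)]:
    feats = {
        "payment_count": len(seq),
        "avg_days_late": statistics.mean(seq),
        "any_late_30d": any(d >= 30 for d in seq),
    }
    print(name, feats)
# The aggregates differ only slightly, yet the sequences point in
# opposite directions: the trend is the lost signal.
```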
Multi-hop relationships
A user's fraud risk might depend on the fraud history of other users who share the same device fingerprint, IP address, or shipping address. That signal spans three joins. Flat-table models cannot see it unless someone engineers it explicitly.
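Engineering that multi-hop signal by hand looks like this (a hedged sketch with illustrative table and column names): flag users who share a device with a user who has prior confirmed fraud.

```python
import pandas as pd

# Hop 1 data: user -> device (e.g. from session logs).
sessions = pd.DataFrame({
    "user_id":   ["u1", "u2", "u3"],
    "device_id": ["d9", "d9", "d7"],
})
# Hop 3 data: known fraud labels on other users.
fraud_labels = pd.DataFrame({"user_id": ["u2"], "is_fraud": [True]})

# Join user -> fraud history to find devices touched by fraudsters,
# then join back device -> users to propagate the risk.
flagged_devices = sessions.merge(fraud_labels, on="user_id")["device_id"]
risky = sessions[sessions["device_id"].isin(flagged_devices)]
print(sorted(risky["user_id"]))   # u1 inherits risk from u2 via shared device d9
```

Every such pattern must be imagined, written, and maintained by hand; a graph model traverses the same hops without anyone enumerating them.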
Graph topology
In a social network or transaction graph, the structure of connections matters. A user connected to a tight cluster of high-value customers behaves differently from one connected to a sparse, low-activity subgraph. Aggregating connection counts loses the topology.
What PQL looks like in practice
KumoRFM introduces Predictive Query Language (PQL), which expresses prediction tasks the way SQL expresses data retrieval. Instead of writing a 200-line feature engineering pipeline, you write one line.
PQL Query
PREDICT loans.status = 'Default' FOR EACH applicants.applicant_id
One line replaces the entire feature engineering pipeline for loan default prediction. The model reads applicants, loans, and payment_history as a graph and discovers the cross-table patterns automatically.
Output
| applicant_id | default_probability | top_signal |
|---|---|---|
| A201 | 0.04 | Strong payment history, declining DTI |
| A202 | 0.31 | High DTI, self-employed income volatility |
| A203 | 0.02 | High income, excellent credit, no delinquency |
| A204 | 0.87 | Widening payment gaps on both loans |
Churn prediction: PREDICT customer.will_churn(30d). The model reads the full relational graph surrounding each customer node, discovers which cross-table patterns predict churn, and returns a probability per customer.
Demand forecasting: PREDICT product.units_sold(store, next_week). Same model, same database, different task. No new feature engineering.
The operational difference is profound. A data science team that delivers 4 predictive models per year (limited by feature engineering time) can evaluate dozens of prediction tasks per week. The bottleneck shifts from "can we build this model" to "which predictions create the most business value."
When to use which approach
The traditional workflow is not obsolete. If your data genuinely fits in a single table (sensor readings, pre-aggregated feature stores, clean data warehouse exports), XGBoost remains an excellent choice. It is fast, well-understood, and interpretable with SHAP values.
The foundation model approach becomes necessary when: your data spans multiple relational tables, you need predictions across many different tasks, temporal patterns matter, or your team cannot afford months of feature engineering per model. For most enterprise use cases, this describes the actual situation.
DoorDash applied this approach to their recommendation system and saw a 1.8% engagement lift across 30 million users. Databricks used it for lead scoring and achieved a 5.4x conversion lift. These results came from patterns in the relational structure that flat-table models could not access, regardless of how much feature engineering was applied.
The predictive modeling field spent a decade optimizing the algorithm (XGBoost, LightGBM, CatBoost, TabNet). The next decade will be about eliminating the data preparation step that makes the algorithm choice almost irrelevant. The model matters less than the data it can see. Foundation models see everything.