Open any data science job posting and you will see the same stack: Python, SQL, XGBoost, maybe PyTorch. Open any production ML system at a Fortune 500 company and you will find the same architecture: a data pipeline that flattens relational tables into a feature table, a gradient boosting model trained on that table, and a serving layer that scores new records. This workflow has not meaningfully changed since 2015.
It works. XGBoost and LightGBM are genuinely excellent algorithms for flat tabular data. But the workflow has a cost that most teams accept as inevitable: 80% of project time is spent building the flat table, not building the model. A Stanford study measured this directly. Experienced data scientists spent an average of 12.3 hours and 878 lines of code per prediction task, and the vast majority of that effort went to feature engineering.
The model selection step, the part that XGBoost excels at, took a fraction of the total time. Teams are optimizing the fast part and ignoring the slow part.
What predictive modeling actually involves
Predictive modeling is the process of learning a function from historical data that maps inputs to a predicted output. Given a set of features about a customer (purchase history, account age, support interactions), predict whether they will churn in the next 30 days. Given features about a transaction (amount, merchant, time, device), predict whether it is fraudulent.
The full pipeline has five steps, and the time distribution is wildly uneven.
Step 1: Data collection and understanding (5-10% of time). Identify which tables contain relevant data, understand the schema, map foreign key relationships.
Step 2: Feature engineering (60-80% of time). Write SQL joins across multiple tables. Compute aggregations (count of orders in last 30 days, average order value, days since last purchase). Handle time windows, missing values, categorical encoding. Iterate on feature sets based on model performance.
Step 3: Model training (5-10% of time). Split data, train a model (usually XGBoost or LightGBM), evaluate on a holdout set. Often the fastest step.
Step 4: Hyperparameter tuning (5-10% of time). Grid search or Bayesian optimization over learning rate, tree depth, number of estimators.
Step 5: Deployment and monitoring (10-15% of time). Build a serving pipeline, monitor for data drift, retrain on a schedule.
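To make the Step 2 grind concrete, here is a minimal sketch of one aggregation pass in pandas. The table and column names are illustrative, not from any real schema; a production pipeline repeats this pattern hundreds of times across many tables.

```python
import pandas as pd

# Toy orders table; a real pipeline would pull this from a warehouse.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "order_date": pd.to_datetime(["2025-11-10", "2025-12-01", "2025-12-05"]),
    "amount": [40.0, 60.0, 25.0],
})

# Time-window logic: only orders in the 30 days before the cutoff count.
cutoff = pd.Timestamp("2025-12-15")
recent = orders[orders["order_date"] >= cutoff - pd.Timedelta(days=30)]

# One engineered feature per aggregation -- count and average order value.
features = recent.groupby("customer_id").agg(
    orders_last_30d=("order_date", "count"),
    avg_order_value=("amount", "mean"),
)
print(features)
```

Each new feature idea means another join, another window, another re-run, which is where the 60-80% of project time goes.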
Here is what this looks like with real data. A bank wants to predict loan defaults. The data lives across three tables.
applicants
| applicant_id | name | income | employment | credit_score |
|---|---|---|---|---|
| A201 | Rachel Torres | $78,000 | Salaried | 742 |
| A202 | Kevin Nguyen | $54,000 | Self-employed | 681 |
| A203 | Priya Sharma | $112,000 | Salaried | 789 |
| A204 | Marcus Hall | $61,000 | Contract | 654 |
loans
| loan_id | applicant_id | amount | term | rate | status |
|---|---|---|---|---|---|
| L301 | A201 | $24,000 | 36 mo | 6.2% | Current |
| L302 | A201 | $8,500 | 12 mo | 5.8% | Paid off |
| L303 | A202 | $45,000 | 60 mo | 9.1% | Current |
| L304 | A204 | $18,000 | 24 mo | 11.4% | Late 30d |
| L305 | A204 | $6,200 | 12 mo | 10.8% | Late 60d |
Note the pattern: Marcus has two loans, both delinquent. A flat feature table would show 'num_loans = 2, any_late = Yes' but would miss that both are deteriorating simultaneously.
payment_history
| payment_id | loan_id | due_date | paid_date | amount |
|---|---|---|---|---|
| PH01 | L304 | 2025-10-01 | 2025-10-28 | $812 |
| PH02 | L304 | 2025-11-01 | 2025-12-03 | $812 |
| PH03 | L304 | 2025-12-01 | --- | $812 |
| PH04 | L305 | 2025-11-15 | 2025-12-20 | $558 |
| PH05 | L305 | 2025-12-15 | --- | $558 |
Note the missed payments: the gap between due_date and paid_date is widening, a temporal pattern that aggregation destroys.
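That widening gap is easy to compute per payment but impossible to keep in a single aggregate. A small sketch, mirroring the payment_history table above (the missing paid_date values become NaT):

```python
import pandas as pd

# Same rows as the payment_history table; "---" (unpaid) becomes NaT.
payments = pd.DataFrame({
    "loan_id": ["L304", "L304", "L304", "L305", "L305"],
    "due_date": pd.to_datetime(
        ["2025-10-01", "2025-11-01", "2025-12-01", "2025-11-15", "2025-12-15"]),
    "paid_date": pd.to_datetime(
        ["2025-10-28", "2025-12-03", None, "2025-12-20", None]),
})
payments["days_late"] = (payments["paid_date"] - payments["due_date"]).dt.days

# Per-loan sequence: L304 goes 27 -> 32 -> unpaid. The trend is the
# signal; a single avg_days_late feature cannot represent it.
print(payments.groupby("loan_id")["days_late"].apply(list))
```

The sequence (27 days late, then 32, then unpaid) is exactly the kind of trajectory a summary statistic collapses.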
The evolution of predictive modeling techniques
Era 1: Statistical models (1950s-2000s)
Logistic regression, linear regression, ARIMA for time series. These models are interpretable, fast to train, and still useful for problems with clear linear relationships and small feature sets. A logistic regression for credit scoring with 20 hand-picked features remains a production workhorse at many banks, partly because regulators require explainability.
Limitation: they assume linear relationships between features and outcomes. Real-world patterns are rarely linear.
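For reference, Era 1 in miniature: a logistic regression credit-scoring sketch on synthetic data (feature names are illustrative). Its appeal is the interpretable per-feature coefficient, which is what regulators ask for.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: the label depends positively on the first feature
# and negatively on the second, plus noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = (0.8 * X[:, 0] - 0.5 * X[:, 1]
     + rng.normal(scale=0.5, size=300) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Sign and magnitude of each coefficient explain the feature's effect.
for name, coef in zip(["income", "utilization", "account_age"], model.coef_[0]):
    print(f"{name}: {coef:+.2f}")
```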
Era 2: Ensemble methods (2000s-2020s)
Random forests, gradient boosting, and their optimized variants (XGBoost, LightGBM, CatBoost). This is the current workhorse era. XGBoost was used in 17 of 29 winning Kaggle solutions in 2015 alone. LightGBM, open-sourced by Microsoft in 2016, added speed improvements that made it practical for datasets with millions of rows.
These models handle non-linear relationships, feature interactions, and missing values gracefully. On a well-engineered flat feature table, they are hard to beat. The key phrase is "well-engineered." Their performance is bounded by the quality of the input features, and those features still come from manual engineering.
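The Era 2 training step itself is short. A minimal sketch on a synthetic flat feature table, using scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost/LightGBM (the API shape is nearly identical):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for a well-engineered flat feature table.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                   # 5 engineered features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # simple synthetic target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Train on the flat table, evaluate on a holdout set.
model = GradientBoostingClassifier().fit(X_train, y_train)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"holdout AUROC: {auc:.3f}")
```

A dozen lines; the other 800-plus lines of a real project live upstream, in building X.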
Era 3: Deep learning for tabular data (2020s)
TabNet, FT-Transformer, and other architectures attempted to bring deep learning's success on images and text to tabular data. Results were mixed. A 2022 paper by Grinsztajn, Oyallon, and Varoquaux at NeurIPS found that tree-based models still outperform deep learning on most tabular benchmarks when given the same flat features. Deep learning's advantage on images comes from learning features directly from raw pixels. On flat tables, the features are already engineered, so there is less for deep learning to discover.
Era 4: Foundation models for relational data (2024-present)
This is the shift that changes the equation. Instead of building a model that operates on a flat table, build a model that operates on the relational database directly. Represent the database as a temporal heterogeneous graph (rows are nodes, foreign keys are edges, timestamps create temporal ordering) and use graph transformers to learn patterns across the full structure.
The critical insight: deep learning underperformed on flat tables because the hard work (feature engineering) was already done. When you give deep learning access to the raw relational structure, it has something to learn that flat-table models cannot access.
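A toy sketch of the Era 4 representation, using the loan tables above: rows become nodes, foreign keys become edges, and timestamps order events. Real systems use graph learning libraries; this plain-Python version just makes the structure concrete.

```python
# Rows as nodes, keyed by table and primary key.
applicants = {"A204": {"income": 61_000, "credit_score": 654}}
loans = {
    "L304": {"applicant_id": "A204", "amount": 18_000},
    "L305": {"applicant_id": "A204", "amount": 6_200},
}
payments = [
    {"loan_id": "L304", "due": "2025-10-01", "paid": "2025-10-28"},
    {"loan_id": "L305", "due": "2025-11-15", "paid": "2025-12-20"},
]

# Edges follow foreign keys: payment -> loan -> applicant. A graph model
# can walk these hops directly instead of relying on pre-aggregated features.
edges = [("loan", lid, "applicant", row["applicant_id"])
         for lid, row in loans.items()]
edges += [("payment", i, "loan", p["loan_id"])
          for i, p in enumerate(payments)]
print(edges)
```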
The XGBoost ceiling
To understand why the field is moving beyond gradient boosting, look at the RelBench benchmark. RelBench, published at NeurIPS 2024, tests ML methods on 7 real-world relational databases across 30 prediction tasks with a total of 103 million rows.
A Stanford-trained data scientist manually engineered features for each task, then trained LightGBM. The result: 62.44 average AUROC across classification tasks. That is the ceiling for the traditional approach: a skilled human spending hours per task, feeding the best flat-table algorithm available.
A graph neural network trained directly on the relational structure, with no manual feature engineering, achieved 75.83 AUROC. Same data, same tasks, 13 points higher. The GNN found patterns that the data scientist could not enumerate: multi-hop relationships spanning 3-4 tables, temporal sequences within aggregation windows, and cross-entity propagation effects.
KumoRFM, a foundation model pre-trained across thousands of diverse relational databases, achieved 76.71 AUROC zero-shot, meaning it had never seen the test databases during training. With fine-tuning, it reached 81.14. Zero effort from the data scientist. Higher accuracy than the human-engineered approach.
Traditional workflow (2015-present)
- Flatten relational tables into one wide table
- Engineer 100-500 features per task manually
- Train XGBoost/LightGBM on the flat table
- 878 lines of code, 12.3 hours per task
- LightGBM: 62.44 AUROC on RelBench
Foundation model workflow
- Point model at raw relational database
- Describe prediction task in 1 line of PQL
- Model reads multi-table structure directly
- 1 line of code, seconds per task
- KumoRFM: 76.71 AUROC zero-shot on RelBench
Why the flat table assumption is the bottleneck
Every predictive modeling technique from Era 1 through Era 3 shares one assumption: the input is a flat table with one row per entity and one column per feature. This means someone (or some tool) must convert the relational database into that format before modeling can begin.
This conversion is lossy. Consider three types of signal that are destroyed by flattening.
Temporal sequences destroyed by aggregation
Consider two borrowers, both with 4 payments in their history. Same count. Completely different trajectories.
payment_history: Borrower X (recovering)
| payment_id | loan_id | due_date | paid_date | days_late |
|---|---|---|---|---|
| PH-10 | L401 | 2025-07-01 | 2025-07-28 | 27 |
| PH-11 | L401 | 2025-08-01 | 2025-08-18 | 17 |
| PH-12 | L401 | 2025-09-01 | 2025-09-08 | 7 |
| PH-13 | L401 | 2025-10-01 | 2025-10-02 | 1 |
Borrower X was 27 days late on the first payment, then 17, then 7, then 1. Clear recovery trajectory. This borrower is getting back on track.
payment_history: Borrower Y (deteriorating)
| payment_id | loan_id | due_date | paid_date | days_late |
|---|---|---|---|---|
| PH-20 | L402 | 2025-07-01 | 2025-07-03 | 2 |
| PH-21 | L402 | 2025-08-01 | 2025-08-10 | 9 |
| PH-22 | L402 | 2025-09-01 | 2025-09-19 | 18 |
| PH-23 | L402 | 2025-10-01 | 2025-10-31 | 30 |
Borrower Y started near-on-time, then slipped: 2, 9, 18, 30 days late. Accelerating delinquency. This borrower is heading toward default.
flat_feature_table (what XGBoost sees)
| borrower | payment_count | avg_days_late | any_late_30d | reality |
|---|---|---|---|---|
| Borrower X | 4 | 13.0 | No | Recovering (low risk) |
| Borrower Y | 4 | 14.8 | Yes | Deteriorating (high risk) |
Both borrowers have 4 payments and similar average lateness. The flat table shows almost identical profiles. The recovery vs. deterioration trajectory, the strongest predictor of default, is invisible.
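The collapse in the flat table above can be reproduced in a few lines. Both sequences aggregate to nearly identical summary features:

```python
import statistics

x_days_late = [27, 17, 7, 1]   # Borrower X: recovering
y_days_late = [2, 9, 18, 30]   # Borrower Y: deteriorating

# The same aggregates as the flat_feature_table columns.
for name, seq in [("X", x_days_late), ("Y", y_days_late)]:
    feats = {
        "payment_count": len(seq),
        "avg_days_late": statistics.mean(seq),
        "any_late_30d": any(d >= 30 for d in seq),
    }
    print(name, feats)
# The aggregates differ only slightly, yet the sequences point in
# opposite directions: the trend is the lost signal.
```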
Multi-hop relationships
A user's fraud risk might depend on the fraud history of other users who share the same device fingerprint, IP address, or shipping address. That signal spans three joins. Flat-table models cannot see it unless someone engineers it explicitly.
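Engineering that multi-hop signal by hand looks like this (a hedged sketch with illustrative table and column names): flag users who share a device with a user who has prior confirmed fraud.

```python
import pandas as pd

# Hop 1 data: user -> device (e.g. from session logs).
sessions = pd.DataFrame({
    "user_id":   ["u1", "u2", "u3"],
    "device_id": ["d9", "d9", "d7"],
})
# Hop 3 data: known fraud labels on other users.
fraud_labels = pd.DataFrame({"user_id": ["u2"], "is_fraud": [True]})

# Join user -> fraud history to find devices touched by fraudsters,
# then join back device -> users to propagate the risk.
flagged_devices = sessions.merge(fraud_labels, on="user_id")["device_id"]
risky = sessions[sessions["device_id"].isin(flagged_devices)]
print(sorted(risky["user_id"]))   # u1 inherits risk from u2 via shared device d9
```

Every such pattern must be imagined, written, and maintained by hand; a graph model traverses the same hops without anyone enumerating them.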
Graph topology
In a social network or transaction graph, the structure of connections matters. A user connected to a tight cluster of high-value customers behaves differently from one connected to a sparse, low-activity subgraph. Aggregating connection counts loses the topology.
What PQL looks like in practice
KumoRFM introduces Predictive Query Language (PQL), which expresses prediction tasks the way SQL expresses data retrieval. Instead of writing a 200-line feature engineering pipeline, you write one line.
PQL Query
PREDICT loans.status = 'Default' FOR EACH applicants.applicant_id
One line replaces the entire feature engineering pipeline for loan default prediction. The model reads applicants, loans, and payment_history as a graph and discovers the cross-table patterns automatically.
Output
| applicant_id | default_probability | top_signal |
|---|---|---|
| A201 | 0.04 | Strong payment history, declining DTI |
| A202 | 0.31 | High DTI, self-employed income volatility |
| A203 | 0.02 | High income, excellent credit, no delinquency |
| A204 | 0.87 | Widening payment gaps on both loans |
Churn prediction: PREDICT customer.will_churn(30d). The model reads the full relational graph surrounding each customer node, discovers which cross-table patterns predict churn, and returns a probability per customer.
Demand forecasting: PREDICT product.units_sold(store, next_week). Same model, same database, different task. No new feature engineering.
The operational difference is profound. A data science team that delivers 4 predictive models per year (limited by feature engineering time) can evaluate dozens of prediction tasks per week. The bottleneck shifts from "can we build this model" to "which predictions create the most business value."
When to use which approach
The traditional workflow is not obsolete. If your data genuinely fits in a single table (sensor readings, pre-aggregated feature stores, clean data warehouse exports), XGBoost remains an excellent choice. It is fast, well-understood, and interpretable with SHAP values.
The foundation model approach becomes necessary when: your data spans multiple relational tables, you need predictions across many different tasks, temporal patterns matter, or your team cannot afford months of feature engineering per model. For most enterprise use cases, this describes the actual situation.
DoorDash applied this approach to their recommendation system and saw a 1.8% engagement lift across 30 million users. Databricks used it for lead scoring and achieved a 5.4x conversion lift. These results came from patterns in the relational structure that flat-table models could not access, regardless of how much feature engineering was applied.
The predictive modeling field spent a decade optimizing the algorithm (XGBoost, LightGBM, CatBoost, TabNet). The next decade will be about eliminating the data preparation step that makes the algorithm choice almost irrelevant. The model matters less than the data it can see. Foundation models see everything.