Over 80% of enterprise data is structured: tables of numbers, dates, categories, and foreign keys stored in relational databases. Yet most ML breakthroughs focus on unstructured data: text (LLMs), images (diffusion models), and code (copilots). These 15 questions address the practical challenges of applying ML to the structured data where most business value actually lives.
1. What is structured data in ML?
Structured data has a defined schema: rows represent entities, columns represent attributes, and every value has a known data type. Customer records, transaction logs, product catalogs, event tables, sensor readings with timestamps. It lives in relational databases (PostgreSQL, MySQL), data warehouses (Snowflake, BigQuery, Redshift), and spreadsheets.
The distinction from unstructured data matters because ML models process them differently. LLMs tokenize text. CNNs process pixel grids. For structured data, the right approach depends on whether your data is a single flat table or a set of interconnected tables.
2. Why is ML on structured data harder than it sounds?
Because enterprise structured data is rarely a single table. It is a relational database with 10 to 50 interconnected tables, while classical tabular models need one flat table as input. Converting relational data into that flat input is feature engineering, and it consumes roughly 80% of data science time.
The Stanford RelBench study quantified this: 12.3 hours and 878 lines of code per prediction task. That cost compounds: a company that needs dozens of prediction tasks spends months on feature engineering alone.
3. Is XGBoost still the best?
For single flat tables, yes. XGBoost and LightGBM remain the top-performing models on standard tabular benchmarks. A 2024 meta-analysis across 170 datasets confirmed that gradient-boosted trees match or beat neural networks on single-table tasks, while training 10x faster.
But enterprise data is not a single flat table. On RelBench, where data spans multiple connected tables, XGBoost with manually engineered features achieves 62.44 average AUROC. A GNN that operates directly on the relational structure achieves 75.83. KumoRFM zero-shot achieves 76.71. The 14-point gap is not about XGBoost being weak; it is about the feature engineering bottleneck limiting what XGBoost can see.
Single-table ML (XGBoost territory)
- One flat table, pre-engineered features
- XGBoost/LightGBM are top performers
- Fast training (minutes to hours)
- Well-understood interpretability tools (SHAP)
- Limited to features humans can engineer
Multi-table ML (GNN/FM territory)
- Multiple connected tables, raw relational data
- GNNs and foundation models outperform by 13+ AUROC points
- Eliminates feature engineering (80% of time)
- Captures multi-hop and temporal patterns automatically
- Explores full feature space, not human-limited subset
4. Tabular data vs. relational data
Tabular data is one table. Relational data is a connected set of tables. This distinction determines which ML approach works best.
A customer table with 20 columns is tabular. Feed it to XGBoost. A customer table linked to orders, linked to products, linked to reviews, linked to other customers is relational. It needs an approach that understands the connections: a GNN or a relational foundation model. Most enterprise ML problems are relational, but most teams treat them as tabular by flattening the structure first.
tabular_data_example (single table)
| customer_id | age | income | tenure_months | churned |
|---|---|---|---|---|
| C-001 | 34 | $72,000 | 18 | No |
| C-002 | 51 | $120,000 | 42 | No |
| C-003 | 28 | $45,000 | 6 | Yes |
Tabular: one row per customer, all features in one table. XGBoost works well here.
relational_data_example (connected tables)
| Table | Sample Row | Connects To |
|---|---|---|
| customers | C-001, age=34, income=$72K | orders, support_tickets |
| orders | O-501, customer=C-001, product=P-88, $142 | customers, products |
| products | P-88, category=Electronics, return_rate=12% | orders, reviews |
| support_tickets | T-201, customer=C-001, status=escalated | customers |
| reviews | R-901, product=P-88, customer=C-003, rating=1.5 | products, customers |
Relational: 5 connected tables. The signal that C-001 bought a product that C-003 (who churned) rated 1.5 stars requires traversing 3 tables. Flattening into one row loses this.
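What flattening looks like in practice: a minimal pandas sketch, using toy versions of the tables above, that pulls one two-hop signal into the customer row. The table and column names mirror the example; every feature like this must be imagined, coded, and maintained by hand.

```python
import pandas as pd

# Toy versions of the connected tables from the example above.
customers = pd.DataFrame({"customer_id": ["C-001", "C-002", "C-003"],
                          "age": [34, 51, 28]})
orders = pd.DataFrame({"order_id": ["O-501"], "customer_id": ["C-001"],
                       "product_id": ["P-88"], "amount": [142.0]})
reviews = pd.DataFrame({"review_id": ["R-901"], "product_id": ["P-88"],
                        "customer_id": ["C-003"], "rating": [1.5]})

# Manual feature engineering: flatten the relational structure into one row
# per customer. A two-hop feature -- "lowest rating any reviewer gave to a
# product this customer bought" -- already needs two joins and a groupby.
bought_ratings = (orders.merge(reviews, on="product_id",
                               suffixes=("", "_reviewer"))
                        .groupby("customer_id", as_index=False)["rating"].min()
                        .rename(columns={"rating": "min_rating_on_purchased"}))
flat = customers.merge(bought_ratings, on="customer_id", how="left")
```

Here C-001 gets `min_rating_on_purchased = 1.5` (the 1-star review from churned C-003), while customers with no purchases get NaN, another decision the engineer must handle by hand.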
5. Handling missing values
Missing values are universal in enterprise data. Practical approaches:
- Use models that handle missingness natively. XGBoost and LightGBM learn optimal split directions for missing values. No imputation needed.
- Simple imputation for other models: median for numerical columns, mode for categorical. More sophisticated: KNN imputation or iterative imputation (MICE).
- Missingness indicators. Add a binary column flagging whether each value was originally missing. Missingness itself can be informative: a customer who did not provide their phone number behaves differently from one who did.
missing_value_handling_example
| customer_id | income | phone | income_imputed | phone_missing_flag | churn_rate |
|---|---|---|---|---|---|
| C-001 | $72,000 | 555-0101 | $72,000 | 0 | 8% |
| C-002 | NULL | 555-0202 | $68,000 (median) | 0 | 12% |
| C-003 | $45,000 | NULL | $45,000 | 1 | 24% |
| C-004 | NULL | NULL | $68,000 (median) | 1 | 31% |
Customers who did not provide a phone number churn at roughly 3x the rate (24% and 31% vs. 8% and 12%). The missingness indicator (phone_missing_flag) captures a signal that imputation alone would miss.
6. Deep learning on tabular data?
On single flat tables, deep learning offers no consistent advantage over gradient-boosted trees. Tab-Transformer, FT-Transformer, and similar architectures match XGBoost on some datasets and underperform on others, while being harder to tune and slower to train.
On multi-table relational data, deep learning wins decisively. GNNs achieve 13+ AUROC points over XGBoost on RelBench. The advantage comes not from deeper models on flat data but from architectures (message passing, graph transformers) that operate on the relational structure directly.
7. Data preparation for structured ML
For flat tables, the standard pipeline:
- Handle missing values (see question 5)
- Encode categorical variables (one-hot for fewer than 20 categories, target encoding for more)
- Normalize numerical features (for neural networks; not needed for tree models)
- Remove or flag data leakage (features derived from the target, future data)
- Split by time for temporal data (never random splits)
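The flat-table pipeline above can be sketched with scikit-learn; the column names, cutoff date, and toy data are illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Toy event data with a timestamp.
df = pd.DataFrame({
    "event_time": pd.to_datetime(["2024-01-05", "2024-02-10", "2024-03-15",
                                  "2024-04-20", "2024-05-25", "2024-06-30"]),
    "plan": ["basic", "pro", "basic", "pro", "basic", "pro"],
    "usage": [10.0, 52.0, 8.0, 61.0, 12.0, 70.0],
    "churned": [1, 0, 1, 0, 1, 0],
})

# Temporal split: train on the past, test on the future -- never random.
cutoff = pd.Timestamp("2024-05-01")
train, test = df[df.event_time < cutoff], df[df.event_time >= cutoff]

# Encoding and scaling live inside a pipeline, so they are fit on the
# training data only (no preprocessing leakage).
prep = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
    ("num", StandardScaler(), ["usage"]),
])
clf = Pipeline([("prep", prep), ("model", LogisticRegression())])
clf.fit(train[["plan", "usage"]], train["churned"])
preds = clf.predict(test[["plan", "usage"]])
```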
For relational data using a foundation model: connect your database. The model handles schema interpretation, encoding, and feature learning automatically.
8. The multi-table challenge
The combinatorial explosion of possible features across multiple tables is the central bottleneck. With 10 tables, each with 10 relevant columns, 6 aggregation functions, and 4 time windows, each direct join path yields 10 × 6 × 4 = 240 candidate features, over 2,400 across the direct joins alone. Add multi-hop joins (3 to 4 tables deep), and the space exceeds 100,000 possible features.
Humans explore fewer than 5% of this space. The features they miss are precisely the multi-hop and temporal patterns that carry the strongest predictive signal.
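The arithmetic behind those numbers, as a back-of-the-envelope calculation:

```python
# Size of the manual feature space described above.
tables, columns, agg_fns, windows = 10, 10, 6, 4

per_join_path = columns * agg_fns * windows   # 240 candidates per direct join
direct_paths = tables * per_join_path         # 2,400 across the direct joins

# Each additional hop multiplies the space by the number of newly reachable
# join paths, which is how 3- to 4-table-deep joins push the total past
# 100,000 candidate features.
```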
9. High-cardinality categoricals
A product catalog with 500,000 SKUs or a user base with 10 million IDs cannot be one-hot encoded (too many columns) or dropped (too much signal). Solutions:
- Target encoding: replace each category with the smoothed mean of the target for that category, using cross-validation to prevent overfitting
- Learned embeddings: neural network embedding layers that map each category to a dense vector, letting the model learn similarity between categories
- Frequency encoding: replace each category with its occurrence count, capturing the "rare vs. common" signal
GNNs and foundation models handle high-cardinality features naturally because each entity (product, user) is a node with a learned embedding.
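A minimal sketch of frequency encoding and smoothed target encoding in pandas. The smoothing strength is an illustrative choice, and in practice the target encoding should be computed out-of-fold, as noted above.

```python
import pandas as pd

df = pd.DataFrame({
    "sku":  ["A", "A", "B", "B", "B", "C"],
    "sold": [1,   0,   1,   1,   0,   1],
})

# Frequency encoding: how common is each category?
df["sku_freq"] = df["sku"].map(df["sku"].value_counts())

# Smoothed target encoding: blend the per-category mean toward the global
# mean, so rare categories (like "C" with one row) are not taken at face
# value.
global_mean = df["sold"].mean()
stats = df.groupby("sku")["sold"].agg(["mean", "count"])
smoothing = 2.0  # illustrative prior strength
encoded = ((stats["mean"] * stats["count"] + global_mean * smoothing)
           / (stats["count"] + smoothing))
df["sku_target_enc"] = df["sku"].map(encoded)
```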
10. How much training data?
A common rule of thumb for the minimum: 10 labeled examples per feature, so a model with 200 features needs at least 2,000 labeled rows. Reliable performance typically requires 10,000+. For gradient-boosted trees, the sweet spot is 10,000 to 100,000 rows.
Foundation models reduce this requirement. Zero-shot predictions require no labeled data (the model uses pre-trained patterns). Fine-tuning works well with 10,000+ examples, which is 10x to 100x less than training from scratch.
11. Preventing data leakage
Data leakage is the silent killer of ML projects. It makes models look accurate in evaluation and fail in production. Three forms:
- Target leakage: a feature computed from the target. Example: "days until churn" used to predict churn.
- Temporal leakage: future information in training data. Example: using January data to predict December outcomes when training on a random split.
- Preprocessing leakage: fitting scalers or encoders on the full dataset including test data.
Prevention: always use temporal splits, audit every feature's data provenance, and apply all preprocessing within cross-validation folds.
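A sketch of leakage-safe evaluation with scikit-learn: preprocessing lives inside the pipeline, and TimeSeriesSplit enforces temporal ordering. The synthetic data stands in for rows already sorted by time.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic rows, assumed sorted by time; labels are illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

# The scaler lives inside the pipeline, so each CV fold fits it on that
# fold's training portion only -- no preprocessing leakage. TimeSeriesSplit
# guarantees the test fold is always later than the training fold.
pipe = Pipeline([("scale", StandardScaler()),
                 ("model", LogisticRegression())])
scores = cross_val_score(pipe, X, y, cv=TimeSeriesSplit(n_splits=4))
```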
12. Normalization for tree models?
No. Gradient-boosted trees split on thresholds and are invariant to monotonic transformations. Normalizing, standardizing, or log-transforming features does not change a tree model's decisions. This is one reason XGBoost dominates tabular benchmarks: less preprocessing, fewer decisions, fewer opportunities for error.
For neural networks on tabular data, normalization matters. Batch normalization or layer normalization is standard in architectures like Tab-Transformer and FT-Transformer.
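A quick demonstration of that invariance on synthetic positive-valued features: a tree trained on raw data and one trained on log-transformed data make identical predictions, because the monotonic log transform preserves every threshold split's partition of the samples.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(1.0, 100.0, size=(200, 3))  # positive, so log is defined
y = (X[:, 0] > 50).astype(int)

# Fit one tree on raw features and one on log-transformed features.
# Every threshold split on X has an exact counterpart on log(X), so the
# two trees induce the same partition and the same predictions.
raw = DecisionTreeClassifier(random_state=0).fit(X, y)
logged = DecisionTreeClassifier(random_state=0).fit(np.log(X), y)
assert (raw.predict(X) == logged.predict(np.log(X))).all()
```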
13. Feature selection strategies
When you have hundreds of candidate features from manual engineering:
- Train a gradient-boosted model and use gain-based or permutation importance to rank features. Drop anything below a threshold.
- Use recursive feature elimination: iteratively remove the least important feature, retrain, and measure if performance drops.
- Apply L1 regularization (lasso) to linear models for automatic sparsity.
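The first strategy can be sketched with scikit-learn's permutation_importance on synthetic data where only three of ten features carry signal; the 0.01 threshold is an illustrative choice.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only the first 3 of 10 features carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X[:, 0] + X[:, 1] - X[:, 2] > 0).astype(int)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# Permutation importance on held-out data: shuffle one column at a time
# and measure the drop in score. Features whose shuffling barely hurts
# the model are candidates to drop.
result = permutation_importance(model, X_val, y_val, n_repeats=10,
                                random_state=0)
keep = [i for i in range(10) if result.importances_mean[i] > 0.01]
```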
Or skip the entire process. Foundation models learn which raw signals matter directly from relational data, making manual feature selection unnecessary.
14. Can LLMs work on structured data?
Poorly. Serializing tables as CSV or JSON and feeding them to an LLM treats structured data as text. On RelBench, Llama 3.2 3B achieved 68.06 AUROC vs. 76.71 for KumoRFM zero-shot. The 8.6-point gap reflects a fundamental mismatch: LLMs process tokens, not schema-aware numerical relationships.
15. The future of ML on structured data
Three shifts are underway. First, the industry is recognizing that enterprise data is relational, not tabular, and building tools accordingly. Second, foundation models are eliminating the feature engineering bottleneck that has held back enterprise ML for a decade. Third, the per-task cost of ML predictions is dropping toward zero, enabling companies to run hundreds of predictions that were previously too expensive to build.