Over 80% of enterprise data is structured: tables of numbers, dates, categories, and foreign keys stored in relational databases. Yet most ML breakthroughs focus on unstructured data: text (LLMs), images (diffusion models), and code (copilots). These 15 questions address the practical challenges of applying ML to the structured data where most business value actually lives.
1. What is structured data in ML?
Structured data has a defined schema: rows represent entities, columns represent attributes, and every value has a known data type. Customer records, transaction logs, product catalogs, event tables, sensor readings with timestamps. It lives in relational databases (PostgreSQL, MySQL), data warehouses (Snowflake, BigQuery, Redshift), and spreadsheets.
The distinction from unstructured data matters because ML models process them differently. LLMs tokenize text. CNNs process pixel grids. For structured data, the right approach depends on whether your data is a single flat table or a set of interconnected tables.
2. Why is ML on structured data harder than it sounds?
Because enterprise structured data is rarely a single table. It is a relational database with 10 to 50 interconnected tables, while classical tabular models need one flat table as input. Converting relational data into that flat input is feature engineering, and it consumes roughly 80% of data science time.
The Stanford RelBench study quantified this: 12.3 hours and 878 lines of code per prediction task. That cost compounds: a company that needs dozens of prediction tasks spends months on feature engineering alone.
3. Is XGBoost still the best?
For single flat tables, yes. XGBoost and LightGBM remain the top-performing models on standard tabular benchmarks. A 2024 meta-analysis across 170 datasets confirmed that gradient-boosted trees match or beat neural networks on single-table tasks, while training 10x faster.
But enterprise data is not a single flat table. On RelBench, where data spans multiple connected tables, XGBoost with manually engineered features achieves 62.44 average AUROC. A GNN that operates directly on the relational structure achieves 75.83. KumoRFM zero-shot achieves 76.71. The 14-point gap is not about XGBoost being weak; it is about the feature engineering bottleneck limiting what XGBoost can see.
Single-table ML (XGBoost territory)
- One flat table, pre-engineered features
- XGBoost/LightGBM are top performers
- Fast training (minutes to hours)
- Well-understood interpretability tools (SHAP)
- Limited to features humans can engineer
Multi-table ML (GNN/FM territory)
- Multiple connected tables, raw relational data
- GNNs and foundation models outperform by 13+ AUROC points
- Eliminates feature engineering (80% of time)
- Captures multi-hop and temporal patterns automatically
- Explores full feature space, not human-limited subset
4. Tabular data vs. relational data
Tabular data is one table. Relational data is a connected set of tables. This distinction determines which ML approach works best.
A customer table with 20 columns is tabular. Feed it to XGBoost. A customer table linked to orders, linked to products, linked to reviews, linked to other customers is relational. It needs an approach that understands the connections: a GNN or a relational foundation model. Most enterprise ML problems are relational, but most teams treat them as tabular by flattening the structure first.
tabular_data_example (single table)
| customer_id | age | income | tenure_months | churned |
|---|---|---|---|---|
| C-001 | 34 | $72,000 | 18 | No |
| C-002 | 51 | $120,000 | 42 | No |
| C-003 | 28 | $45,000 | 6 | Yes |
Tabular: one row per customer, all features in one table. XGBoost works well here.
relational_data_example (connected tables)
| Table | Sample Row | Connects To |
|---|---|---|
| customers | C-001, age=34, income=$72K | orders, support_tickets |
| orders | O-501, customer=C-001, product=P-88, $142 | customers, products |
| products | P-88, category=Electronics, return_rate=12% | orders, reviews |
| support_tickets | T-201, customer=C-001, status=escalated | customers |
| reviews | R-901, product=P-88, customer=C-003, rating=1.5 | products, customers |
Relational: 5 connected tables. The signal that C-001 bought a product that C-003 (who churned) rated 1.5 stars requires traversing 3 tables. Flattening into one row loses this.
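What flattening looks like in practice: a minimal pandas sketch, using toy versions of the tables above, that pulls one two-hop signal into the customer row. The table and column names mirror the example; every feature like this must be imagined, coded, and maintained by hand.

```python
import pandas as pd

# Toy versions of the connected tables from the example above.
customers = pd.DataFrame({"customer_id": ["C-001", "C-002", "C-003"],
                          "age": [34, 51, 28]})
orders = pd.DataFrame({"order_id": ["O-501"], "customer_id": ["C-001"],
                       "product_id": ["P-88"], "amount": [142.0]})
reviews = pd.DataFrame({"review_id": ["R-901"], "product_id": ["P-88"],
                        "customer_id": ["C-003"], "rating": [1.5]})

# Manual feature engineering: flatten the relational structure into one row
# per customer. A two-hop feature -- "lowest rating any reviewer gave to a
# product this customer bought" -- already needs two joins and a groupby.
bought_ratings = (orders.merge(reviews, on="product_id",
                               suffixes=("", "_reviewer"))
                        .groupby("customer_id", as_index=False)["rating"].min()
                        .rename(columns={"rating": "min_rating_on_purchased"}))
flat = customers.merge(bought_ratings, on="customer_id", how="left")
```

Here C-001 gets `min_rating_on_purchased = 1.5` (the 1-star review from churned C-003), while customers with no purchases get NaN, another decision the engineer must handle by hand.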
5. Handling missing values
Missing values are universal in enterprise data. Practical approaches:
- Use models that handle missingness natively. XGBoost and LightGBM learn optimal split directions for missing values. No imputation needed.
- Simple imputation for other models: median for numerical columns, mode for categorical. More sophisticated: KNN imputation or iterative imputation (MICE).
- Missingness indicators. Add a binary column flagging whether each value was originally missing. Missingness itself can be informative: a customer who did not provide their phone number behaves differently from one who did.
missing_value_handling_example
| customer_id | income | phone | income_imputed | phone_missing_flag | churn_rate |
|---|---|---|---|---|---|
| C-001 | $72,000 | 555-0101 | $72,000 | 0 | 8% |
| C-002 | NULL | 555-0202 | $68,000 (median) | 0 | 12% |
| C-003 | $45,000 | NULL | $45,000 | 1 | 24% |
| C-004 | NULL | NULL | $68,000 (median) | 1 | 31% |
Customers who did not provide a phone number churn at roughly 3x the rate (24% and 31% vs. 8% and 12%). The missingness indicator (phone_missing_flag) captures a signal that imputation alone would miss.
6. Deep learning on tabular data?
On single flat tables, deep learning offers no consistent advantage over gradient-boosted trees. Tab-Transformer, FT-Transformer, and similar architectures match XGBoost on some datasets and underperform on others, while being harder to tune and slower to train.
On multi-table relational data, deep learning wins decisively. GNNs achieve 13+ AUROC points over XGBoost on RelBench. The advantage comes not from deeper models on flat data but from architectures (message passing, graph transformers) that operate on the relational structure directly.
7. Data preparation for structured ML
For flat tables, the standard pipeline:
- Handle missing values (see question 5)
- Encode categorical variables (one-hot for fewer than 20 categories, target encoding for more)
- Normalize numerical features (for neural networks; not needed for tree models)
- Remove or flag data leakage (features derived from the target, future data)
- Split by time for temporal data (never random splits)
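The flat-table pipeline above can be sketched with scikit-learn; the column names, cutoff date, and toy data are illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Toy event data with a timestamp.
df = pd.DataFrame({
    "event_time": pd.to_datetime(["2024-01-05", "2024-02-10", "2024-03-15",
                                  "2024-04-20", "2024-05-25", "2024-06-30"]),
    "plan": ["basic", "pro", "basic", "pro", "basic", "pro"],
    "usage": [10.0, 52.0, 8.0, 61.0, 12.0, 70.0],
    "churned": [1, 0, 1, 0, 1, 0],
})

# Temporal split: train on the past, test on the future -- never random.
cutoff = pd.Timestamp("2024-05-01")
train, test = df[df.event_time < cutoff], df[df.event_time >= cutoff]

# Encoding and scaling live inside a pipeline, so they are fit on the
# training data only (no preprocessing leakage).
prep = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
    ("num", StandardScaler(), ["usage"]),
])
clf = Pipeline([("prep", prep), ("model", LogisticRegression())])
clf.fit(train[["plan", "usage"]], train["churned"])
preds = clf.predict(test[["plan", "usage"]])
```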
For relational data using a foundation model: connect your database. The model handles schema interpretation, encoding, and feature learning automatically.
8. The multi-table challenge
The combinatorial explosion of possible features across multiple tables is the central bottleneck. With 10 tables, each with 10 relevant columns, 6 aggregation functions, and 4 time windows, each direct join path yields 10 × 6 × 4 = 240 candidate features, over 2,400 across the direct joins alone. Add multi-hop joins (3 to 4 tables deep), and the space exceeds 100,000 possible features.
Humans explore fewer than 5% of this space. The features they miss are precisely the multi-hop and temporal patterns that carry the strongest predictive signal.
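The arithmetic behind those numbers, as a back-of-the-envelope calculation:

```python
# Size of the manual feature space described above.
tables, columns, agg_fns, windows = 10, 10, 6, 4

per_join_path = columns * agg_fns * windows   # 240 candidates per direct join
direct_paths = tables * per_join_path         # 2,400 across the direct joins

# Each additional hop multiplies the space by the number of newly reachable
# join paths, which is how 3- to 4-table-deep joins push the total past
# 100,000 candidate features.
```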
9. High-cardinality categoricals
A product catalog with 500,000 SKUs or a user base with 10 million IDs cannot be one-hot encoded (too many columns) or dropped (too much signal). Solutions:
- Target encoding: replace each category with the smoothed mean of the target for that category, using cross-validation to prevent overfitting
- Learned embeddings: neural network embedding layers that map each category to a dense vector, letting the model learn similarity between categories
- Frequency encoding: replace each category with its occurrence count, capturing the "rare vs. common" signal
GNNs and foundation models handle high-cardinality features naturally because each entity (product, user) is a node with a learned embedding.
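A minimal sketch of frequency encoding and smoothed target encoding in pandas. The smoothing strength is an illustrative choice, and in practice the target encoding should be computed out-of-fold, as noted above.

```python
import pandas as pd

df = pd.DataFrame({
    "sku":  ["A", "A", "B", "B", "B", "C"],
    "sold": [1,   0,   1,   1,   0,   1],
})

# Frequency encoding: how common is each category?
df["sku_freq"] = df["sku"].map(df["sku"].value_counts())

# Smoothed target encoding: blend the per-category mean toward the global
# mean, so rare categories (like "C" with one row) are not taken at face
# value.
global_mean = df["sold"].mean()
stats = df.groupby("sku")["sold"].agg(["mean", "count"])
smoothing = 2.0  # illustrative prior strength
encoded = ((stats["mean"] * stats["count"] + global_mean * smoothing)
           / (stats["count"] + smoothing))
df["sku_target_enc"] = df["sku"].map(encoded)
```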
10. How much training data?
A common rule of thumb for the minimum: 10 labeled examples per feature, so a model with 200 features needs at least 2,000 labeled rows. Reliable performance typically requires 10,000+. For gradient-boosted trees, the sweet spot is 10,000 to 100,000 rows.
Foundation models reduce this requirement. Zero-shot predictions require no labeled data (the model uses pre-trained patterns). Fine-tuning works well with 10,000+ examples, which is 10x to 100x less than training from scratch.
11. Preventing data leakage
Data leakage is the silent killer of ML projects. It makes models look accurate in evaluation and fail in production. Three forms:
- Target leakage: a feature computed from the target. Example: "days until churn" used to predict churn.
- Temporal leakage: future information in training data. Example: using January data to predict December outcomes when training on a random split.
- Preprocessing leakage: fitting scalers or encoders on the full dataset including test data.
Prevention: always use temporal splits, audit every feature's data provenance, and apply all preprocessing within cross-validation folds.
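A sketch of leakage-safe evaluation with scikit-learn: preprocessing lives inside the pipeline, and TimeSeriesSplit enforces temporal ordering. The synthetic data stands in for rows already sorted by time.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic rows, assumed sorted by time; labels are illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

# The scaler lives inside the pipeline, so each CV fold fits it on that
# fold's training portion only -- no preprocessing leakage. TimeSeriesSplit
# guarantees the test fold is always later than the training fold.
pipe = Pipeline([("scale", StandardScaler()),
                 ("model", LogisticRegression())])
scores = cross_val_score(pipe, X, y, cv=TimeSeriesSplit(n_splits=4))
```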
12. Normalization for tree models?
No. Gradient-boosted trees split on thresholds and are invariant to monotonic transformations. Normalizing, standardizing, or log-transforming features does not change a tree model's decisions. This is one reason XGBoost dominates tabular benchmarks: less preprocessing, fewer decisions, fewer opportunities for error.
For neural networks on tabular data, normalization matters. Batch normalization or layer normalization is standard in architectures like Tab-Transformer and FT-Transformer.
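A quick demonstration of that invariance on synthetic positive-valued features: a tree trained on raw data and one trained on log-transformed data make identical predictions, because the monotonic log transform preserves every threshold split's partition of the samples.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(1.0, 100.0, size=(200, 3))  # positive, so log is defined
y = (X[:, 0] > 50).astype(int)

# Fit one tree on raw features and one on log-transformed features.
# Every threshold split on X has an exact counterpart on log(X), so the
# two trees induce the same partition and the same predictions.
raw = DecisionTreeClassifier(random_state=0).fit(X, y)
logged = DecisionTreeClassifier(random_state=0).fit(np.log(X), y)
assert (raw.predict(X) == logged.predict(np.log(X))).all()
```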
13. Feature selection strategies
When you have hundreds of candidate features from manual engineering:
- Train a gradient-boosted model and use gain-based or permutation importance to rank features. Drop anything below a threshold.
- Use recursive feature elimination: iteratively remove the least important feature, retrain, and measure if performance drops.
- Apply L1 regularization (lasso) to linear models for automatic sparsity.
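The first strategy can be sketched with scikit-learn's permutation_importance on synthetic data where only three of ten features carry signal; the 0.01 threshold is an illustrative choice.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only the first 3 of 10 features carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X[:, 0] + X[:, 1] - X[:, 2] > 0).astype(int)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# Permutation importance on held-out data: shuffle one column at a time
# and measure the drop in score. Features whose shuffling barely hurts
# the model are candidates to drop.
result = permutation_importance(model, X_val, y_val, n_repeats=10,
                                random_state=0)
keep = [i for i in range(10) if result.importances_mean[i] > 0.01]
```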
Or skip the entire process. Foundation models learn which raw signals matter directly from relational data, making manual feature selection unnecessary.
14. Can LLMs work on structured data?
Poorly. Serializing tables as CSV or JSON and feeding them to an LLM treats structured data as text. On RelBench, Llama 3.2 3B achieved 68.06 AUROC vs. 76.71 for KumoRFM zero-shot. The 8.6-point gap reflects a fundamental mismatch: LLMs process tokens, not schema-aware numerical relationships.
15. The future of ML on structured data
Three shifts are underway. First, the industry is recognizing that enterprise data is relational, not tabular, and building tools accordingly. Second, foundation models are eliminating the feature engineering bottleneck that has held back enterprise ML for a decade. Third, the per-task cost of ML predictions is dropping toward zero, enabling companies to run hundreds of predictions that were previously too expensive to build.