Enterprise ML has a fragmentation problem. Five distinct approaches compete for the same budget, and each vendor claims theirs is best. The truth is that each approach has a genuine sweet spot and genuine limitations. This comparison strips away the marketing and evaluates all five on the dimensions that matter: accuracy, time to value, team requirements, cost per prediction task, and which data structures they handle.
All accuracy numbers come from the RelBench benchmark (7 databases, 30 tasks, 103M+ rows, temporal splits). This is the only benchmark designed for multi-table relational data with proper temporal evaluation.
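Temporal splitting is what makes this evaluation honest: the model trains only on events before a cutoff and is scored on what happens after, so future information never leaks into features. A minimal sketch of the split logic, with an illustrative event log:

```python
from datetime import date

# Toy event log: (customer_id, event_date, label); values are illustrative.
events = [
    (1, date(2023, 1, 5), 0),
    (1, date(2023, 6, 10), 1),
    (2, date(2023, 3, 1), 0),
    (2, date(2023, 9, 20), 1),
    (3, date(2023, 11, 2), 1),
]

TRAIN_CUTOFF = date(2023, 7, 1)
VAL_CUTOFF = date(2023, 10, 1)

# Train only on the past, validate on the near future, test on the far future.
train = [e for e in events if e[1] < TRAIN_CUTOFF]
val = [e for e in events if TRAIN_CUTOFF <= e[1] < VAL_CUTOFF]
test = [e for e in events if e[1] >= VAL_CUTOFF]
```

A random split would shuffle all five events together and let the model "see" late-2023 behavior while predicting it, inflating every accuracy number in the tables below.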
Five approaches head-to-head
| Metric | Manual ML | AutoML | LLMs on Tables | Custom GNN | Foundation Model |
|---|---|---|---|---|---|
| AUROC (RelBench) | 62.44 | ~63-65 | 68.06 | 75.83 | 76.71 zero-shot / 81.14 fine-tuned |
| Time to 1st Prediction | 3-6 months | 1-3 months | Hours | 3-6 months | Minutes |
| Cost for 10 Models | $1.5M-5M | $500K-2M | $200K-500K | $1M-3M | $100K-300K |
| Feature Engineering | 100% manual | 100% manual | None (serialized) | None (learned) | None (learned) |
| Team Required | 2-3 data scientists | 1 data scientist | 1 ML engineer | 2-3 GNN specialists | SQL-literate analyst |
| Multi-table Support | Manual joins | Manual joins | Serialization | Native graph | Native graph |
| Cold-start Entities | No | No | Limited | Yes | Yes |
| Marginal Cost/Task | $150K-500K | $100K-300K | $50K-100K | $50K-200K | Near-zero |
Foundation models lead on accuracy, speed, and cost simultaneously.
Approach 1: Manual ML pipelines
A team of data scientists writes SQL to engineer features from your relational database, builds a flat feature table, trains a gradient-boosted model (XGBoost, LightGBM), and deploys it through a serving layer.
How it works
The data scientist studies the database schema, writes SQL joins across relevant tables, computes aggregate features (count, sum, average, max, min across time windows), trains a model on the flat output, tunes hyperparameters, validates with temporal splits, and deploys. Each new prediction task repeats this cycle.
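The aggregate-feature step above is the expensive part. A minimal sketch of one such hand-engineered feature, the kind a data scientist would otherwise write as a SQL GROUP BY over a time window (table schema and values are illustrative):

```python
from datetime import date, timedelta

# Toy orders table; schema and values are illustrative.
orders = [
    {"customer_id": "C1", "order_date": date(2024, 5, 1), "amount": 120.0},
    {"customer_id": "C1", "order_date": date(2024, 5, 20), "amount": 80.0},
    {"customer_id": "C1", "order_date": date(2024, 3, 2), "amount": 200.0},
    {"customer_id": "C2", "order_date": date(2024, 5, 28), "amount": 45.0},
]

def window_features(customer_id, as_of, days=30):
    """Hand-engineered aggregates over a trailing time window."""
    window = [
        o["amount"] for o in orders
        if o["customer_id"] == customer_id
        and as_of - timedelta(days=days) <= o["order_date"] < as_of
    ]
    return {
        "order_count_30d": len(window),
        "spend_sum_30d": sum(window),
        "spend_avg_30d": sum(window) / len(window) if window else 0.0,
        "spend_max_30d": max(window, default=0.0),
    }

# One row of the flat feature table that then feeds XGBoost or LightGBM.
feats = window_features("C1", as_of=date(2024, 6, 1))
```

Every new prediction task means inventing, writing, and validating dozens of functions like this one, which is why the per-model cost stays high.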
Accuracy
On RelBench, LightGBM with features engineered by a Stanford-trained data scientist achieves 62.44 average AUROC on classification tasks. This is the best-effort result with unlimited time, full domain knowledge, and experienced practitioners. The accuracy ceiling is set by what the human can engineer, not by the model's capacity.
Time and cost
3 to 6 months per prediction model. Team of 2 to 3 data scientists at $200K to $300K fully loaded cost each. Per-model cost: $150K to $500K including infrastructure and opportunity cost. For 10 models: $1.5M to $5M over 2 to 3 years.
When it makes sense
- Single-table data where feature engineering is minimal
- Highly regulated domains requiring full feature transparency
- Established teams with deep domain expertise in the specific prediction
- One or two high-value models that justify the investment
When it breaks down
- Multi-table data requiring complex cross-table features
- More than 5 prediction tasks (cost scales linearly with tasks)
- Cold-start entities with no historical features
- Teams that cannot hire or retain data scientists
What each approach sees for customer C-482
| Signal | Manual ML | AutoML | LLM | GNN | Foundation Model |
|---|---|---|---|---|---|
| Own attributes (age, balance) | Yes | Yes | Yes | Yes | Yes |
| 30-day order count = 5 | Yes | Yes | Yes | Yes | Yes |
| Orders declining (5,4,3,2,1) | No (aggregated) | No (aggregated) | Partial | Yes | Yes |
| Bought same products as churners | No (3-hop) | No (3-hop) | No | Yes | Yes |
| Support agent has low resolution rate | No (2-hop) | No (2-hop) | No | Yes | Yes |
| Prediction accuracy (AUROC) | 62.44 | ~63-65 | 68.06 | 75.83 | 76.71 |
The last three signal rows are captured only by graph-based approaches. The 14-point AUROC gap between manual ML and foundation models comes from these multi-hop and temporal patterns, which flat feature tables never see.
Approach 2: AutoML platforms
Upload a flat feature table to an AutoML platform. The platform automatically tests hundreds of model architectures, tunes hyperparameters, selects features, and produces a deployable model.
How it works
You prepare a flat feature table (this step is still manual). The platform runs automated experiments: trying logistic regression, random forests, gradient-boosted trees, neural networks, and ensembles. It selects the best model based on cross-validation performance and provides a deployment endpoint.
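What the platform automates can be sketched in a few lines: given a flat feature table, try several model families and keep the one with the best cross-validated score. This illustrative sketch uses scikit-learn with a synthetic stand-in for the manually engineered table:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# A synthetic stand-in for the manually engineered flat feature table.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# The part AutoML automates: evaluating candidate model families
# and selecting the best by cross-validated AUROC.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}
scores = {
    name: cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()
    for name, model in candidates.items()
}
best = max(scores, key=scores.get)
```

Note what is missing: nothing in this loop touches the feature table itself. If the features omit a signal, no amount of model search recovers it.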
Accuracy
On single flat tables, AutoML matches expert-tuned models within 1 to 2% accuracy. The platform optimizes the last 20% of the pipeline (model selection, hyperparameters) effectively. But on multi-table relational data, accuracy is capped by the quality of the input feature table, which is still manually engineered. Expected RelBench-equivalent: roughly 63 to 65 AUROC with the same manual features (marginally better model selection does not overcome feature engineering limitations).
Time and cost
Feature engineering: still 4 to 8 weeks per task. Model building: reduced from weeks to hours. Per-model cost: $100K to $300K (feature engineering dominates). Platform license: $50K to $200K per year. For 10 models: $500K to $2M.
When it makes sense
- Team has feature engineering capacity but limited modeling expertise
- Multiple similar prediction tasks on the same feature table
- Need to quickly iterate on model selection and tuning
- Compliance requires model comparison documentation
When it breaks down
- Feature engineering is the bottleneck (AutoML does not help)
- Multi-table data requiring new features for each task
- Tasks where feature quality, not model choice, limits accuracy
Manual ML pipelines
- Full control over every decision
- 62.44 AUROC on RelBench (feature-limited)
- $150K-500K per model, 3-6 months
- Requires 2-3 data scientists per model
- Each new task starts from scratch
AutoML platforms
- Automates model selection and tuning
- ~63-65 AUROC (still feature-limited)
- $100K-300K per model, 1-2 months faster
- Requires 1 data scientist for features
- Still needs manual feature engineering
Approach 3: LLMs on tables
Serialize your tables as CSV or JSON text, feed them to a large language model, and prompt it to make predictions.
How it works
Convert table rows into text strings. Feed them to an LLM with a prompt like "Based on this customer's transaction history, will they churn?" The LLM processes the serialized data as a text sequence and outputs a prediction. Some approaches fine-tune the LLM on serialized tabular data.
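The serialization step is simple enough to show directly. A minimal sketch, with an illustrative row and prompt wording:

```python
def serialize_row(row: dict) -> str:
    """Flatten one table row into the text string an LLM actually sees."""
    return "; ".join(f"{k}: {v}" for k, v in row.items())

customer = {  # illustrative row
    "customer_id": "C-482",
    "orders_last_30d": 5,
    "return_rate": 0.32,
    "support_tickets": 3,
}

prompt = (
    "Based on this customer's record, will they churn? Answer yes or no.\n"
    + serialize_row(customer)
)
```

The weakness is visible in the output: the foreign keys linking this customer to orders, products, and support agents are gone, so any signal living in those connected tables is invisible to the model.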
Accuracy
On RelBench, Llama 3.2 3B achieves 68.06 average AUROC on classification tasks. This is better than the manual LightGBM baseline (62.44) but well below GNNs (75.83) and relational foundation models (76.71). The LLM can apply some patterns from its language pre-training (understanding that "high return rate" is negative), but it misses numerical relationships and graph structure.
Time and cost
Fast to prototype (hours). But inference cost is high: processing serialized tables through a large LLM consumes significant compute. At enterprise scale (millions of predictions), inference costs $50K to $200K per month. Fine-tuning adds $10K to $50K per task.
When it makes sense
- Quick prototyping when you need a prediction in hours, not months
- Data with significant text content (product descriptions, customer notes)
- Low-stakes predictions where 68 AUROC is acceptable
- Teams with LLM infrastructure but no tabular ML expertise
When it breaks down
- Numerical precision matters (financial data, sensor readings)
- Multi-table relational structure carries signal
- High-volume predictions where inference cost matters
- Accuracy requirements above 70 AUROC
Approach 4: Graph neural networks
Represent your relational database as a graph (rows as nodes, foreign keys as edges) and train a GNN to learn directly from the connected structure.
How it works
Build an ETL pipeline that converts your relational database into a heterogeneous temporal graph. Design a GNN architecture (message passing layers, aggregation functions, temporal encoding). Train on your data with GPU infrastructure. Deploy through a graph serving layer.
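The core operation of a GNN layer, each node pooling its neighbors' features across foreign-key edges, can be sketched in pure Python. This is mean aggregation over a toy two-table graph; a real implementation would use a library such as PyTorch Geometric, with learned weights applied before pooling:

```python
# Toy graph: order rows are nodes linked to their customer by a foreign key.
order_features = {
    "O1": [120.0, 1.0],  # [amount, returned]
    "O2": [80.0, 0.0],
    "O3": [45.0, 0.0],
}
customer_of = {"O1": "C1", "O2": "C1", "O3": "C2"}

def aggregate(customer_id):
    """One round of message passing: a customer node pools the feature
    vectors of its neighboring order nodes (mean aggregation here)."""
    neigh = [order_features[o] for o, c in customer_of.items() if c == customer_id]
    return [sum(col) / len(neigh) for col in zip(*neigh)]

c1_message = aggregate("C1")  # C1's pooled view of its two orders
```

Stacking such layers lets information flow two or three hops out, which is how the GNN picks up the "bought the same products as churners" signal that no flat feature table contains.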
Accuracy
On RelBench, a supervised GNN achieves 75.83 average AUROC on classification tasks. That is a 13.4-point improvement over manual feature engineering, reflecting the GNN's ability to discover multi-hop patterns, temporal sequences, and cross-table interactions that humans cannot enumerate.
Time and cost
First model: 3 to 6 months, team of 2 to 3 ML engineers with GNN expertise. Cost: $500K to $1M. Incremental models: $50K to $200K each (graph and architecture are reusable). GPU infrastructure: $5K to $20K per month. For 10 models: $1M to $3M.
When it makes sense
- Multi-table relational data with rich connection patterns
- Prediction tasks where network effects matter (fraud, recommendations)
- Team with GNN expertise and 6+ months of runway
- 1 to 3 high-value models that justify the infrastructure investment
When it breaks down
- No GNN expertise on the team and unable to hire
- More than 5 prediction tasks (custom training per task)
- Rapid iteration needed (weeks, not months)
- Budget constraints on GPU infrastructure
LLMs on tables
- Fast to prototype (hours)
- 68.06 AUROC on RelBench
- High inference cost at scale
- Misses numerical and relational patterns
- Good for text-heavy data
Graph neural networks
- 3-6 months for first model
- 75.83 AUROC on RelBench
- Efficient inference after training
- Captures multi-hop and temporal patterns
- Requires specialized GNN expertise
Approach 5: Relational foundation models
A pre-trained model that has already learned universal patterns from thousands of relational databases. Connect your data, write a prediction query, get results. No feature engineering, no model training, no GNN expertise.
How it works
The model is pre-trained on data from 5,000+ diverse relational databases. At inference, you connect your database, and the model reads your schema, constructs a temporal graph internally, and makes predictions. You define the task in PQL (Predictive Query Language), which looks like SQL with a PREDICT clause. Zero-shot predictions are immediate. Fine-tuning takes hours for higher accuracy.
Accuracy
On RelBench, zero-shot achieves 76.71 average AUROC, outperforming the supervised GNN (75.83) without any task-specific training. Fine-tuned achieves 81.14 AUROC. The zero-shot result is the key number: it means the pre-training captured enough universal patterns that task-specific training is optional for many use cases.
Time and cost
Zero-shot: minutes. Fine-tuning: 2 to 8 hours. No ML expertise required (SQL is sufficient). Platform cost: varies by data volume and query frequency. For 10 models: $100K to $300K total, because the marginal cost per additional task approaches zero.
When it makes sense
- Multiple prediction tasks (5+) on the same relational database
- Time to value matters more than architectural control
- Team lacks ML or GNN expertise
- Need to evaluate graph ML potential before committing to custom build
- Budget-constrained: highest accuracy per dollar spent
When it breaks down
- Data is not relational (single flat table, images, text-only)
- Need full architectural control for competitive differentiation
- Extreme regulatory requirements that prohibit pre-trained models
Head-to-head summary
| Dimension | Manual ML | AutoML | LLMs | GNNs | Foundation model |
|---|---|---|---|---|---|
| AUROC (RelBench) | 62.44 | ~63-65 | 68.06 | 75.83 | 76.71 zero-shot / 81.14 fine-tuned |
| Time to first prediction | 3-6 months | 1-3 months | Hours | 3-6 months | Minutes |
| Cost for 10 models | $1.5M-5M | $500K-2M | $200K-500K | $1M-3M | $100K-300K |
| Team required | 2-3 data scientists | 1 data scientist | 1 ML engineer | 2-3 GNN specialists | SQL-literate analyst |
| Multi-table handling | Manual joins | Manual joins | Serialization | Native graph | Native graph |
| Cold-start support | No | No | Limited | Yes | Yes |
| Feature engineering | 100% manual | 100% manual | None (serialized) | None (learned) | None (learned) |
PQL Query
PREDICT COUNT(orders.*, 0, 30) > 0 FOR EACH customers.customer_id WHERE customers.segment = 'Enterprise'
This single PQL query delivers what takes 3-6 months and $150K-500K via manual ML. The foundation model reads the relational schema, constructs the graph, and predicts: no feature engineering, no training, no pipeline.
Output
| customer_id | prediction | confidence | approach_comparison |
|---|---|---|---|
| ENT-4821 | 0.87 | high | Manual ML: 3-6 months to match |
| ENT-1093 | 0.34 | high | AutoML: still needs feature table |
| ENT-7756 | 0.15 | high | LLM: 68 vs 77 AUROC on this task |
| ENT-3302 | 0.94 | high | GNN: matches accuracy, 100x slower |
Decision framework
Ask three questions to determine which approach fits:
- Is your data relational (3+ connected tables)? If no, manual ML or AutoML on a single table is sufficient. If yes, graph-based approaches (GNN or foundation model) provide a structural accuracy advantage.
- How many prediction tasks do you need? For 1 to 2, any approach works. For 5+, the marginal cost per task matters, and foundation models win on economics. For 10+, manual approaches become impractical.
- Does your team have GNN expertise? If yes and you need maximum architectural control, custom GNNs are justified. If no, a foundation model delivers comparable accuracy without the hiring challenge.
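The three questions above collapse into a short decision rule. A sketch that encodes the framework; the thresholds mirror the text, and the function name is ours:

```python
def recommend_approach(num_connected_tables: int,
                       num_prediction_tasks: int,
                       has_gnn_expertise: bool) -> str:
    """Encode the three-question decision framework from the text."""
    if num_connected_tables < 3:
        # Data is not meaningfully relational.
        return "manual ML or AutoML on a single table"
    if num_prediction_tasks >= 5 or not has_gnn_expertise:
        # Marginal cost per task dominates, or the hiring problem is unsolved.
        return "relational foundation model"
    # Relational data, few tasks, in-house GNN skills.
    return "custom GNN (max architectural control)"
```

For one or two tasks on relational data with a capable team, the GNN and foundation-model branches are close; the function simply picks the one the framework's economics favor.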
The trend line is clear: enterprise ML is moving from manual, single-task pipelines toward pre-trained, multi-task foundation models. Not because foundation models are always better on a single task, but because the economics of running 10 to 100 predictions make per-task approaches untenable.
KumoRFM was built by the team behind the ML systems at Pinterest, Airbnb, and LinkedIn: Vanja Josifovski (CEO, former CTO at Airbnb and Pinterest), Jure Leskovec (Chief Scientist, Stanford professor, co-creator of GraphSAGE), and Hema Raghavan (Head of Engineering, former Sr. Director at LinkedIn). Backed by Sequoia Capital.