Enterprise ML has a fragmentation problem. Five distinct approaches compete for the same budget, and each vendor claims theirs is best. The truth is that each approach has a genuine sweet spot and genuine limitations. This comparison strips away the marketing and evaluates all five on the dimensions that matter: accuracy, time to value, team requirements, cost per prediction task, and which data structures they handle.
All accuracy numbers come from the RelBench benchmark (7 databases, 30 tasks, 103M+ rows, temporal splits). This is the only benchmark designed for multi-table relational data with proper temporal evaluation.
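Temporal splitting is what makes this evaluation honest: the model trains only on events before a cutoff and is scored on what happens after, so future information never leaks into features. A minimal sketch of the split logic, with an illustrative event log:

```python
from datetime import date

# Toy event log: (customer_id, event_date, label); values are illustrative.
events = [
    (1, date(2023, 1, 5), 0),
    (1, date(2023, 6, 10), 1),
    (2, date(2023, 3, 1), 0),
    (2, date(2023, 9, 20), 1),
    (3, date(2023, 11, 2), 1),
]

TRAIN_CUTOFF = date(2023, 7, 1)
VAL_CUTOFF = date(2023, 10, 1)

# Train only on the past, validate on the near future, test on the far future.
train = [e for e in events if e[1] < TRAIN_CUTOFF]
val = [e for e in events if TRAIN_CUTOFF <= e[1] < VAL_CUTOFF]
test = [e for e in events if e[1] >= VAL_CUTOFF]
```

A random split would shuffle all five events together and let the model "see" late-2023 behavior while predicting it, inflating every accuracy number in the tables below.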
Five approaches head-to-head
| Metric | Manual ML | AutoML | LLMs on Tables | Custom GNN | Foundation Model |
|---|---|---|---|---|---|
| AUROC (RelBench) | 62.44 | ~63-65 | 68.06 | 75.83 | 76.71 zero-shot / 81.14 fine-tuned |
| Time to 1st Prediction | 3-6 months | 1-3 months | Hours | 3-6 months | Minutes |
| Cost for 10 Models | $1.5M-5M | $500K-2M | $200K-500K | $1M-3M | $100K-300K |
| Feature Engineering | 100% manual | 100% manual | None (serialized) | None (learned) | None (learned) |
| Team Required | 2-3 data scientists | 1 data scientist | 1 ML engineer | 2-3 GNN specialists | SQL-literate analyst |
| Multi-table Support | Manual joins | Manual joins | Serialization | Native graph | Native graph |
| Cold-start Entities | No | No | Limited | Yes | Yes |
| Marginal Cost/Task | $150K-500K | $100K-300K | $50K-100K | $50K-200K | Near-zero |
Foundation models lead on accuracy, speed, and cost simultaneously.
Approach 1: Manual ML pipelines
A team of data scientists writes SQL to engineer features from your relational database, builds a flat feature table, trains a gradient-boosted model (XGBoost, LightGBM), and deploys it through a serving layer.
How it works
The data scientist studies the database schema, writes SQL joins across relevant tables, computes aggregate features (count, sum, average, max, min across time windows), trains a model on the flat output, tunes hyperparameters, validates with temporal splits, and deploys. Each new prediction task repeats this cycle.
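The aggregate-feature step above is the expensive part. A minimal sketch of one such hand-engineered feature, the kind a data scientist would otherwise write as a SQL GROUP BY over a time window (table schema and values are illustrative):

```python
from datetime import date, timedelta

# Toy orders table; schema and values are illustrative.
orders = [
    {"customer_id": "C1", "order_date": date(2024, 5, 1), "amount": 120.0},
    {"customer_id": "C1", "order_date": date(2024, 5, 20), "amount": 80.0},
    {"customer_id": "C1", "order_date": date(2024, 3, 2), "amount": 200.0},
    {"customer_id": "C2", "order_date": date(2024, 5, 28), "amount": 45.0},
]

def window_features(customer_id, as_of, days=30):
    """Hand-engineered aggregates over a trailing time window."""
    window = [
        o["amount"] for o in orders
        if o["customer_id"] == customer_id
        and as_of - timedelta(days=days) <= o["order_date"] < as_of
    ]
    return {
        "order_count_30d": len(window),
        "spend_sum_30d": sum(window),
        "spend_avg_30d": sum(window) / len(window) if window else 0.0,
        "spend_max_30d": max(window, default=0.0),
    }

# One row of the flat feature table that then feeds XGBoost or LightGBM.
feats = window_features("C1", as_of=date(2024, 6, 1))
```

Every new prediction task means inventing, writing, and validating dozens of functions like this one, which is why the per-model cost stays high.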
Accuracy
On RelBench, LightGBM with features engineered by a Stanford-trained data scientist achieves 62.44 average AUROC on classification tasks. This is the best-effort result with unlimited time, full domain knowledge, and experienced practitioners. The accuracy ceiling is set by what the human can engineer, not by the model's capacity.
Time and cost
3 to 6 months per prediction model. Team of 2 to 3 data scientists at $200K to $300K fully loaded cost each. Per-model cost: $150K to $500K including infrastructure and opportunity cost. For 10 models: $1.5M to $5M over 2 to 3 years.
When it makes sense
- Single-table data where feature engineering is minimal
- Highly regulated domains requiring full feature transparency
- Established teams with deep domain expertise in the specific prediction
- One or two high-value models that justify the investment
When it breaks down
- Multi-table data requiring complex cross-table features
- More than 5 prediction tasks (cost scales linearly with tasks)
- Cold-start entities with no historical features
- Teams that cannot hire or retain data scientists
What each approach sees for customer C-482
| Signal | Manual ML | AutoML | LLM | GNN | Foundation Model |
|---|---|---|---|---|---|
| Own attributes (age, balance) | Yes | Yes | Yes | Yes | Yes |
| 30-day order count = 5 | Yes | Yes | Yes | Yes | Yes |
| Orders declining (5,4,3,2,1) | No (aggregated) | No (aggregated) | Partial | Yes | Yes |
| Bought same products as churners | No (3-hop) | No (3-hop) | No | Yes | Yes |
| Support agent has low resolution rate | No (2-hop) | No (2-hop) | No | Yes | Yes |
| Prediction accuracy (AUROC) | 62.44 | ~63-65 | 68.06 | 75.83 | 76.71 |
The last three signal rows are captured only by graph-based approaches. The 14-point AUROC gap between manual ML and foundation models comes from these multi-hop and temporal patterns, which flat feature tables never see.
Approach 2: AutoML platforms
Upload a flat feature table to an AutoML platform. The platform automatically tests hundreds of model architectures, tunes hyperparameters, selects features, and produces a deployable model.
How it works
You prepare a flat feature table (this step is still manual). The platform runs automated experiments: trying logistic regression, random forests, gradient-boosted trees, neural networks, and ensembles. It selects the best model based on cross-validation performance and provides a deployment endpoint.
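What the platform automates can be sketched in a few lines: given a flat feature table, try several model families and keep the one with the best cross-validated score. This illustrative sketch uses scikit-learn with a synthetic stand-in for the manually engineered table:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# A synthetic stand-in for the manually engineered flat feature table.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# The part AutoML automates: evaluating candidate model families
# and selecting the best by cross-validated AUROC.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}
scores = {
    name: cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()
    for name, model in candidates.items()
}
best = max(scores, key=scores.get)
```

Note what is missing: nothing in this loop touches the feature table itself. If the features omit a signal, no amount of model search recovers it.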
Accuracy
On single flat tables, AutoML matches expert-tuned models within 1 to 2% accuracy. The platform optimizes the last 20% of the pipeline (model selection, hyperparameters) effectively. But on multi-table relational data, accuracy is capped by the quality of the input feature table, which is still manually engineered. Expected RelBench-equivalent: roughly 63 to 65 AUROC with the same manual features (marginally better model selection does not overcome feature engineering limitations).
Time and cost
Feature engineering: still 4 to 8 weeks per task. Model building: reduced from weeks to hours. Per-model cost: $100K to $300K (feature engineering dominates). Platform license: $50K to $200K per year. For 10 models: $500K to $2M.
When it makes sense
- Team has feature engineering capacity but limited modeling expertise
- Multiple similar prediction tasks on the same feature table
- Need to quickly iterate on model selection and tuning
- Compliance requires model comparison documentation
When it breaks down
- Feature engineering is the bottleneck (AutoML does not help)
- Multi-table data requiring new features for each task
- Tasks where feature quality, not model choice, limits accuracy
Manual ML pipelines
- Full control over every decision
- 62.44 AUROC on RelBench (feature-limited)
- $150K-500K per model, 3-6 months
- Requires 2-3 data scientists per model
- Each new task starts from scratch
AutoML platforms
- Automates model selection and tuning
- ~63-65 AUROC (still feature-limited)
- $100K-300K per model, 1-2 months faster
- Requires 1 data scientist for features
- Still needs manual feature engineering
Approach 3: LLMs on tables
Serialize your tables as CSV or JSON text, feed them to a large language model, and prompt it to make predictions.
How it works
Convert table rows into text strings. Feed them to an LLM with a prompt like "Based on this customer's transaction history, will they churn?" The LLM processes the serialized data as a text sequence and outputs a prediction. Some approaches fine-tune the LLM on serialized tabular data.
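The serialization step is simple enough to show directly. A minimal sketch, with an illustrative row and prompt wording:

```python
def serialize_row(row: dict) -> str:
    """Flatten one table row into the text string an LLM actually sees."""
    return "; ".join(f"{k}: {v}" for k, v in row.items())

customer = {  # illustrative row
    "customer_id": "C-482",
    "orders_last_30d": 5,
    "return_rate": 0.32,
    "support_tickets": 3,
}

prompt = (
    "Based on this customer's record, will they churn? Answer yes or no.\n"
    + serialize_row(customer)
)
```

The weakness is visible in the output: the foreign keys linking this customer to orders, products, and support agents are gone, so any signal living in those connected tables is invisible to the model.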
Accuracy
On RelBench, Llama 3.2 3B achieves 68.06 average AUROC on classification tasks. This is better than the manual LightGBM baseline (62.44) but well below GNNs (75.83) and relational foundation models (76.71). The LLM can apply some patterns from its language pre-training (understanding that "high return rate" is negative), but it misses numerical relationships and graph structure.
Time and cost
Fast to prototype (hours). But inference cost is high: processing serialized tables through a large LLM consumes significant compute. At enterprise scale (millions of predictions), inference costs $50K to $200K per month. Fine-tuning adds $10K to $50K per task.
When it makes sense
- Quick prototyping when you need a prediction in hours, not months
- Data with significant text content (product descriptions, customer notes)
- Low-stakes predictions where 68 AUROC is acceptable
- Teams with LLM infrastructure but no tabular ML expertise
When it breaks down
- Numerical precision matters (financial data, sensor readings)
- Multi-table relational structure carries signal
- High-volume predictions where inference cost matters
- Accuracy requirements above 70 AUROC
Approach 4: Graph neural networks
Represent your relational database as a graph (rows as nodes, foreign keys as edges) and train a GNN to learn directly from the connected structure.
How it works
Build an ETL pipeline that converts your relational database into a heterogeneous temporal graph. Design a GNN architecture (message passing layers, aggregation functions, temporal encoding). Train on your data with GPU infrastructure. Deploy through a graph serving layer.
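The core operation of a GNN layer, each node pooling its neighbors' features across foreign-key edges, can be sketched in pure Python. This is mean aggregation over a toy two-table graph; a real implementation would use a library such as PyTorch Geometric, with learned weights applied before pooling:

```python
# Toy graph: order rows are nodes linked to their customer by a foreign key.
order_features = {
    "O1": [120.0, 1.0],  # [amount, returned]
    "O2": [80.0, 0.0],
    "O3": [45.0, 0.0],
}
customer_of = {"O1": "C1", "O2": "C1", "O3": "C2"}

def aggregate(customer_id):
    """One round of message passing: a customer node pools the feature
    vectors of its neighboring order nodes (mean aggregation here)."""
    neigh = [order_features[o] for o, c in customer_of.items() if c == customer_id]
    return [sum(col) / len(neigh) for col in zip(*neigh)]

c1_message = aggregate("C1")  # C1's pooled view of its two orders
```

Stacking such layers lets information flow two or three hops out, which is how the GNN picks up the "bought the same products as churners" signal that no flat feature table contains.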
Accuracy
On RelBench, a supervised GNN achieves 75.83 average AUROC on classification tasks. That is a 13.4-point improvement over manual feature engineering, reflecting the GNN's ability to discover multi-hop patterns, temporal sequences, and cross-table interactions that humans cannot enumerate.
Time and cost
First model: 3 to 6 months, team of 2 to 3 ML engineers with GNN expertise. Cost: $500K to $1M. Incremental models: $50K to $200K each (graph and architecture are reusable). GPU infrastructure: $5K to $20K per month. For 10 models: $1M to $3M.
When it makes sense
- Multi-table relational data with rich connection patterns
- Prediction tasks where network effects matter (fraud, recommendations)
- Team with GNN expertise and 6+ months of runway
- 1 to 3 high-value models that justify the infrastructure investment
When it breaks down
- No GNN expertise on the team and unable to hire
- More than 5 prediction tasks (custom training per task)
- Rapid iteration needed (weeks, not months)
- Budget constraints on GPU infrastructure
LLMs on tables
- Fast to prototype (hours)
- 68.06 AUROC on RelBench
- High inference cost at scale
- Misses numerical and relational patterns
- Good for text-heavy data
Graph neural networks
- 3-6 months for first model
- 75.83 AUROC on RelBench
- Efficient inference after training
- Captures multi-hop and temporal patterns
- Requires specialized GNN expertise
Approach 5: Relational foundation models
A pre-trained model that has already learned universal patterns from thousands of relational databases. Connect your data, write a prediction query, get results. No feature engineering, no model training, no GNN expertise.
How it works
The model is pre-trained on data from 5,000+ diverse relational databases. At inference, you connect your database, and the model reads your schema, constructs a temporal graph internally, and makes predictions. You define the task in PQL (Predictive Query Language), which looks like SQL with a PREDICT clause. Zero-shot predictions are immediate. Fine-tuning takes hours for higher accuracy.
Accuracy
On RelBench, zero-shot achieves 76.71 average AUROC, outperforming the supervised GNN (75.83) without any task-specific training. Fine-tuned achieves 81.14 AUROC. The zero-shot result is the key number: it means the pre-training captured enough universal patterns that task-specific training is optional for many use cases.
Time and cost
Zero-shot: minutes. Fine-tuning: 2 to 8 hours. No ML expertise required (SQL is sufficient). Platform cost: varies by data volume and query frequency. For 10 models: $100K to $300K total, because the marginal cost per additional task approaches zero.
When it makes sense
- Multiple prediction tasks (5+) on the same relational database
- Time to value matters more than architectural control
- Team lacks ML or GNN expertise
- Need to evaluate graph ML potential before committing to custom build
- Budget-constrained: highest accuracy per dollar spent
When it breaks down
- Data is not relational (single flat table, images, text-only)
- Need full architectural control for competitive differentiation
- Extreme regulatory requirements that prohibit pre-trained models
Head-to-head summary
| Dimension | Manual ML | AutoML | LLMs | GNNs | Foundation model |
|---|---|---|---|---|---|
| AUROC (RelBench) | 62.44 | ~63-65 | 68.06 | 75.83 | 76.71 zero-shot / 81.14 fine-tuned |
| Time to first prediction | 3-6 months | 1-3 months | Hours | 3-6 months | Minutes |
| Cost for 10 models | $1.5M-5M | $500K-2M | $200K-500K | $1M-3M | $100K-300K |
| Team required | 2-3 data scientists | 1 data scientist | 1 ML engineer | 2-3 GNN specialists | SQL-literate analyst |
| Multi-table handling | Manual joins | Manual joins | Serialization | Native graph | Native graph |
| Cold-start support | No | No | Limited | Yes | Yes |
| Feature engineering | 100% manual | 100% manual | None (serialized) | None (learned) | None (learned) |
PQL Query
PREDICT COUNT(orders.*, 0, 30) > 0 FOR EACH customers.customer_id WHERE customers.segment = 'Enterprise'
This single PQL query delivers what takes 3-6 months and $150K-500K via manual ML. The foundation model reads the relational schema, constructs the graph, and predicts: no feature engineering, no training, no pipeline.
Output
| customer_id | prediction | confidence | approach_comparison |
|---|---|---|---|
| ENT-4821 | 0.87 | high | Manual ML: 3-6 months to match |
| ENT-1093 | 0.34 | high | AutoML: still needs feature table |
| ENT-7756 | 0.15 | high | LLM: 68 vs 77 AUROC on this task |
| ENT-3302 | 0.94 | high | GNN: matches accuracy, 100x slower |
Decision framework
Ask three questions to determine which approach fits:
- Is your data relational (3+ connected tables)? If no, manual ML or AutoML on a single table is sufficient. If yes, graph-based approaches (GNN or foundation model) provide a structural accuracy advantage.
- How many prediction tasks do you need? For 1 to 2, any approach works. For 5+, the marginal cost per task matters, and foundation models win on economics. For 10+, manual approaches become impractical.
- Does your team have GNN expertise? If yes and you need maximum architectural control, custom GNNs are justified. If no, a foundation model delivers comparable accuracy without the hiring challenge.
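The three questions above collapse into a short decision rule. A sketch that encodes the framework; the thresholds mirror the text, and the function name is ours:

```python
def recommend_approach(num_connected_tables: int,
                       num_prediction_tasks: int,
                       has_gnn_expertise: bool) -> str:
    """Encode the three-question decision framework from the text."""
    if num_connected_tables < 3:
        # Data is not meaningfully relational.
        return "manual ML or AutoML on a single table"
    if num_prediction_tasks >= 5 or not has_gnn_expertise:
        # Marginal cost per task dominates, or the hiring problem is unsolved.
        return "relational foundation model"
    # Relational data, few tasks, in-house GNN skills.
    return "custom GNN (max architectural control)"
```

For one or two tasks on relational data with a capable team, the GNN and foundation-model branches are close; the function simply picks the one the framework's economics favor.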
The trend line is clear: enterprise ML is moving from manual, single-task pipelines toward pre-trained, multi-task foundation models. Not because foundation models are always better on a single task, but because the economics of running 10 to 100 predictions make per-task approaches untenable.
KumoRFM was built by the team behind the ML systems at Pinterest, Airbnb, and LinkedIn: Vanja Josifovski (CEO, former CTO at Airbnb and Pinterest), Jure Leskovec (Chief Scientist, Stanford professor, co-creator of GraphSAGE), and Hema Raghavan (Head of Engineering, former Sr. Director at LinkedIn). Backed by Sequoia Capital.