Every enterprise relational database has the same fundamental structure: entities stored in tables, connected by foreign keys. Customers link to orders. Orders link to products. Products link to categories. Patients link to diagnoses, prescriptions, and lab results. The data is inherently connected.
Every mainstream ML model has the same fundamental requirement: a flat table. One row per entity, one column per feature. No foreign keys, no joins, no multi-table structure.
This mismatch is the central bottleneck in enterprise ML. It explains why data scientists spend 80% of their time on feature engineering. It explains why each prediction task takes 3 to 6 months. And it explains why most companies have fewer than 10 ML models in production despite sitting on hundreds of potential use cases.
Three approaches exist to bridge this gap. Each makes different trade-offs between accuracy, speed, and engineering effort. This guide covers all three with benchmark results from RelBench, the standard evaluation suite for ML on relational databases.
The problem: relational data vs. flat tables
A typical enterprise database has 10 to 50 tables. RelBench datasets range from 3 tables (Amazon product data) to 15 tables (clinical trial data), with up to 41 million rows.
When you want to predict customer churn, the relevant information is scattered across multiple tables: customer demographics, order history, product details, support interactions, payment methods, website activity. Each table has a different granularity (one row per customer vs. one row per order vs. one row per page view), and the relationships between tables carry as much signal as the values within them.
The challenge is not just joining tables. It is deciding what to compute from those joins. Total spend? Average order value? Number of distinct products? Over what time window? And that is just for direct relationships. Multi-hop patterns (customers who bought products that other churning customers also bought) carry strong signal but require traversing 3 or 4 tables.
Example e-commerce schema
| Table | Rows | Key Columns | Connects To |
|---|---|---|---|
| customers | 2.1M | customer_id, signup_date, segment | orders, support_tickets |
| orders | 18.4M | order_id, customer_id, product_id, amount, date | customers, products |
| products | 145K | product_id, category, price, brand | orders, reviews |
| support_tickets | 3.2M | ticket_id, customer_id, category, resolved | customers |
| reviews | 8.7M | review_id, product_id, customer_id, rating, date | products, customers |
A typical e-commerce relational database. The prediction challenge: flatten 5 tables with 32M+ rows into one row per customer.
Approach 1: Flatten and aggregate
This is the default approach for the vast majority of enterprise data science teams. You write SQL to join tables, compute aggregate features, and produce a flat table that a standard model (XGBoost, logistic regression, neural network) can consume.
To make the comparison concrete, here is the same prediction task (will this customer reorder within 30 days?) evaluated across all three approaches on the same data.
Same task, three approaches
| Customer | Flatten+Aggregate | GNN | Foundation Model | Actual |
|---|---|---|---|---|
| C-001 (steady buyer) | 0.62 | 0.78 | 0.81 | Reordered (day 12) |
| C-002 (new, 1 order) | 0.50 (no signal) | 0.71 | 0.74 | Reordered (day 8) |
| C-003 (declining) | 0.58 | 0.34 | 0.29 | Did not reorder |
| C-004 (competitor product buyer) | 0.61 | 0.22 | 0.19 | Did not reorder |
C-002 is a cold-start customer (1 order). The flat model gives 0.50 (coin flip). GNN/FM use graph connections to predict 0.71/0.74. C-004 bought products that other churners also bought (multi-hop signal) -- only GNN/FM detect this.
How it works
For each prediction target, a data scientist writes JOIN queries across the relevant tables, applies aggregation functions (COUNT, SUM, AVG, MAX, MIN, COUNT DISTINCT) over various time windows (7 days, 30 days, 90 days, all time), and outputs a single table with one row per entity and hundreds of computed columns.
A typical churn model might have 200 to 500 features engineered from 5 to 10 source tables, with 500 to 2,000 lines of SQL behind them. The Stanford RelBench study measured the average cost at 12.3 hours and 878 lines of code per prediction task, even for experienced data scientists.
Flattened feature table example
| customer_id | orders_30d | avg_amount | days_since_last | tickets_open | avg_review |
|---|---|---|---|---|---|
| C-001 | 3 | $87.50 | 5 | 0 | 4.2 |
| C-002 | 0 | $0.00 | 94 | 2 | 2.1 |
| C-003 | 7 | $142.30 | 1 | 0 | 4.8 |
| C-004 | 1 | $34.00 | 28 | 1 | 3.5 |
What XGBoost sees after flattening. Missing: order sequence, ticket categories, product return rates, similar-customer behavior, temporal acceleration.
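A minimal pandas sketch of this flattening step, using hypothetical toy versions of the customers and orders tables (column names are illustrative; a production pipeline would compute hundreds of such columns in SQL):

```python
import pandas as pd

# Hypothetical toy data standing in for the customers and orders tables.
customers = pd.DataFrame({"customer_id": ["C-001", "C-002"]})
orders = pd.DataFrame({
    "customer_id": ["C-001", "C-001", "C-001", "C-002"],
    "amount": [80.0, 90.0, 92.5, 34.0],
    "date": pd.to_datetime(["2025-03-05", "2025-03-12", "2025-03-28", "2024-12-29"]),
})

as_of = pd.Timestamp("2025-04-02")  # prediction cutoff

# One aggregation pass: order count and average amount over a 30-day
# window, plus recency over all time -- one row per customer at the end.
recent = orders[orders["date"] >= as_of - pd.Timedelta(days=30)]
features = customers.merge(
    recent.groupby("customer_id")
          .agg(orders_30d=("amount", "size"), avg_amount=("amount", "mean"))
          .reset_index(),
    on="customer_id", how="left",
).merge(
    orders.groupby("customer_id")["date"].max().rename("last_order").reset_index(),
    on="customer_id", how="left",
)
features["days_since_last"] = (as_of - features["last_order"]).dt.days
features = features.fillna({"orders_30d": 0, "avg_amount": 0.0}).drop(columns="last_order")
print(features)
```

Every design decision here (which window, which aggregate, which cutoff) has to be made by hand, per feature, per task.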
What it gets right
- Uses battle-tested ML models (XGBoost, LightGBM) that data scientists understand well
- Interpretable features that business stakeholders can validate
- Low-latency inference (no graph lookups at serving time)
- Mature tooling for model monitoring, A/B testing, and deployment
What it misses
- Multi-hop patterns. A customer's churn risk depends on the behavior of similar customers 2 to 3 hops away. Nobody writes these features because the join paths are too complex to enumerate manually.
- Temporal sequences. Aggregating "5 orders in 30 days" destroys the sequence: were they accelerating, decelerating, or clustered? The ordering carries signal that averages erase.
- Feature interactions. The combination of high return rate + declining order value + increasing support tickets is more predictive than any single feature. The combinatorial space of interactions is too large to engineer by hand.
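The temporal-sequence point is easy to demonstrate. In this sketch, two hypothetical customers each place 5 orders in 30 days, so a count feature cannot tell them apart; a simple gap-trend feature can (real pipelines rarely include one):

```python
from datetime import date

# Two hypothetical customers: same order count over 30 days, opposite trends.
accelerating = [date(2025, 3, d) for d in (1, 12, 20, 25, 28)]  # gaps shrinking
decelerating = [date(2025, 3, d) for d in (1, 4, 9, 17, 28)]    # gaps growing

def order_count(dates):
    return len(dates)  # what an "orders_30d" feature sees: identical for both

def gap_trend(dates):
    """Last gap minus first gap: negative = accelerating, positive = decelerating."""
    gaps = [(b - a).days for a, b in zip(dates, dates[1:])]
    return gaps[-1] - gaps[0]

print(order_count(accelerating), order_count(decelerating))  # 5 5
print(gap_trend(accelerating))   # -8: speeding up
print(gap_trend(decelerating))   # 8: slowing down
```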
Benchmark results
On RelBench (7 databases, 30 tasks, 103M+ rows), LightGBM with features engineered by a Stanford-trained data scientist achieves an average AUROC of 62.44 on classification tasks. This is the baseline that represents best-effort manual feature engineering with unlimited time.
Approach 2: Graph neural networks
Instead of flattening the relational structure, represent it as a graph and train a model that operates directly on that graph.
How it works
The relational database is converted into a heterogeneous graph. Each table row becomes a node, tagged with its entity type (customer, order, product). Each foreign key becomes an edge connecting the relevant nodes. Timestamps on rows create a temporal dimension, allowing the model to reason about sequences and recency.
A graph neural network then learns through message passing. In each layer, every node aggregates information from its neighbors, applies a learned transformation, and updates its representation. After k layers, each node encodes information from all entities within k hops. A 3-layer GNN on a customer node captures the customer's orders, the products in those orders, other customers who bought those products, and their behavior patterns.
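The two steps above can be sketched in a few lines of plain Python. Foreign-key pairs become edges, and each message-passing round averages neighbor values into a node's own representation. A real GNN applies learned transformations rather than plain averaging, and the table rows are hypothetical, but the information flow is the same:

```python
# Step 1: convert foreign-key rows into an undirected adjacency list.
# Each (customer_id, product_id) pair from a toy orders table becomes an edge.
order_rows = [("C-001", "P-9"), ("C-002", "P-9"), ("C-002", "P-7")]
adj = {}
for customer, product in order_rows:
    adj.setdefault(customer, set()).add(product)
    adj.setdefault(product, set()).add(customer)

# Step 2: message passing. Each node starts with a scalar feature; in each
# layer it averages its neighbors' previous values with its own.
h = {node: float(i) for i, node in enumerate(sorted(adj))}
for _ in range(3):
    h = {
        node: (h[node] + sum(h[nbr] for nbr in adj[node])) / (1 + len(adj[node]))
        for node in adj
    }

# After 3 layers, C-001's representation depends on P-7, a product C-001
# never bought, via the 3-hop path C-001 -> P-9 -> C-002 -> P-7.
print(h["C-001"])
```

This is exactly the multi-hop signal that no hand-written aggregate captures: C-001's representation now reflects the behavior of another customer who shares a product with it.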
What it gets right
- Preserves the relational structure, no information loss from flattening
- Automatically discovers multi-hop patterns that manual engineering misses
- Temporal encoding captures sequence patterns that aggregates destroy
- One architecture handles any prediction task on the same graph
What it requires
- Specialized expertise. GNN architecture design, message passing schemes, neighborhood sampling, and temporal encoding are not standard data science skills. Most teams need to hire or upskill.
- Graph construction pipeline. Converting a relational database to a graph requires an ETL pipeline that handles schema mapping, edge creation, temporal ordering, and incremental updates.
- GPU infrastructure. GNN training on enterprise-scale graphs requires 1 to 8 GPUs depending on graph size. Training runs take hours to days.
- Time. First production model takes 3 to 6 months for a team of 2 to 3 ML engineers.
Benchmark results
On RelBench, a supervised GNN (trained per task) achieves an average AUROC of 75.83 on classification tasks. That is a 13.4-point improvement over the LightGBM baseline. On regression tasks, MAE improves by 15 to 25% on average.
Flatten and aggregate
- 12.3 hours per prediction task
- 878 lines of code per task
- 62.44 average AUROC on RelBench
- Misses multi-hop and temporal patterns
- Rebuild from scratch for every new task
Graph neural network
- 3-6 months for first model, then reusable
- Architecture handles any task on the graph
- 75.83 average AUROC on RelBench
- Captures multi-hop and temporal patterns
- Requires GNN expertise and GPU infrastructure
PQL Query
```sql
PREDICT COUNT(orders.*, 0, 30) = 0
FOR EACH customers.customer_id
WHERE customers.signup_date < '2025-01-01'
```
A GNN learns from the full graph: customer -> orders -> products -> other customers -> their behavior. This multi-hop traversal captures patterns no flat table can represent.
Output
| customer_id | churn_probability | hop_depth_signal | top_pattern |
|---|---|---|---|
| C-001 | 0.12 | 2-hop | Similar customers highly active |
| C-002 | 0.89 | 3-hop | Products they bought have 40% return rate |
| C-003 | 0.05 | 1-hop | Accelerating order frequency |
| C-004 | 0.67 | 2-hop | Support agent has low resolution rate |
Approach 3: Relational foundation model
A foundation model pre-trained on relational data from thousands of diverse databases. It learns universal patterns during pre-training and applies them to any new database at inference time.
How it works
KumoRFM, the first relational foundation model, was pre-trained on data from over 5,000 relational databases spanning e-commerce, financial services, healthcare, manufacturing, and SaaS. During pre-training, it learned the universal patterns that recur across relational data: recency effects, frequency patterns, temporal dynamics, graph topology, cross-table signal propagation.
At inference time, you connect your relational database and write a prediction query in PQL (Predictive Query Language). The model reads your schema, constructs a temporal graph internally, and produces predictions. No training step, no feature engineering, no graph construction pipeline.
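As a concrete sketch, a new prediction task against the example e-commerce schema is just another query. The syntax below is modeled on the PQL query shown earlier; exact forms may differ, so treat it as illustrative:

```sql
-- Predicted spend per customer over the next 30 days (a regression task):
PREDICT SUM(orders.amount, 0, 30)
FOR EACH customers.customer_id
```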
For higher accuracy on specific tasks, you can fine-tune the model on your data. Fine-tuning takes hours, not months, because the model already understands relational patterns and only needs to adapt to your specific schema and distribution.
What it gets right
- Zero-shot predictions in seconds, no training or engineering required
- Matches or exceeds supervised GNNs on most tasks without task-specific training
- One model handles any prediction task on any relational database
- No ML expertise required; PQL looks like SQL with a PREDICT clause
What it requires
- Trust in a pre-trained model (similar to using GPT vs. training your own LLM)
- Database connectivity for the model to read your schema and data
- Willingness to validate predictions before production deployment
Benchmark results
On RelBench, KumoRFM zero-shot achieves an average AUROC of 76.71 on classification tasks, outperforming the supervised GNN baseline (75.83) without any task-specific training. Fine-tuned, it reaches 81.14 AUROC. On regression tasks, zero-shot MAE is competitive with trained GNNs, and fine-tuning reduces error by an additional 10 to 20%.
Head-to-head comparison
Here are the three approaches compared across the dimensions that matter for enterprise deployment.
| Dimension | Flatten + aggregate | GNN | Foundation model |
|---|---|---|---|
| AUROC (RelBench avg) | 62.44 | 75.83 | 76.71 (zero-shot) / 81.14 (fine-tuned) |
| Time to first prediction | 2-6 months | 3-6 months | Minutes (zero-shot) / hours (fine-tuned) |
| Team required | 2-3 data scientists | 2-3 ML engineers with GNN skills | Anyone who can write SQL |
| Per-task marginal cost | $150K-500K (team time) | $50K-200K (after first model) | Near-zero (same model, new query) |
| Multi-hop signals | Rarely captured | Captured automatically | Captured automatically |
| Temporal patterns | Lost in aggregation | Preserved with temporal encoding | Preserved with temporal encoding |
| Cold-start entities | No prediction possible | Predictions from graph structure | Predictions from graph structure |
When to use each approach
Use flatten-and-aggregate when:
- Your data genuinely lives in a single table
- You have fewer than 3 interconnected tables
- Regulatory requirements demand fully interpretable features
- You are building a proof of concept with a 2-week deadline
Use a custom GNN when:
- You have a unique graph structure that differs significantly from standard relational schemas
- You need full control over the model architecture for research or competitive advantage
- You have 2 to 3 ML engineers with GNN experience and 6 months of runway
- You are building 1 to 2 high-stakes models, not a portfolio
Use a relational foundation model when:
- You need predictions across 5 or more tasks on the same relational database
- Time to value matters more than architectural control
- Your team does not have GNN expertise and cannot hire for it
- You want to evaluate graph ML's potential before committing to a custom build
The economics of each approach
The cost difference becomes dramatic when you move beyond a single prediction task.
Flatten-and-aggregate: Each new prediction task requires a new round of feature engineering. If you need 10 models, you need 10 rounds of SQL writing, feature selection, and model training. At $150K to $500K per model (team time, infrastructure, opportunity cost), a portfolio of 10 models costs $1.5M to $5M.
Custom GNN: The first model is expensive (6 months, $500K to $1M). But the graph and architecture are reusable. Each additional task costs $50K to $200K for fine-tuning and validation. 10 models cost $1M to $3M total.
Foundation model: Connect your database once. Each new prediction task is a new PQL query. The marginal cost per task approaches zero. 10 models cost the platform fee plus validation time. Total: $100K to $300K depending on data volume and query frequency.
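The portfolio arithmetic can be made explicit. This sketch just multiplies out midpoints of the ranges quoted above; the specific dollar figures are illustrative assumptions, not a pricing model:

```python
# Midpoint cost assumptions drawn from the ranges quoted above (illustrative).
FLATTEN_PER_TASK = 325_000                  # midpoint of $150K-500K per model
GNN_FIRST, GNN_NEXT = 750_000, 125_000      # $500K-1M first model, $50K-200K after
FM_PLATFORM, FM_PER_TASK = 200_000, 5_000   # hypothetical platform fee + near-zero marginal cost

def portfolio_cost(n_tasks):
    """Total cost of an n-task portfolio under each approach."""
    return {
        "flatten": FLATTEN_PER_TASK * n_tasks,
        "gnn": GNN_FIRST + GNN_NEXT * (n_tasks - 1),
        "foundation_model": FM_PLATFORM + FM_PER_TASK * n_tasks,
    }

for n in (1, 5, 10):
    print(n, portfolio_cost(n))
```

At these midpoints, flatten-and-aggregate scales linearly with no reuse, the GNN amortizes its large fixed cost, and the foundation model's curve is nearly flat.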
Getting started
If your data lives in a relational database with 3 or more tables, start by benchmarking all three approaches on one prediction task. Use your existing flat model as the baseline, run a zero-shot foundation model prediction as the quick test, and evaluate whether the accuracy difference justifies changing your approach.
Most teams find that the foundation model matches or exceeds their manual pipeline on the first attempt. The ones that do not typically have extremely domain-specific data (proprietary sensor readings, custom encodings) where fine-tuning closes the gap.
The relational data prediction problem is solved. The question is no longer whether to move beyond flatten-and-aggregate, but which path makes sense for your team, your data, and your portfolio of use cases.
Economics by task count
| Number of Tasks | Flatten+Aggregate | Custom GNN | Foundation Model |
|---|---|---|---|
| 1 task (3-year) | $400K-800K | $500K-1M | $100K-150K |
| 5 tasks (3-year) | $1.5M-3M | $1M-2M | $200K-350K |
| 10 tasks (3-year) | $3M-5M | $1.5M-3M | $300K-420K |
| 20 tasks (3-year) | $6M-10M | $2.5M-5M | $400K-600K |
The cost advantage of foundation models scales with task count. At 10+ tasks, the gap is 8-10x.