Every enterprise relational database has the same fundamental structure: entities stored in tables, connected by foreign keys. Customers link to orders. Orders link to products. Products link to categories. Patients link to diagnoses, prescriptions, and lab results. The data is inherently connected.
Every mainstream ML model has the same fundamental requirement: a flat table. One row per entity, one column per feature. No foreign keys, no joins, no multi-table structure.
This mismatch is the central bottleneck in enterprise ML. It explains why data scientists spend 80% of their time on feature engineering. It explains why each prediction task takes 3 to 6 months. And it explains why most companies have fewer than 10 ML models in production despite sitting on hundreds of potential use cases.
Three approaches exist to bridge this gap. Each makes different trade-offs between accuracy, speed, and engineering effort. This guide covers all three with benchmark results from RelBench, the standard evaluation suite for ML on relational databases.
The problem: relational data vs. flat tables
A typical enterprise database has 10 to 50 tables. RelBench datasets range from 3 tables (Amazon product data) to 15 tables (clinical trial data), with up to 41 million rows.
When you want to predict customer churn, the relevant information is scattered across multiple tables: customer demographics, order history, product details, support interactions, payment methods, website activity. Each table has a different granularity (one row per customer vs. one row per order vs. one row per page view), and the relationships between tables carry as much signal as the values within them.
The challenge is not just joining tables. It is deciding what to compute from those joins. Total spend? Average order value? Number of distinct products? Over what time window? And that is just for direct relationships. Multi-hop patterns (customers who bought products that other churning customers also bought) carry strong signal but require traversing 3 or 4 tables.
Example e-commerce schema
| Table | Rows | Key Columns | Connects To |
|---|---|---|---|
| customers | 2.1M | customer_id, signup_date, segment | orders, support_tickets |
| orders | 18.4M | order_id, customer_id, product_id, amount, date | customers, products |
| products | 145K | product_id, category, price, brand | orders, reviews |
| support_tickets | 3.2M | ticket_id, customer_id, category, resolved | customers |
| reviews | 8.7M | review_id, product_id, customer_id, rating, date | products, customers |
A typical e-commerce relational database. The prediction challenge: flatten 5 tables with 32M+ rows into one row per customer.
Approach 1: Flatten and aggregate
This is the default approach for the vast majority of enterprise data science teams. You write SQL to join tables, compute aggregate features, and produce a flat table that a standard model (XGBoost, logistic regression, neural network) can consume.
To make the comparison concrete, here is the same prediction task (will this customer reorder within 30 days?) evaluated across all three approaches on the same data.
Same task, three approaches
| Customer | Flatten+Aggregate | GNN | Foundation Model | Actual |
|---|---|---|---|---|
| C-001 (steady buyer) | 0.62 | 0.78 | 0.81 | Reordered (day 12) |
| C-002 (new, 1 order) | 0.50 (no signal) | 0.71 | 0.74 | Reordered (day 8) |
| C-003 (declining) | 0.58 | 0.34 | 0.29 | Did not reorder |
| C-004 (competitor product buyer) | 0.61 | 0.22 | 0.19 | Did not reorder |
C-002 is a cold-start customer (1 order). The flat model gives 0.50 (coin flip). GNN/FM use graph connections to predict 0.71/0.74. C-004 bought products that other churners also bought (multi-hop signal) -- only GNN/FM detect this.
How it works
For each prediction target, a data scientist writes JOIN queries across the relevant tables, applies aggregation functions (COUNT, SUM, AVG, MAX, MIN, COUNT DISTINCT) over various time windows (7 days, 30 days, 90 days, all time), and outputs a single table with one row per entity and hundreds of computed columns.
A typical churn model might have 200 to 500 features engineered from 5 to 10 source tables, with 500 to 2,000 lines of SQL behind them. The Stanford RelBench study measured the average cost at 12.3 hours and 878 lines of code per prediction task, even for experienced data scientists.
Flattened feature table example
| customer_id | orders_30d | avg_amount | days_since_last | tickets_open | avg_review |
|---|---|---|---|---|---|
| C-001 | 3 | $87.50 | 5 | 0 | 4.2 |
| C-002 | 0 | $0.00 | 94 | 2 | 2.1 |
| C-003 | 7 | $142.30 | 1 | 0 | 4.8 |
| C-004 | 1 | $34.00 | 28 | 1 | 3.5 |
What XGBoost sees after flattening. Missing: order sequence, ticket categories, product return rates, similar-customer behavior, temporal acceleration.
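A minimal pandas sketch of this flattening step, using hypothetical toy versions of the customers and orders tables (column names are illustrative; a production pipeline would compute hundreds of such columns in SQL):

```python
import pandas as pd

# Hypothetical toy data standing in for the customers and orders tables.
customers = pd.DataFrame({"customer_id": ["C-001", "C-002"]})
orders = pd.DataFrame({
    "customer_id": ["C-001", "C-001", "C-001", "C-002"],
    "amount": [80.0, 90.0, 92.5, 34.0],
    "date": pd.to_datetime(["2025-03-05", "2025-03-12", "2025-03-28", "2024-12-29"]),
})

as_of = pd.Timestamp("2025-04-02")  # prediction cutoff

# One aggregation pass: order count and average amount over a 30-day
# window, plus recency over all time -- one row per customer at the end.
recent = orders[orders["date"] >= as_of - pd.Timedelta(days=30)]
features = customers.merge(
    recent.groupby("customer_id")
          .agg(orders_30d=("amount", "size"), avg_amount=("amount", "mean"))
          .reset_index(),
    on="customer_id", how="left",
).merge(
    orders.groupby("customer_id")["date"].max().rename("last_order").reset_index(),
    on="customer_id", how="left",
)
features["days_since_last"] = (as_of - features["last_order"]).dt.days
features = features.fillna({"orders_30d": 0, "avg_amount": 0.0}).drop(columns="last_order")
print(features)
```

Every design decision here (which window, which aggregate, which cutoff) has to be made by hand, per feature, per task.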
What it gets right
- Uses battle-tested ML models (XGBoost, LightGBM) that data scientists understand well
- Interpretable features that business stakeholders can validate
- Low-latency inference (no graph lookups at serving time)
- Mature tooling for model monitoring, A/B testing, and deployment
What it misses
- Multi-hop patterns. A customer's churn risk depends on the behavior of similar customers 2 to 3 hops away. Nobody writes these features because the join paths are too complex to enumerate manually.
- Temporal sequences. Aggregating "5 orders in 30 days" destroys the sequence: were they accelerating, decelerating, or clustered? The ordering carries signal that averages erase.
- Feature interactions. The combination of high return rate + declining order value + increasing support tickets is more predictive than any single feature. The combinatorial space of interactions is too large to engineer by hand.
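The temporal-sequence point is easy to demonstrate. In this sketch, two hypothetical customers each place 5 orders in 30 days, so a count feature cannot tell them apart; a simple gap-trend feature can (real pipelines rarely include one):

```python
from datetime import date

# Two hypothetical customers: same order count over 30 days, opposite trends.
accelerating = [date(2025, 3, d) for d in (1, 12, 20, 25, 28)]  # gaps shrinking
decelerating = [date(2025, 3, d) for d in (1, 4, 9, 17, 28)]    # gaps growing

def order_count(dates):
    return len(dates)  # what an "orders_30d" feature sees: identical for both

def gap_trend(dates):
    """Last gap minus first gap: negative = accelerating, positive = decelerating."""
    gaps = [(b - a).days for a, b in zip(dates, dates[1:])]
    return gaps[-1] - gaps[0]

print(order_count(accelerating), order_count(decelerating))  # 5 5
print(gap_trend(accelerating))   # -8: speeding up
print(gap_trend(decelerating))   # 8: slowing down
```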
Benchmark results
On RelBench (7 databases, 30 tasks, 103M+ rows), LightGBM with features engineered by a Stanford-trained data scientist achieves an average AUROC of 62.44 on classification tasks. This is the baseline that represents best-effort manual feature engineering with unlimited time.
Approach 2: Graph neural networks
Instead of flattening the relational structure, represent it as a graph and train a model that operates directly on that graph.
How it works
The relational database is converted into a heterogeneous graph. Each table row becomes a node, tagged with its entity type (customer, order, product). Each foreign key becomes an edge connecting the relevant nodes. Timestamps on rows create a temporal dimension, allowing the model to reason about sequences and recency.
A graph neural network then learns through message passing. In each layer, every node aggregates information from its neighbors, applies a learned transformation, and updates its representation. After k layers, each node encodes information from all entities within k hops. A 3-layer GNN on a customer node captures the customer's orders, the products in those orders, other customers who bought those products, and their behavior patterns.
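The two steps above can be sketched in a few lines of plain Python. Foreign-key pairs become edges, and each message-passing round averages neighbor values into a node's own representation. A real GNN applies learned transformations rather than plain averaging, and the table rows are hypothetical, but the information flow is the same:

```python
# Step 1: convert foreign-key rows into an undirected adjacency list.
# Each (customer_id, product_id) pair from a toy orders table becomes an edge.
order_rows = [("C-001", "P-9"), ("C-002", "P-9"), ("C-002", "P-7")]
adj = {}
for customer, product in order_rows:
    adj.setdefault(customer, set()).add(product)
    adj.setdefault(product, set()).add(customer)

# Step 2: message passing. Each node starts with a scalar feature; in each
# layer it averages its neighbors' previous values with its own.
h = {node: float(i) for i, node in enumerate(sorted(adj))}
for _ in range(3):
    h = {
        node: (h[node] + sum(h[nbr] for nbr in adj[node])) / (1 + len(adj[node]))
        for node in adj
    }

# After 3 layers, C-001's representation depends on P-7, a product C-001
# never bought, via the 3-hop path C-001 -> P-9 -> C-002 -> P-7.
print(h["C-001"])
```

This is exactly the multi-hop signal that no hand-written aggregate captures: C-001's representation now reflects the behavior of another customer who shares a product with it.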
What it gets right
- Preserves the relational structure, no information loss from flattening
- Automatically discovers multi-hop patterns that manual engineering misses
- Temporal encoding captures sequence patterns that aggregates destroy
- One architecture handles any prediction task on the same graph
What it requires
- Specialized expertise. GNN architecture design, message passing schemes, neighborhood sampling, and temporal encoding are not standard data science skills. Most teams need to hire or upskill.
- Graph construction pipeline. Converting a relational database to a graph requires an ETL pipeline that handles schema mapping, edge creation, temporal ordering, and incremental updates.
- GPU infrastructure. GNN training on enterprise-scale graphs requires 1 to 8 GPUs depending on graph size. Training runs take hours to days.
- Time. First production model takes 3 to 6 months for a team of 2 to 3 ML engineers.
Benchmark results
On RelBench, a supervised GNN (trained per task) achieves an average AUROC of 75.83 on classification tasks. That is a 13.4-point improvement over the LightGBM baseline. On regression tasks, MAE improves by 15 to 25% on average.
Flatten and aggregate
- 12.3 hours per prediction task
- 878 lines of code per task
- 62.44 average AUROC on RelBench
- Misses multi-hop and temporal patterns
- Rebuild from scratch for every new task
Graph neural network
- 3-6 months for first model, then reusable
- Architecture handles any task on the graph
- 75.83 average AUROC on RelBench
- Captures multi-hop and temporal patterns
- Requires GNN expertise and GPU infrastructure
PQL Query
```sql
PREDICT COUNT(orders.*, 0, 30) = 0
FOR EACH customers.customer_id
WHERE customers.signup_date < '2025-01-01'
```
A GNN learns from the full graph: customer -> orders -> products -> other customers -> their behavior. This multi-hop traversal captures patterns no flat table can represent.
Output
| customer_id | churn_probability | hop_depth_signal | top_pattern |
|---|---|---|---|
| C-001 | 0.12 | 2-hop | Similar customers highly active |
| C-002 | 0.89 | 3-hop | Products they bought have 40% return rate |
| C-003 | 0.05 | 1-hop | Accelerating order frequency |
| C-004 | 0.67 | 2-hop | Support agent has low resolution rate |
Approach 3: Relational foundation model
A foundation model pre-trained on relational data from thousands of diverse databases. It learns universal patterns during pre-training and applies them to any new database at inference time.
How it works
KumoRFM, the first relational foundation model, was pre-trained on data from over 5,000 relational databases spanning e-commerce, financial services, healthcare, manufacturing, and SaaS. During pre-training, it learned the universal patterns that recur across relational data: recency effects, frequency patterns, temporal dynamics, graph topology, cross-table signal propagation.
At inference time, you connect your relational database and write a prediction query in PQL (Predictive Query Language). The model reads your schema, constructs a temporal graph internally, and produces predictions. No training step, no feature engineering, no graph construction pipeline.
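As a concrete sketch, a new prediction task against the example e-commerce schema is just another query. The syntax below is modeled on the PQL query shown earlier; exact forms may differ, so treat it as illustrative:

```sql
-- Predicted spend per customer over the next 30 days (a regression task):
PREDICT SUM(orders.amount, 0, 30)
FOR EACH customers.customer_id
```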
For higher accuracy on specific tasks, you can fine-tune the model on your data. Fine-tuning takes hours, not months, because the model already understands relational patterns and only needs to adapt to your specific schema and distribution.
What it gets right
- Zero-shot predictions in seconds, no training or engineering required
- Matches or exceeds supervised GNNs on most tasks without task-specific training
- One model handles any prediction task on any relational database
- No ML expertise required; PQL looks like SQL with a PREDICT clause
What it requires
- Trust in a pre-trained model (similar to using GPT vs. training your own LLM)
- Database connectivity for the model to read your schema and data
- Willingness to validate predictions before production deployment
Benchmark results
On RelBench, KumoRFM zero-shot achieves an average AUROC of 76.71 on classification tasks, outperforming the supervised GNN baseline (75.83) without any task-specific training. Fine-tuned, it reaches 81.14 AUROC. On regression tasks, zero-shot MAE is competitive with trained GNNs, and fine-tuning reduces error by an additional 10 to 20%.
Head-to-head comparison
Here are the three approaches compared across the dimensions that matter for enterprise deployment.
| Dimension | Flatten + aggregate | GNN | Foundation model |
|---|---|---|---|
| AUROC (RelBench avg) | 62.44 | 75.83 | 76.71 (zero-shot) / 81.14 (fine-tuned) |
| Time to first prediction | 2-6 months | 3-6 months | Minutes (zero-shot) / hours (fine-tuned) |
| Team required | 2-3 data scientists | 2-3 ML engineers with GNN skills | Anyone who can write SQL |
| Per-task marginal cost | $150K-500K (team time) | $50K-200K (after first model) | Near-zero (same model, new query) |
| Multi-hop signals | Rarely captured | Captured automatically | Captured automatically |
| Temporal patterns | Lost in aggregation | Preserved with temporal encoding | Preserved with temporal encoding |
| Cold-start entities | No prediction possible | Predictions from graph structure | Predictions from graph structure |
When to use each approach
Use flatten-and-aggregate when:
- Your data genuinely lives in a single table
- You have fewer than 3 interconnected tables
- Regulatory requirements demand fully interpretable features
- You are building a proof of concept with a 2-week deadline
Use a custom GNN when:
- You have a unique graph structure that differs significantly from standard relational schemas
- You need full control over the model architecture for research or competitive advantage
- You have 2 to 3 ML engineers with GNN experience and 6 months of runway
- You are building 1 to 2 high-stakes models, not a portfolio
Use a relational foundation model when:
- You need predictions across 5 or more tasks on the same relational database
- Time to value matters more than architectural control
- Your team does not have GNN expertise and cannot hire for it
- You want to evaluate graph ML's potential before committing to a custom build
The economics of each approach
The cost difference becomes dramatic when you move beyond a single prediction task.
Flatten-and-aggregate: Each new prediction task requires a new round of feature engineering. If you need 10 models, you need 10 rounds of SQL writing, feature selection, and model training. At $150K to $500K per model (team time, infrastructure, opportunity cost), a portfolio of 10 models costs $1.5M to $5M.
Custom GNN: The first model is expensive (6 months, $500K to $1M). But the graph and architecture are reusable. Each additional task costs $50K to $200K for fine-tuning and validation. 10 models cost $1M to $3M total.
Foundation model: Connect your database once. Each new prediction task is a new PQL query. The marginal cost per task approaches zero. 10 models cost the platform fee plus validation time. Total: $100K to $300K depending on data volume and query frequency.
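The portfolio arithmetic can be made explicit. This sketch just multiplies out midpoints of the ranges quoted above; the specific dollar figures are illustrative assumptions, not a pricing model:

```python
# Midpoint cost assumptions drawn from the ranges quoted above (illustrative).
FLATTEN_PER_TASK = 325_000                  # midpoint of $150K-500K per model
GNN_FIRST, GNN_NEXT = 750_000, 125_000      # $500K-1M first model, $50K-200K after
FM_PLATFORM, FM_PER_TASK = 200_000, 5_000   # hypothetical platform fee + near-zero marginal cost

def portfolio_cost(n_tasks):
    """Total cost of an n-task portfolio under each approach."""
    return {
        "flatten": FLATTEN_PER_TASK * n_tasks,
        "gnn": GNN_FIRST + GNN_NEXT * (n_tasks - 1),
        "foundation_model": FM_PLATFORM + FM_PER_TASK * n_tasks,
    }

for n in (1, 5, 10):
    print(n, portfolio_cost(n))
```

At these midpoints, flatten-and-aggregate scales linearly with no reuse, the GNN amortizes its large fixed cost, and the foundation model's curve is nearly flat.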
Getting started
If your data lives in a relational database with 3 or more tables, start by benchmarking all three approaches on one prediction task. Use your existing flat model as the baseline, run a zero-shot foundation model prediction as the quick test, and evaluate whether the accuracy difference justifies changing your approach.
Most teams find that the foundation model matches or exceeds their manual pipeline on the first attempt. The ones that do not typically have extremely domain-specific data (proprietary sensor readings, custom encodings) where fine-tuning closes the gap.
The relational data prediction problem is solved. The question is no longer whether to move beyond flatten-and-aggregate, but which path makes sense for your team, your data, and your portfolio of use cases.
Economics by task count
| Number of Tasks | Flatten+Aggregate | Custom GNN | Foundation Model |
|---|---|---|---|
| 1 task (3-year) | $400K-800K | $500K-1M | $100K-150K |
| 5 tasks (3-year) | $1.5M-3M | $1M-2M | $200K-350K |
| 10 tasks (3-year) | $3M-5M | $1.5M-3M | $300K-420K |
| 20 tasks (3-year) | $6M-10M | $2.5M-5M | $400K-600K |
The cost advantage of foundation models scales with task count. At 10+ tasks, the gap is 8-10x.