
Making Predictions on Relational Data: The Complete Guide

Your data lives in connected tables. Your models expect flat tables. Three approaches bridge the gap. Here is what the benchmark results say about each.

TL;DR

  • Enterprise data lives in 10-50 connected tables, but ML models need flat input. This mismatch is the central bottleneck: feature engineering consumes 80% of data science time and averages 12.3 hours and 878 lines of code per task.
  • Three approaches compared on RelBench (7 databases, 30 tasks, 103M+ rows): flatten-and-aggregate (62.44 AUROC), graph neural networks (75.83 AUROC), and relational foundation models (76.71 zero-shot, 81.14 fine-tuned).
  • Flattening destroys information by design: multi-hop relationships, temporal sequences, and graph topology are lost during aggregation. GNNs and foundation models preserve the full relational structure.
  • Economics scale dramatically: at 10 tasks over 3 years, flatten-and-aggregate costs $3M-5M, custom GNNs cost $1.5M-3M, and foundation models cost $300K-420K -- an 8-10x gap.
  • Use flatten-and-aggregate for data that fits in fewer than 3 tables. Use custom GNNs for 1-2 high-stakes models with a specialized team. Use a foundation model for 5+ tasks where time-to-value and cost matter most.

Every enterprise relational database has the same fundamental structure: entities stored in tables, connected by foreign keys. Customers link to orders. Orders link to products. Products link to categories. Patients link to diagnoses, prescriptions, and lab results. The data is inherently connected.

Every mainstream ML model has the same fundamental requirement: a flat table. One row per entity, one column per feature. No foreign keys, no joins, no multi-table structure.

This mismatch is the central bottleneck in enterprise ML. It explains why data scientists spend 80% of their time on feature engineering. It explains why each prediction task takes 3 to 6 months. And it explains why most companies have fewer than 10 ML models in production despite sitting on hundreds of potential use cases.

Three approaches exist to bridge this gap. Each makes different trade-offs between accuracy, speed, and engineering effort. This guide covers all three with benchmark results from RelBench, the standard evaluation suite for ML on relational databases.

The problem: relational data vs. flat tables

A typical enterprise database has 10 to 50 tables. RelBench, the standard benchmark for evaluating ML on relational databases, includes datasets ranging from 3 tables (Amazon product data) to 15 tables (clinical trial data), with up to 41 million rows.

When you want to predict customer churn, the relevant information is scattered across multiple tables: customer demographics, order history, product details, support interactions, payment methods, website activity. Each table has a different granularity (one row per customer vs. one row per order vs. one row per page view), and the relationships between tables carry as much signal as the values within them.

The challenge is not just joining tables. It is deciding what to compute from those joins. Total spend? Average order value? Number of distinct products? Over what time window? And that is just for direct relationships. Multi-hop patterns (customers who bought products that other churning customers also bought) carry strong signal but require traversing 3 or 4 tables.

Example e-commerce schema

| Table | Rows | Key Columns | Connects To |
| --- | --- | --- | --- |
| customers | 2.1M | customer_id, signup_date, segment | orders, support_tickets |
| orders | 18.4M | order_id, customer_id, product_id, amount, date | customers, products |
| products | 145K | product_id, category, price, brand | orders, reviews |
| support_tickets | 3.2M | ticket_id, customer_id, category, resolved | customers |
| reviews | 8.7M | review_id, product_id, customer_id, rating, date | products, customers |

A typical e-commerce relational database. The prediction challenge: flatten 5 tables with 32M+ rows into one row per customer.

Approach 1: Flatten and aggregate

This is the default approach used by 90% of enterprise data science teams. You write SQL to join tables, compute aggregate features, and produce a flat table that a standard model (XGBoost, logistic regression, neural network) can consume.

To make the comparison concrete, here is the same prediction task (will this customer reorder within 30 days?) evaluated across all three approaches on the same data.

Same task, three approaches

| Customer | Flatten+Aggregate | GNN | Foundation Model | Actual |
| --- | --- | --- | --- | --- |
| C-001 (steady buyer) | 0.62 | 0.78 | 0.81 | Reordered (day 12) |
| C-002 (new, 1 order) | 0.50 (no signal) | 0.71 | 0.74 | Reordered (day 8) |
| C-003 (declining) | 0.58 | 0.34 | 0.29 | Did not reorder |
| C-004 (competitor product buyer) | 0.61 | 0.22 | 0.19 | Did not reorder |

C-002 is a cold-start customer (1 order). The flat model gives 0.50 (coin flip). GNN/FM use graph connections to predict 0.71/0.74. C-004 bought products that other churners also bought (multi-hop signal) -- only GNN/FM detect this.

How it works

For each prediction target, a data scientist writes JOIN queries across the relevant tables, applies aggregation functions (COUNT, SUM, AVG, MAX, MIN, COUNT DISTINCT) over various time windows (7 days, 30 days, 90 days, all time), and outputs a single table with one row per entity and hundreds of computed columns.

A typical churn model might have 200 to 500 features engineered from 5 to 10 source tables. The SQL for this runs 500 to 2,000 lines. A Stanford study measured the average cost at 12.3 hours and 878 lines of code per prediction task, even for experienced data scientists.
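As a concrete (and deliberately tiny) sketch, the join-then-aggregate step might look like the following in pandas. The schema, column names, and numbers are illustrative, not taken from a real dataset:

```python
# Hypothetical flatten-and-aggregate step: produce one row per customer
# with order features aggregated over a 30-day window.
import pandas as pd

orders = pd.DataFrame({
    "customer_id": ["C-001", "C-001", "C-002", "C-001"],
    "amount": [40.0, 95.0, 12.5, 127.5],
    "date": pd.to_datetime(["2025-01-05", "2025-01-20",
                            "2024-10-01", "2025-01-28"]),
})
cutoff = pd.Timestamp("2025-02-01")

# Only orders inside the window contribute to the aggregates.
recent = orders[orders["date"] >= cutoff - pd.Timedelta(days=30)]
features = recent.groupby("customer_id").agg(
    orders_30d=("amount", "count"),
    avg_amount=("amount", "mean"),
)
# Recency feature computed over the full history.
features["days_since_last"] = (
    cutoff - orders.groupby("customer_id")["date"].max()
).dt.days
print(features)
```

Note how C-002 falls out of the window entirely: a cold-start customer produces no aggregate signal, which is exactly the failure mode visible in the comparison table above.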

Flattened feature table example

| customer_id | orders_30d | avg_amount | days_since_last | tickets_open | avg_review |
| --- | --- | --- | --- | --- | --- |
| C-001 | 3 | $87.50 | 5 | 0 | 4.2 |
| C-002 | 0 | $0.00 | 94 | 2 | 2.1 |
| C-003 | 7 | $142.30 | 1 | 0 | 4.8 |
| C-004 | 1 | $34.00 | 28 | 1 | 3.5 |

What XGBoost sees after flattening. Missing: order sequence, ticket categories, product return rates, similar-customer behavior, temporal acceleration.

What it gets right

  • Uses battle-tested ML models (XGBoost, LightGBM) that data scientists understand well
  • Interpretable features that business stakeholders can validate
  • Low-latency inference (no graph lookups at serving time)
  • Mature tooling for model monitoring, A/B testing, and deployment

What it misses

  • Multi-hop patterns. A customer's churn risk depends on the behavior of similar customers 2 to 3 hops away. Nobody writes these features because the join paths are too complex to enumerate manually.
  • Temporal sequences. Aggregating "5 orders in 30 days" destroys the sequence: were they accelerating, decelerating, or clustered? The ordering carries signal that averages erase.
  • Feature interactions. The combination of high return rate + declining order value + increasing support tickets is more predictive than any single feature. The combinatorial space of interactions is too large to engineer by hand.
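To make the temporal point concrete, here is a toy sketch (order days invented) of two customers who are identical under a COUNT aggregate but moving in opposite directions. A simple gap-trend feature separates them, but nobody enumerates features like this by hand for every table and every window:

```python
# Two customers, both "5 orders in 30 days": identical under COUNT,
# opposite trajectories. Values are the day of each order in the window.
accelerating = [1, 9, 16, 22, 27]   # gaps shrink: 8, 7, 6, 5
decelerating = [1, 6, 12, 19, 27]   # gaps grow:   5, 6, 7, 8

def gaps(days):
    """Days between consecutive orders."""
    return [b - a for a, b in zip(days, days[1:])]

def gap_trend(days):
    """Negative = orders speeding up, positive = slowing down."""
    g = gaps(days)
    return g[-1] - g[0]

# The COUNT aggregate cannot tell these customers apart...
assert len(accelerating) == len(decelerating) == 5
# ...but the sequence can.
print(gap_trend(accelerating))  # -3
print(gap_trend(decelerating))  # 3
```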

Benchmark results

On RelBench (7 databases, 30 tasks, 103M+ rows), LightGBM with features engineered by a Stanford-trained data scientist achieves an average AUROC of 62.44 on classification tasks. This is the baseline that represents best-effort manual feature engineering with unlimited time.

Approach 2: Graph neural networks

Instead of flattening the relational structure, represent it as a graph and train a model that operates directly on that graph.

How it works

The relational database is converted into a heterogeneous graph. Each table row becomes a node, tagged with its entity type (customer, order, product). Each foreign key becomes an edge connecting the relevant nodes. Timestamps on rows create a temporal dimension, allowing the model to reason about sequences and recency.

A graph neural network then learns through message passing. In each layer, every node aggregates information from its neighbors, applies a learned transformation, and updates its representation. After k layers, each node encodes information from all entities within k hops. A 3-layer GNN on a customer node captures the customer's orders, the products in those orders, other customers who bought those products, and their behavior patterns.
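A minimal sketch of the message-passing mechanics, with plain averaging standing in for the learned transformations a real GNN applies. Node IDs and values are invented for illustration:

```python
# Toy customer -> order -> product graph. Each layer replaces a node's
# value with the mean of itself and its neighbours; after k layers a
# node reflects entities within k hops.
edges = {  # undirected adjacency over typed nodes
    "cust:C-001": ["ord:O-1", "ord:O-2"],
    "ord:O-1": ["cust:C-001", "prod:P-9"],
    "ord:O-2": ["cust:C-001", "prod:P-9"],
    "prod:P-9": ["ord:O-1", "ord:O-2"],
}
h = {"cust:C-001": 0.0, "ord:O-1": 1.0, "ord:O-2": 3.0, "prod:P-9": 5.0}

def message_pass(h, edges):
    new_h = {}
    for node, value in h.items():
        neigh = [h[n] for n in edges[node]]
        new_h[node] = (value + sum(neigh)) / (1 + len(neigh))
    return new_h

# After one layer the customer reflects its orders; after two it also
# reflects the product behind them -- a 2-hop signal.
h1 = message_pass(h, edges)
h2 = message_pass(h1, edges)
print(h1["cust:C-001"], h2["cust:C-001"])
```

Real heterogeneous GNNs use per-edge-type weight matrices and nonlinearities (e.g. a relational GCN) rather than plain means, but the hop-by-hop information flow is the same.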

What it gets right

  • Preserves the relational structure, no information loss from flattening
  • Automatically discovers multi-hop patterns that manual engineering misses
  • Temporal encoding captures sequence patterns that aggregates destroy
  • One architecture handles any prediction task on the same graph

What it requires

  • Specialized expertise. GNN architecture design, message passing schemes, neighborhood sampling, and temporal encoding are not standard data science skills. Most teams need to hire or upskill.
  • Graph construction pipeline. Converting a relational database to a graph requires an ETL pipeline that handles schema mapping, edge creation, temporal ordering, and incremental updates.
  • GPU infrastructure. GNN training on enterprise-scale graphs requires 1 to 8 GPUs depending on graph size. Training runs take hours to days.
  • Time. First production model takes 3 to 6 months for a team of 2 to 3 ML engineers.

Benchmark results

On RelBench, a supervised GNN (trained per task) achieves an average AUROC of 75.83 on classification tasks. That is a 13.4-point improvement over the LightGBM baseline. On regression tasks, the improvement averages 15 to 25% in MAE reduction.

Flatten and aggregate

  • 12.3 hours per prediction task
  • 878 lines of code per task
  • 62.44 average AUROC on RelBench
  • Misses multi-hop and temporal patterns
  • Rebuild from scratch for every new task

Graph neural network

  • 3-6 months for first model, then reusable
  • Architecture handles any task on the graph
  • 75.83 average AUROC on RelBench
  • Captures multi-hop and temporal patterns
  • Requires GNN expertise and GPU infrastructure

PQL Query

PREDICT COUNT(orders.*, 0, 30) > 0
FOR EACH customers.customer_id
WHERE customers.signup_date < '2025-01-01'

A GNN learns from the full graph: customer -> orders -> products -> other customers -> their behavior. This multi-hop traversal captures patterns no flat table can represent.

Output

| customer_id | churn_probability | hop_depth_signal | top_pattern |
| --- | --- | --- | --- |
| C-001 | 0.12 | 2-hop | Similar customers highly active |
| C-002 | 0.89 | 3-hop | Products they bought have 40% return rate |
| C-003 | 0.05 | 1-hop | Accelerating order frequency |
| C-004 | 0.67 | 2-hop | Support agent has low resolution rate |

Approach 3: Relational foundation model

A foundation model pre-trained on relational data from thousands of diverse databases. It learns universal patterns during pre-training and applies them to any new database at inference time.

How it works

KumoRFM, the first relational foundation model, was pre-trained on data from over 5,000 relational databases spanning e-commerce, financial services, healthcare, manufacturing, and SaaS. During pre-training, it learned the universal patterns that recur across relational data: recency effects, frequency patterns, temporal dynamics, graph topology, cross-table signal propagation.

At inference time, you connect your relational database and write a prediction query in PQL (Predictive Query Language). The model reads your schema, constructs a temporal graph internally, and produces predictions. No training step, no feature engineering, no graph construction pipeline.

For higher accuracy on specific tasks, you can fine-tune the model on your data. Fine-tuning takes hours, not months, because the model already understands relational patterns and only needs to adapt to your specific schema and distribution.

What it gets right

  • Zero-shot predictions in seconds, no training or engineering required
  • Matches or exceeds supervised GNNs on most tasks without task-specific training
  • One model handles any prediction task on any relational database
  • No ML expertise required; PQL looks like SQL with a PREDICT clause

What it requires

  • Trust in a pre-trained model (similar to using GPT vs. training your own LLM)
  • Database connectivity for the model to read your schema and data
  • Willingness to validate predictions before production deployment

Benchmark results

On RelBench, KumoRFM zero-shot achieves an average AUROC of 76.71 on classification tasks, outperforming the supervised GNN baseline (75.83) without any task-specific training. Fine-tuned, it reaches 81.14 AUROC. On regression tasks, zero-shot MAE is competitive with trained GNNs, and fine-tuning reduces error by an additional 10 to 20%.

Head-to-head comparison

Here are the three approaches compared across the dimensions that matter for enterprise deployment.

| Dimension | Flatten + aggregate | GNN | Foundation model |
| --- | --- | --- | --- |
| AUROC (RelBench avg) | 62.44 | 75.83 | 76.71 (zero-shot) / 81.14 (fine-tuned) |
| Time to first prediction | 2-6 months | 3-6 months | Minutes (zero-shot) / hours (fine-tuned) |
| Team required | 2-3 data scientists | 2-3 ML engineers with GNN skills | Anyone who can write SQL |
| Per-task marginal cost | $150K-500K (team time) | $50K-200K (after first model) | Near-zero (same model, new query) |
| Multi-hop signals | Rarely captured | Captured automatically | Captured automatically |
| Temporal patterns | Lost in aggregation | Preserved with temporal encoding | Preserved with temporal encoding |
| Cold-start entities | No prediction possible | Predictions from graph structure | Predictions from graph structure |

When to use each approach

Use flatten-and-aggregate when:

  • Your data genuinely lives in a single table
  • You have fewer than 3 interconnected tables
  • Regulatory requirements demand fully interpretable features
  • You are building a proof of concept with a 2-week deadline

Use a custom GNN when:

  • You have a unique graph structure that differs significantly from standard relational schemas
  • You need full control over the model architecture for research or competitive advantage
  • You have 2 to 3 ML engineers with GNN experience and 6 months of runway
  • You are building 1 to 2 high-stakes models, not a portfolio

Use a relational foundation model when:

  • You need predictions across 5 or more tasks on the same relational database
  • Time to value matters more than architectural control
  • Your team does not have GNN expertise and cannot hire for it
  • You want to evaluate graph ML's potential before committing to a custom build

The economics of each approach

The cost difference becomes dramatic when you move beyond a single prediction task.

Flatten-and-aggregate: Each new prediction task requires a new round of feature engineering. If you need 10 models, you need 10 rounds of SQL writing, feature selection, and model training. At $150K to $500K per model (team time, infrastructure, opportunity cost), a portfolio of 10 models costs $1.5M to $5M.

Custom GNN: The first model is expensive (6 months, $500K to $1M). But the graph and architecture are reusable. Each additional task costs $50K to $200K for fine-tuning and validation. 10 models cost $1M to $3M total.

Foundation model: Connect your database once. Each new prediction task is a new PQL query, so the marginal cost per task approaches zero. 10 models cost the platform fee plus validation time: roughly $300K to $420K over 3 years, depending on data volume and query frequency.

Getting started

If your data lives in a relational database with 3 or more tables, start by benchmarking all three approaches on one prediction task. Use your existing flat model as the baseline, run a zero-shot foundation model prediction as the quick test, and evaluate whether the accuracy difference justifies changing your approach.
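When running that benchmark, score every approach on the same held-out entities with the same metric. As a minimal sketch, here is AUROC computed from first principles (the probability that a random positive outranks a random negative), applied to the illustrative scores from the four-customer comparison earlier:

```python
def auroc(scores, labels):
    """Pairwise AUROC: fraction of (positive, negative) pairs where the
    positive is ranked higher; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0]                   # C-001..C-004: reordered?
flat_scores = [0.62, 0.50, 0.58, 0.61]  # flatten-and-aggregate
fm_scores = [0.81, 0.74, 0.29, 0.19]    # foundation model (zero-shot)

print(auroc(flat_scores, labels))  # 0.5: no better than chance
print(auroc(fm_scores, labels))    # 1.0: every positive outranks every negative
```

Four entities is a toy, of course; a real evaluation needs a proper temporal holdout so that no feature or prediction leaks information from after the cutoff.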

Most teams find that the foundation model matches or exceeds their manual pipeline on the first attempt. The ones that do not typically have extremely domain-specific data (proprietary sensor readings, custom encodings) where fine-tuning closes the gap.

The relational data prediction problem is solved. The question is no longer whether to move beyond flatten-and-aggregate, but which path makes sense for your team, your data, and your portfolio of use cases.

Economics by task count

| Number of Tasks | Flatten+Aggregate | Custom GNN | Foundation Model |
| --- | --- | --- | --- |
| 1 task (3-year) | $400K-800K | $500K-1M | $100K-150K |
| 5 tasks (3-year) | $1.5M-3M | $1M-2M | $200K-350K |
| 10 tasks (3-year) | $3M-5M | $1.5M-3M | $300K-420K |
| 20 tasks (3-year) | $6M-10M | $2.5M-5M | $400K-600K |

The cost advantage of foundation models scales with task count. At 10+ tasks, the gap is 8-10x.

Frequently asked questions

Why can't traditional ML models read relational databases directly?

Traditional ML models (XGBoost, random forests, logistic regression, neural networks) require a flat input: one row per entity, one column per feature. Relational databases store data across multiple tables linked by foreign keys. The structure cannot be directly consumed by these models. Someone or something must first flatten the relational structure into a single table, which is what feature engineering does.

What is the flatten-and-aggregate approach to relational ML?

Flatten-and-aggregate is the traditional approach: join multiple tables using SQL, compute aggregations (count, sum, average, max, min) over various time windows, and produce a single flat table of features. This is how most enterprise ML teams work today. A Stanford study measured the cost at 12.3 hours and 878 lines of code per prediction task.

How do graph neural networks handle relational data?

GNNs represent a relational database as a graph: rows become nodes, foreign keys become edges. The GNN learns node embeddings through message passing, where each node aggregates information from its neighbors across multiple hops. This preserves the relational structure and lets the model discover cross-table patterns that manual feature engineering misses.

What is a relational foundation model?

A relational foundation model is a neural network pre-trained on data from thousands of diverse relational databases. Like GPT for text, it learns universal patterns in relational data (recency, frequency, temporal dynamics, graph topology) during pre-training. At inference time, it can make predictions on any new relational database without task-specific training.

Which approach to relational predictions is best for most enterprises?

For most enterprises running 5 or more prediction tasks on relational databases, a relational foundation model provides the best ROI. It matches or exceeds supervised GNN accuracy zero-shot (76.71 vs. 75.83 AUROC on RelBench), requires no feature engineering or ML expertise, and delivers predictions in seconds rather than months. For a single high-stakes model, a custom GNN may be justified.

See it in action

KumoRFM delivers predictions on relational data in seconds. No feature engineering, no ML pipelines. Try it free.