
Understanding Relational Deep Learning
How representing relational databases as temporal heterogeneous graphs with GNNs eliminates manual feature engineering and outperforms hand-crafted features across 30 benchmark tasks.
The Problem: ML on Relational Data
Most of the world's high-value enterprise data lives in relational databases. Customer tables link to transaction tables, which link to product tables, which link to category tables. A typical enterprise database has 10 to 50 interconnected tables with billions of rows, all connected through primary-foreign key relationships.
Yet no standard machine learning method operates directly on this relational structure. Every prediction task (churn, fraud, recommendations, demand forecasting) requires a manual pipeline: join tables with SQL, compute aggregate features, flatten everything into a single table, then train a model on that flat representation. This process is called feature engineering, and it dominates the time, cost, and failure modes of enterprise ML.
The RDL paper measured this precisely. A data scientist with a Stanford CS Master's degree and five years of industry experience needs an average of 12.3 hours and 878 lines of code (standard deviation: 77 lines) to build a working pipeline for a single prediction task on a relational database. That covers writing SQL joins, designing aggregate features, handling temporal correctness, training a model, and evaluating it.
And the process restarts from scratch for every new question. Churn prediction needs different joins and aggregations than lifetime value estimation. Fraud detection needs a different feature set than product recommendations. There is no reuse across tasks.
The scale of the bottleneck
Feature engineering is not just slow. It is the single largest bottleneck in the enterprise ML lifecycle. Surveys consistently show that data scientists spend 60-80% of their time on data preparation and feature engineering, not on modeling. For organizations with hundreds of potential prediction tasks across their relational data, the current approach simply does not scale.
The RDL paper frames this as a fundamental mismatch: relational databases store data in a rich, interconnected structure, but ML algorithms require flat, single-table inputs. Bridging that gap manually is where all the cost lives.
Why Flat Tables Fail
The standard approach to ML on relational data works like this: write SQL queries to join multiple tables, compute aggregate features (counts, sums, averages over various time windows), and collapse everything into one row per entity. Then feed this flat table to XGBoost, LightGBM, or a similar model.
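As a minimal sketch of that flattening step, here is the "join + aggregate + collapse" pattern in plain Python (the schema and column names are hypothetical, standing in for the SQL a practitioner would write):

```python
from collections import defaultdict

# Toy relational data, one dict per row (hypothetical schema).
customers = [{"customer_id": 1, "tier": "gold"},
             {"customer_id": 2, "tier": "basic"}]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 30.0},
    {"order_id": 11, "customer_id": 1, "amount": 50.0},
    {"order_id": 12, "customer_id": 2, "amount": 20.0},
]

# Aggregate the orders table per customer (what GROUP BY would do).
totals = defaultdict(float)
counts = defaultdict(int)
for o in orders:
    totals[o["customer_id"]] += o["amount"]
    counts[o["customer_id"]] += 1

# Collapse everything into one flat row per entity.
flat = [
    {"customer_id": c["customer_id"], "tier": c["tier"],
     "order_count": counts[c["customer_id"]],
     "total_spend": totals[c["customer_id"]]}
    for c in customers
]
```

The `flat` table is what gets handed to XGBoost or LightGBM; every relationship between rows has been reduced to a handful of per-entity aggregates.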
This flattening is not just inconvenient. It systematically destroys information that carries predictive signal.
What gets lost
- Multi-hop relationships. A customer's purchasing behavior is influenced by the products they bought, which categories those products belong to, which other customers bought similar products, and what those customers did next. These transitive patterns across 3, 4, or 5 table hops contain some of the strongest predictive signals. Flattening collapses them into aggregate counts and averages, destroying the relational structure.
- Temporal sequences. A flat table might record “user placed 5 orders in the last 30 days.” But it loses the sequence: did the user place all 5 in the first week and then go silent, or one per week with increasing order value? These temporal dynamics carry entirely different signals, but the aggregate (count = 5) is identical.
- Graph topology. Fraud rings share structural signatures: tightly connected clusters of accounts transacting with each other through intermediary entities. Supply chain cascades propagate through specific network topologies. These patterns live in the shape of connections, not in individual row attributes. Flattening destroys topology entirely.
- Combinatorial feature interactions. The paper demonstrates that no human can enumerate the combinatorial space of patterns across interconnected tables. A data scientist might create 50-100 hand-crafted features. But the actual space of useful multi-table, multi-hop, time-windowed combinations is orders of magnitude larger.
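The temporal-sequence point can be made concrete with two hypothetical users whose very different histories collapse to the same aggregate feature:

```python
# Day offsets of 5 orders within a 30-day window (hypothetical users).
user_a = [1, 2, 3, 4, 5]        # burst of orders in week 1, then silence
user_b = [2, 9, 16, 23, 30]     # steady, roughly one order per week

# The flat feature "orders in last 30 days" is identical for both...
assert len(user_a) == len(user_b) == 5

# ...but the dynamics differ sharply: days since most recent order.
recency_a = 30 - max(user_a)    # long silent
recency_b = 30 - max(user_b)    # active right now
```

A flat count of 5 treats these users identically; a model that sees the timestamped rows does not.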
The coverage problem
Even given unlimited time, feature engineering cannot solve this. The issue is not speed but coverage. A data scientist explores a tiny corner of the possible feature space, guided by intuition and domain knowledge. The combinatorial space of multi-hop, multi-table, time-windowed patterns is too large for any human to enumerate.
The paper's benchmark results confirm this: across 7 databases and 30 tasks, automated approaches that operate on the full relational structure consistently find patterns that human feature engineers miss.
The RDL Paradigm Shift
Relational Deep Learning (RDL) inverts the standard approach. Instead of flattening the database to fit the algorithm, RDL adapts the algorithm to fit the database. The key insight: a relational database is already a graph. You just need to make that graph explicit.
The mapping
- Each row in each table becomes a node in the graph. A customer row is a customer node. A transaction row is a transaction node. A product row is a product node.
- Each primary-foreign key relationship becomes an edge. If a transaction row references a customer via a foreign key, that creates an edge between the transaction node and the customer node.
- Node features come directly from column values in the row: numerical columns, categorical columns, text columns, timestamps.
- Timestamps on rows define the temporal ordering, creating a temporal heterogeneous graph where edges are only valid in specific time windows.
The result is a temporal, heterogeneous graph. Temporal because edges carry time information and must respect causality (no future data leakage). Heterogeneous because nodes and edges have different types corresponding to different tables and relationships.
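The row-to-node and key-to-edge mapping can be sketched in a few lines of plain Python (the tables, column names, and timestamps here are illustrative, not the paper's code):

```python
# Toy database: nodes keyed by (table, primary_key),
# one edge per foreign-key reference, with a timestamp where present.
tables = {
    "customers": [{"id": 1, "tier": "gold"}],
    "transactions": [
        {"id": 10, "customer_id": 1, "product_id": 100, "ts": 5},
        {"id": 11, "customer_id": 1, "product_id": 101, "ts": 9},
    ],
    "products": [{"id": 100, "price": 9.99}, {"id": 101, "price": 4.99}],
}
foreign_keys = {  # (table, column) -> referenced table
    ("transactions", "customer_id"): "customers",
    ("transactions", "product_id"): "products",
}

nodes, edges = {}, []
for table, rows in tables.items():
    for row in rows:
        # Every row becomes a typed node; columns become node features.
        nodes[(table, row["id"])] = {k: v for k, v in row.items() if k != "id"}
for (table, col), ref_table in foreign_keys.items():
    for row in tables[table]:
        # Every foreign-key reference becomes a (timestamped) edge.
        edges.append(((table, row["id"]), (ref_table, row[col]), row.get("ts")))
```

Running this on the toy schema yields 5 typed nodes (1 customer, 2 transactions, 2 products) and 4 timestamped edges, with no information discarded.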
Why this representation is natural
This is not an artificial construction. The relational database schema already encodes these relationships through foreign keys. RDL simply makes the implicit graph structure explicit and operates on it directly. The E.F. Codd relational model from 1970, which defines how relational databases work, is fundamentally a specification of entities (tables) and their relationships (keys). RDL leverages exactly that structure.
Traditional ML Pipeline
Flatten, then learn
- +Well-understood tools (XGBoost, LightGBM)
- +Fast inference on flat features
- −12.3 hours per task
- −878 lines of code per task
- −Destroys relational structure
- −No reuse across tasks
- −Coverage limited by human intuition
RDL (Relational Deep Learning)
Learn on the graph directly
- +Preserves full relational structure
- +End-to-end learning, no feature engineering
- +~30 minutes, 56 lines of code per task
- +Captures multi-hop and temporal patterns
- +Automated and reproducible
- −Requires GNN infrastructure
- −Training per task (no zero-shot)
How RDL Works: Graph Neural Networks on Databases
Once the relational database is represented as a temporal heterogeneous graph, RDL applies Graph Neural Networks (specifically GraphSAGE in the paper's baseline) to learn node representations through message passing.
Database to Graph
Convert each table row to a node, each foreign-key link to an edge. Attach column values as node features and timestamps for temporal ordering.
Temporal Subgraph Sampling
For each prediction target (e.g., a customer node at time t), sample a local subgraph containing only edges from before time t. This prevents data leakage from future events.
Message Passing (GNN)
Each node aggregates information from its neighbors via learned functions. After K rounds, each node's embedding encodes K-hop relational context across multiple tables.
Prediction Head
The target node's learned embedding is fed to a task-specific head for classification, regression, or link prediction.
Message passing in detail
Message passing is the core mechanism. In each round, every node collects “messages” from its neighbors (the connected rows in other tables), transforms them through a learned function, and aggregates them into an updated representation. After K rounds of message passing, each node's embedding encodes information from its K-hop neighborhood in the graph.
For a concrete example, consider an e-commerce database with customers, transactions, and products tables. After 2 rounds of message passing, a customer node's embedding captures: (1) its own attributes (age, signup date, membership tier), (2) its transactions (amounts, dates, frequencies), and (3) the products involved in those transactions (categories, prices, ratings). This is a 2-hop pattern: customer → transaction → product.
With 3 rounds, the embedding additionally captures what other customers bought the same products, enabling collaborative filtering patterns to emerge automatically from the graph structure.
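A stripped-down sketch of K rounds of message passing, with mean aggregation over an adjacency list and scalar "features" in place of learned transformations, makes the hop-counting visible (this is an illustration of the mechanism, not the paper's GraphSAGE model):

```python
# Toy graph: customer c1 -- transaction t1 -- product p1.
adj = {"c1": ["t1"], "t1": ["c1", "p1"], "p1": ["t1"]}
h = {"c1": 1.0, "t1": 2.0, "p1": 4.0}  # initial scalar node states

def message_passing_round(h, adj):
    # Each node averages its own state with its neighbors' states;
    # a real GNN applies learned transforms before aggregating.
    return {v: (h[v] + sum(h[u] for u in adj[v])) / (1 + len(adj[v]))
            for v in h}

h1 = message_passing_round(h, adj)   # c1 now reflects t1 (1 hop)
h2 = message_passing_round(h1, adj)  # c1 now reflects p1 (2 hops)
```

After one round, `c1`'s state depends only on its direct neighbor `t1`; after two, it has absorbed information from `p1`, two hops away, exactly the customer → transaction → product pattern described above.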
Temporal handling
Time is central to RDL. The paper enforces strict temporal correctness: when making a prediction at time t, only data from before t is visible. Edges are filtered by timestamp, and the subgraph sampling respects causal ordering. This is critical for two reasons:
- Preventing data leakage. In production, you never have future data. Training must reflect this constraint. Traditional feature engineering pipelines frequently introduce subtle temporal leakage through incorrect join logic. RDL's graph-based temporal filtering handles this systematically.
- Capturing temporal dynamics. The GNN processes edges in temporal order, so it can learn patterns like recency effects, seasonal trends, and behavioral acceleration or deceleration. These patterns are encoded in the temporal structure of the graph, not in hand-crafted time-window aggregates.
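The causality constraint reduces to a filter over timestamped edges applied before subgraph sampling; a minimal sketch (not the paper's sampler, timestamps hypothetical):

```python
def temporal_edges(edges, t):
    """Keep only edges whose timestamp is strictly before prediction time t."""
    return [(u, v, ts) for (u, v, ts) in edges if ts < t]

edges = [
    ("c1", "t1", 3),   # past transaction: visible
    ("c1", "t2", 7),   # past transaction: visible
    ("c1", "t3", 12),  # future transaction: must be hidden at t=10
]
visible = temporal_edges(edges, t=10)  # drops the t=12 edge
```

Because the filter runs on the graph itself, every downstream feature the GNN computes inherits the guarantee; there is no per-feature join logic to get wrong.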
Heterogeneous node and edge types
Because different tables represent different entity types (customers, products, transactions), the graph is heterogeneous. RDL uses type-specific transformation functions: the message passing function for a customer-to-transaction edge differs from a transaction-to-product edge. This allows the GNN to learn type-aware relational patterns rather than treating all connections identically.
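Type-specific messages can be sketched as a lookup keyed by (source type, destination type). A real model learns a weight matrix per edge type; the fixed scalar transforms below are placeholders to show the dispatch structure:

```python
# One transform per edge type (illustrative constants, not learned weights).
edge_transforms = {
    ("transaction", "customer"): lambda x: 2.0 * x,
    ("product", "transaction"): lambda x: 0.5 * x,
}

def typed_message(src_type, dst_type, src_state):
    # Dispatch on the edge type, so a transaction-to-customer message
    # is computed differently from a product-to-transaction message.
    return edge_transforms[(src_type, dst_type)](src_state)

m1 = typed_message("transaction", "customer", 3.0)
m2 = typed_message("product", "transaction", 3.0)
```

The same source state (3.0) produces different messages depending on the edge type, which is the essence of heterogeneous message passing.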
RelBench: A Benchmark for Relational Deep Learning
To enable rigorous, reproducible evaluation, the paper introduces RelBench: a benchmark suite for ML on relational databases. RelBench provides standardized datasets, task definitions, data loaders, and evaluation protocols. It is open-source and available as a Python package.
The benchmark suite
RelBench includes 7 relational databases spanning diverse domains, with a total of 30 prediction tasks across 50 tables and over 103 million rows:
| Database | Domain | Tables | Rows | Columns | Tasks |
|---|---|---|---|---|---|
| rel-amazon | E-commerce | 3 | 15.0M | 15 | 7 |
| rel-avito | E-commerce | 8 | 20.7M | 42 | 4 |
| rel-event | Social | 5 | 41.3M | 128 | 3 |
| rel-hm | E-commerce | 3 | 16.7M | 37 | 3 |
| rel-stack | Social | 7 | 4.2M | 52 | 5 |
| rel-trial | Medical | 15 | 5.4M | 140 | 5 |
| rel-f1 | Sports | 9 | 74K | 67 | 3 |
Why RelBench matters
Before RelBench, there was no standard way to evaluate ML methods on relational databases. Researchers would pick a single database, define their own features, split data in ad-hoc ways, and report results that were not reproducible or comparable. RelBench standardizes every aspect:
- Temporal train/val/test splits. All splits respect temporal ordering. Training data comes before validation data, which comes before test data. No random splitting that would leak future information.
- Task definitions. Each task specifies the entity to predict for, the target variable, the time horizon, and the evaluation metric. This eliminates ambiguity.
- Data loaders. The Python package handles downloading, preprocessing, and converting databases to PyTorch Geometric graph objects. A complete training pipeline requires approximately 56 lines of code.
- Evaluation protocol. Standardized metrics (AUROC for classification, MAE for regression, MAP@K for link prediction) with consistent evaluation procedures.
Task types
RelBench covers three categories of prediction tasks:
- Node classification. Predict a property of a specific entity (will this customer churn? will this clinical trial report adverse events?). Evaluated with AUROC.
- Node regression. Predict a numerical value for an entity (what is this customer's lifetime value? how many votes will this post receive?). Evaluated with MAE.
- Link prediction. Predict which entities will interact (which products will this user buy? which users will attend this event?). Evaluated with MAP@K.
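Several MAP@K conventions exist; a common one, assumed here and not necessarily RelBench's exact definition, averages the per-user average precision at K:

```python
def average_precision_at_k(recommended, relevant, k):
    """AP@k for one user: precision at each rank that hits a relevant
    item, normalized by min(len(relevant), k)."""
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    return score / min(len(relevant), k) if relevant else 0.0

def map_at_k(all_recs, all_relevant, k):
    # Mean of per-user AP@k across all users.
    aps = [average_precision_at_k(r, rel, k)
           for r, rel in zip(all_recs, all_relevant)]
    return sum(aps) / len(aps)

# A perfect top-2 ranking for one user, a total miss for another.
score = map_at_k([["a", "b"], ["x", "y"]], [{"a", "b"}, {"z"}], k=2)
```

The perfect ranking scores 1.0, the miss scores 0.0, so the mean is 0.5; this is why the recommendation numbers later in the article look small in absolute terms.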
Benchmark Results
The paper evaluates two primary approaches on the RelBench classification tasks: a LightGBM baseline (manual feature engineering + LightGBM) and the RDL baseline (GraphSAGE GNN on the relational graph). For recommendation tasks, the paper additionally compares GraphSAGE and NBFNet against LightGBM.
Classification results (AUROC, higher is better)
| Database | Task | LightGBM | RDL (GNN) |
|---|---|---|---|
| rel-amazon | user-churn | 52.22 | 70.42 |
| rel-amazon | item-churn | 62.54 | 82.81 |
| rel-avito | user-visits | 53.05 | 66.20 |
| rel-avito | user-clicks | 53.60 | 65.90 |
| rel-event | user-repeat | 53.05 | 76.89 |
| rel-event | user-ignore | 79.93 | 81.62 |
| rel-f1 | driver-dnf | 68.86 | 72.62 |
| rel-f1 | driver-top3 | 73.93 | 75.54 |
| rel-hm | user-churn | 55.21 | 69.88 |
| rel-stack | user-engagement | 63.39 | 90.59 |
| rel-stack | user-badge | 63.43 | 88.86 |
| rel-trial | study-outcome | 70.09 | 68.60 |
| Average | | 62.44 | 75.83 |
On classification, the RDL (GNN) baseline outperforms LightGBM on 11 of 12 tasks. The improvement is especially pronounced on tasks with rich relational structure: the rel-stack user-engagement task shows a jump from 63.39 to 90.59 AUROC, a 43% relative improvement. The rel-stack user-badge task improves from 63.43 to 88.86, and rel-amazon user-churn leaps from 52.22 to 70.42. On average, RDL achieves 75.83 AUROC compared to 62.44 for LightGBM, a 21% relative improvement.
Recommendation results (MAP@k, higher is better)
| Database | Task | LightGBM | GraphSAGE | NBFNet |
|---|---|---|---|---|
| rel-amazon | user-item-purchase | 0.16 | 0.74 | 0.10 |
| rel-amazon | user-item-rate | 0.17 | 0.87 | 0.12 |
| rel-amazon | user-item-review | 0.09 | 0.47 | 0.09 |
| rel-avito | user-ad-visit | 0.06 | 0.02 | 3.66 |
| rel-hm | user-item-purchase | 0.38 | 0.80 | 2.81 |
| rel-stack | user-post-comment | 0.04 | 0.11 | 12.72 |
| rel-stack | post-post-related | 2.00 | 0.07 | 10.83 |
| rel-trial | cond-sponsor-run | 4.82 | 2.89 | 11.36 |
| rel-trial | site-sponsor-run | 8.40 | 10.70 | 19.00 |
| Average | | 1.79 | 1.85 | 6.74 |
On recommendation tasks, NBFNet (a link-prediction-specific GNN) achieves the strongest overall results with an average MAP@k of 6.74, compared to 1.85 for GraphSAGE and 1.79 for LightGBM. NBFNet excels on tasks like rel-stack user-post-comment (12.72 vs 0.04 for LightGBM) and rel-trial site-sponsor-run (19.00 vs 8.40). GraphSAGE dominates on the rel-amazon tasks where collaborative filtering patterns are strong, achieving 0.74-0.87 MAP@k compared to 0.10-0.17 for LightGBM.
Time and effort comparison
| Metric | LightGBM (manual features) | RDL (GNN) |
|---|---|---|
| Time per task | 12.3 hours | ~30 minutes |
| Lines of code | 878 (SD: 77) | 56 |
| Feature engineering | Manual (50-100 features) | None (end-to-end) |
| Temporal handling | Manual SQL logic | Automatic (graph-based) |
The honest picture
RDL does not win every task. On the rel-trial study-outcome classification task, LightGBM edges out RDL (70.09 vs 68.60 AUROC). Domain expertise and well-crafted aggregate features can still matter, especially on tasks where simple statistics are strong predictors.
The consistent pattern: RDL's advantage grows with relational complexity. Tasks that depend on multi-hop patterns, temporal dynamics, and graph topology show the largest improvements. Tasks that can be well-served by simple aggregates show smaller gains.
Concrete Examples from RelBench Datasets
The abstract numbers become clearer with concrete examples from the benchmark datasets. Here are three that illustrate different aspects of RDL's advantage.
Stack Exchange: predicting user engagement (7 tables, 4.2M rows)
The rel-stack database contains Stack Exchange data across 7 tables: users, posts, comments, votes, tags, postLinks, and badges. The user-engagement task predicts whether a user will remain active.
A flat feature table for this task might include: total post count, average score per post, number of accepted answers, account age. But engagement depends on multi-hop relational patterns: the types of questions a user answers, the tags on those questions, the voting patterns of users who interact with those answers, and the community dynamics around specific topic areas.
The RDL GNN traverses users → posts → tags and users → posts → votes → voters, capturing these multi-hop interaction patterns automatically. Result: 90.59 AUROC vs 63.39 for LightGBM, a 43% relative improvement. The user-badge task tells a similar story: 88.86 vs 63.43.
Formula 1: predicting race outcomes (9 tables, 74K rows)
The rel-f1 database connects drivers, results, races, circuits, constructors, qualifying, lapTimes, pitStops, and driverStandings. The driver-top3 task predicts whether a driver will finish in the top 3.
The relational graph captures patterns that are difficult to hand-engineer: a driver's qualifying performance at circuits with specific characteristics, the constructor's reliability history at those track types, how pit stop strategies at similar circuits affected outcomes, and performance trends across the season. These are 3-4 hop patterns across 9 tables.
RDL achieves 75.54 AUROC compared to 73.93 for LightGBM. Even on a relatively small dataset (74K rows), the relational structure carries signal that flat features miss.
Amazon: product recommendations (3 tables, 15.0M rows)
The rel-amazon database contains products, reviews, and ratings across 3 tables. The user-item-purchase link prediction task recommends which products a user will purchase next.
Recommendation is inherently a graph problem: users who bought product A also bought product B. But the relational graph goes deeper. The GNN learns collaborative filtering patterns (user → purchase → product → purchase → other users), content-based patterns (product similarity through shared reviews), and temporal patterns (seasonal purchasing, trend acceleration) simultaneously.
GraphSAGE achieves 0.74 MAP@k compared to 0.16 for LightGBM, a 4.6x improvement in recommendation quality. On the user-item-rate task, GraphSAGE reaches 0.87 MAP@k vs 0.17 for LightGBM, a 5.1x improvement.
Practical Implications and What Comes Next
The RDL paper establishes a new baseline for how ML should operate on relational data. The practical implications span both immediate workflow changes and longer-term research directions.
For ML practitioners
- Feature engineering is no longer the default approach. The paper demonstrates that end-to-end learning on the relational graph matches or exceeds LightGBM with manual feature engineering on 11 of 12 classification tasks, at a fraction of the time and code.
- Temporal correctness becomes automatic. One of the most common and dangerous bugs in ML pipelines is temporal data leakage: accidentally using future information during training. RDL's graph-based temporal filtering handles this systematically, eliminating an entire category of subtle errors.
- 56 lines of code is the new baseline. Using the RelBench Python package and PyTorch Geometric, a complete RDL pipeline (data loading, graph construction, GNN training, evaluation) fits in 56 lines. This lowers the barrier from months of SQL and pipeline engineering to an afternoon.
For enterprise ML teams
- Task throughput increases dramatically. At 12.3 hours per task, a team of 5 data scientists can deliver roughly 2 tasks per person per week. At 30 minutes per task, the same team can explore and deliver 10x more prediction tasks, enabling rapid experimentation across the organization.
- The feature store bottleneck loosens. Feature stores exist because feature engineering is expensive and features should be reused. RDL reduces the need for this infrastructure because the model learns its own features end-to-end from the raw relational data.
The road from RDL to foundation models
The RDL paper demonstrated that GNNs on relational graphs outperform manual feature engineering. But each GNN must still be trained per task and per database. The natural next question: can you pre-train a single model on many relational databases and have it generalize to new databases and new tasks without retraining?
This is exactly the direction that led to KumoRFM (Kumo Relational Foundation Model), which builds on RDL's graph representation and adds schema-agnostic encoding, graph transformer attention, and in-context learning. RDL provided the proof that the graph representation is correct and that end-to-end learning on relational data works. KumoRFM extends this from per-task training to zero-shot generalization.
Open research directions
- Scaling to larger databases. The largest RelBench database (rel-event) has 41.3M rows. Enterprise databases can have billions. Efficient subgraph sampling and distributed GNN training are active areas of research.
- More complex schema patterns. Self-referencing tables, many-to-many junction tables, and deeply nested hierarchies introduce additional complexity that future work can address.
- Integration with existing infrastructure. Making RDL pipelines work natively with data warehouses (Snowflake, Databricks, BigQuery) and existing ML infrastructure is critical for enterprise adoption.