
Understanding RelBench: A Benchmark for Deep Learning on Relational Databases
An interactive guide to the NeurIPS 2024 paper that standardizes how we evaluate ML on multi-table relational data.
Why Benchmarks Matter
Every breakthrough in machine learning has been preceded by a benchmark. ImageNet transformed computer vision. GLUE and SuperGLUE accelerated NLP. OGB (Open Graph Benchmark) catalyzed graph learning research. These benchmarks did more than measure progress. They defined the playing field, forced reproducibility, and gave researchers a common language for comparing methods.
Relational databases are the backbone of enterprise data. By most industry estimates, over 80% of corporate data lives in relational databases with multiple interconnected tables: customers linked to transactions, transactions linked to products, products linked to categories. Yet until RelBench, there was no standard benchmark for evaluating ML methods on this type of data.
This is not a minor inconvenience. Without a benchmark, the field cannot answer basic questions. Are GNNs better than gradient-boosted trees on relational data? Does manual feature engineering still beat end-to-end deep learning? Which datasets are easy, and which remain unsolved? RelBench exists to answer these questions with rigor.
The Problem with Existing Evaluation
Before RelBench, researchers evaluating ML on relational data faced a fragmented landscape. The problems were systemic, not just inconvenient.
No standard datasets
Most papers either used proprietary data (impossible to reproduce) or constructed ad-hoc datasets by extracting a single table from a relational database and discarding the rest. This defeats the purpose: the whole challenge of relational data is that information is spread across multiple connected tables.
Inconsistent preprocessing
Even when two papers used the same underlying data source, they applied different joins, different feature engineering, different temporal cutoffs. Results were not comparable. A method that appeared to win on one version of a dataset might lose on another simply due to preprocessing differences.
Missing temporal realism
Most existing benchmarks ignored the temporal dimension entirely. They used random train/test splits, which allows models to train on future data when predicting past events. This is data leakage. In production, you only have access to data from the past when making predictions about the future.
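The distinction is easy to demonstrate. Below is a minimal pure-Python sketch (toy rows and an invented cutoff date, purely for illustration) contrasting a leaky random split with a timestamp-based one:

```python
import random
from datetime import date

# Toy event log: (entity_id, timestamp); dates are invented for illustration.
rows = [(i, date(2020, 1 + i % 12, 1)) for i in range(100)]

# Leaky: a random split can put "future" rows in train and "past" rows in test.
shuffled = rows[:]
random.shuffle(shuffled)
leaky_train, leaky_test = shuffled[:80], shuffled[80:]

# Temporal: everything before the cutoff trains; everything at or after it tests.
cutoff = date(2020, 10, 1)
train = [r for r in rows if r[1] < cutoff]
test = [r for r in rows if r[1] >= cutoff]

# With the temporal split, no training row can postdate any test row.
assert max(ts for _, ts in train) < min(ts for _, ts in test)
```

The final assertion is exactly the property a random split cannot guarantee, and it mirrors the constraint a production system faces at prediction time.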
No evaluation of human effort
Existing comparisons focused exclusively on predictive accuracy. But for enterprise adoption, the cost of building and maintaining a pipeline matters just as much. A model that scores 1% higher but requires 10x more engineering effort may not be the right choice. Before RelBench, no benchmark quantified this tradeoff.
| Dimension | Before RelBench | With RelBench |
|---|---|---|
| Datasets | Ad-hoc, often single-table | 7 multi-table databases, 51 tables total |
| Tasks | Custom per paper | 30 standardized tasks across 3 types |
| Splits | Random (data leakage risk) | Temporal splits, no leakage |
| Metrics | Varies by paper | Fixed metric per task type |
| Human effort | Not measured | Hours and lines of code tracked |
| Reproducibility | Low (proprietary data, missing code) | Open source with full pipelines |
What RelBench Is
RelBench is a benchmark for evaluating deep learning on relational databases, published at NeurIPS 2024. It provides 7 databases spanning diverse domains, 30 realistic prediction tasks, standardized train/val/test temporal splits, and baseline implementations for reproducible comparison.
The scale is substantial: 51 tables, over 103 million rows, and 489 columns across all databases combined. This is not a toy benchmark. The datasets come from real-world sources including Amazon product reviews, H&M retail transactions, Stack Overflow, Formula 1 racing, clinical trials, and online classified ads.
Design principles
The authors built RelBench around four principles that distinguish it from prior work:
- Full relational structure preserved. Every database retains all of its tables and primary-foreign key relationships. Nothing is pre-joined or flattened.
- Temporal integrity enforced. All tasks use timestamp-based splits. The training window strictly precedes validation, which strictly precedes test. This mirrors production deployment conditions.
- Diverse domains and scales. The 7 databases range from 74K rows (rel-f1) to 41.3M rows (rel-event), across e-commerce, social media, sports, and healthcare.
- End-to-end reproducibility. All data loading, preprocessing, model training, and evaluation code is open source. A researcher can replicate every baseline result from the paper.
- Load Database. Download one of 7 relational databases with all tables and keys intact.
- Select Task. Choose from 30 tasks: entity classification, entity regression, or recommendation.
- Temporal Split. Automatic train/val/test split based on timestamps, preventing data leakage.
- Train & Evaluate. Run any method (GNN, tabular model, manual features) and compare on fixed metrics.
The Relational Deep Learning (RDL) pipeline
RelBench evaluates a specific approach called Relational Deep Learning (RDL). The idea: convert a relational database into a heterogeneous graph, where each table row becomes a node and each primary-foreign key link becomes an edge. Then apply a GNN to learn over this graph structure, combined with a deep tabular model for initial node features.
This contrasts with the traditional approach, where a data scientist manually joins tables, engineers features, and feeds a flat feature table into XGBoost or LightGBM. RDL replaces all of that manual work with an end-to-end learned pipeline.
The Seven Databases
Each database in RelBench was chosen to represent a distinct domain, scale, and relational structure. Together they cover e-commerce, social platforms, sports analytics, and healthcare.
| Database | Domain | Tables | Rows | Tasks |
|---|---|---|---|---|
| rel-amazon | E-commerce (Amazon reviews) | 3 | 15.0M | 7 |
| rel-avito | Classifieds (Avito ads) | 8 | 20.7M | 4 |
| rel-event | Social (event platform) | 5 | 41.3M | 3 |
| rel-f1 | Sports (Formula 1) | 9 | 74K | 3 |
| rel-hm | Retail (H&M fashion) | 3 | 16.7M | 3 |
| rel-stack | Social (Stack Overflow) | 7 | 4.2M | 5 |
| rel-trial | Healthcare (clinical trials) | 15 | 5.4M | 5 |
rel-amazon (e-commerce, 3 tables, 15M rows)
Built from Amazon product review data. Three tables: customers, products, and reviews (each review carries a rating). Tasks include predicting whether a product will receive negative reviews, forecasting product rating, and recommending products to users. This database tests a model's ability to capture user-product interaction patterns at scale.
rel-avito (classifieds, 8 tables, 20.7M rows)
Sourced from Avito, Russia's largest classified ads platform. Eight interconnected tables cover ads, users, search queries, and contextual features. With 20.7 million rows across 8 tables, this is one of the most structurally complex databases in RelBench. Tasks include predicting ad clicks and whether a user is a top seller.
rel-event (social, 5 tables, 41.3M rows)
The largest database by row count. Five tables model users, events, RSVPs, and event metadata on a social platform. With 41.3 million rows, this tests scalability. Tasks focus on predicting event attendance and user engagement.
rel-f1 (sports, 9 tables, 74K rows)
The smallest database but the most structurally rich, with 9 tables covering drivers, constructors, races, qualifying sessions, pit stops, lap times, and results spanning decades of Formula 1 data. Despite having only 74K rows, the dense interconnections between tables create a challenging learning problem. Tasks include predicting race positions and whether a driver will score a podium finish.
rel-hm (retail, 3 tables, 16.7M rows)
Real transaction data from H&M, the global fashion retailer. Three tables: customers, articles (clothing items), and transactions. Tasks include customer churn prediction and product recommendation. This dataset is particularly valuable because it represents the exact type of data that large retailers use for personalization.
rel-stack (social, 7 tables, 4.2M rows)
Built from Stack Overflow data. Seven tables cover users, posts, comments, votes, tags, badges, and post history. Tasks include predicting post engagement and user reputation changes. The hierarchical structure (posts have comments, comments have votes) tests multi-hop reasoning.
rel-trial (healthcare, 15 tables, 5.4M rows)
The most table-rich database, with 15 tables modeling clinical trial data: studies, conditions, interventions, outcomes, sponsors, facilities, and eligibility criteria. Tasks include predicting whether a trial will have adverse events and whether a study will be completed. This database is uniquely challenging because the relational structure encodes complex medical domain knowledge.
Task Types: Classification, Regression, Recommendation
RelBench defines 30 tasks across three categories. Each task is tied to a specific database, a specific entity table, and a specific target column or relation. The task type determines the evaluation metric.
Entity classification (12 tasks)
Binary or multi-class prediction for a row in an entity table. For example: will this Amazon product receive a negative review? Will this clinical trial be completed? Will this H&M customer churn? These are evaluated using AUROC (area under the ROC curve).
Classification tasks appear in every database. They range from straightforward (predicting whether a Stack Overflow user will receive a badge) to complex (predicting adverse events in clinical trials, which requires integrating information across 15 tables).
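AUROC has a convenient ranking interpretation: the probability that a randomly chosen positive example is scored above a randomly chosen negative one (ties count as half). A minimal pure-Python sketch of this pairwise definition, with toy labels and scores invented for illustration:

```python
def auroc(labels, scores):
    """AUROC via its pairwise-ranking definition: the probability that a
    randomly chosen positive is scored above a randomly chosen negative,
    with ties counting as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy churn predictions: higher score = more likely to churn.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.1]
print(auroc(labels, scores))  # 1.0: every positive outranks every negative
```

Because AUROC only depends on the ranking of scores, it is robust to miscalibrated probabilities, which is one reason benchmarks favor it for binary tasks.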
Entity regression (9 tasks)
Predicting a continuous value for an entity. Examples: what will this product's average rating be? How many engagement events will this user generate? In what position will this F1 driver finish? Regression tasks are evaluated using mean absolute error (MAE).
Regression tasks tend to be harder than classification because the model must predict precise values rather than categories. The F1 driver position prediction task, for instance, requires the model to distinguish between finishing 1st and 3rd, not just “podium vs. not.”
Recommendation (9 tasks)
Given a user (or entity), rank a set of candidate items. For example: which products should we recommend to this H&M customer? Which events will this user RSVP to? Recommendation tasks are evaluated using MAP@K (mean average precision at K).
Recommendation is arguably the most practically important task type. It directly maps to production use cases in e-commerce, content platforms, and advertising. RelBench includes recommendation tasks in rel-amazon, rel-avito, rel-hm, rel-stack, and rel-trial.
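Definitions of MAP@K vary slightly across libraries; the sketch below uses one common variant (normalizing by min(|relevant|, K)) with invented toy users and items:

```python
def average_precision_at_k(recommended, relevant, k):
    """AP@K for one user: precision at each rank where a hit occurs,
    averaged over min(len(relevant), k)."""
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    return score / min(len(relevant), k) if relevant else 0.0

def map_at_k(all_recs, all_relevant, k):
    """MAP@K: the mean of AP@K over all users."""
    aps = [average_precision_at_k(recs, rel, k)
           for recs, rel in zip(all_recs, all_relevant)]
    return sum(aps) / len(aps)

# Toy example: two users, ranked recommendations vs. actually purchased items.
recs = [["a", "b", "c"], ["x", "y", "z"]]
truth = [{"a", "c"}, {"z"}]
print(map_at_k(recs, truth, k=3))  # 7/12 ≈ 0.583
```

The metric rewards placing relevant items near the top of the list, not just retrieving them somewhere within the top K.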
| Database | Classification | Regression | Recommendation | Total |
|---|---|---|---|---|
| rel-amazon | 2 | 2 | 3 | 7 |
| rel-avito | 2 | 1 | 1 | 4 |
| rel-event | 2 | 1 | 0 | 3 |
| rel-f1 | 2 | 1 | 0 | 3 |
| rel-hm | 1 | 1 | 1 | 3 |
| rel-stack | 2 | 1 | 2 | 5 |
| rel-trial | 1 | 2 | 2 | 5 |
| Total | 12 | 9 | 9 | 30 |
Baseline Results: What the Numbers Show
RelBench compares three baselines, each representing a fundamentally different approach to prediction on relational data.
The three baselines
- LightGBM (automated features). A gradient-boosted tree model trained on features extracted by an automated featurization pipeline. This represents the “low effort, reasonable quality” approach. No manual work beyond setting up the pipeline.
- Expert Data Scientist (manual features + LightGBM). A Stanford CS graduate student with five years of experience manually engineered features for each task. They wrote custom joins, aggregations, and transformations, then trained LightGBM on the resulting feature table. This represents the gold standard of traditional ML practice.
- GNN (GraphSAGE). A graph neural network that operates directly on the relational database converted to a heterogeneous graph. Each row becomes a node, each foreign key becomes an edge. The GNN learns features end-to-end without manual engineering.
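To make the contrast concrete, here is a toy pure-Python sketch (invented table and column names, not the study's actual features) of the join-and-aggregate work the expert baseline performs before the data ever reaches LightGBM:

```python
from collections import defaultdict
from datetime import date

# Toy tables with invented columns: customers and their transactions.
customers = [{"customer_id": 1}, {"customer_id": 2}]
transactions = [
    {"customer_id": 1, "amount": 30.0, "ts": date(2020, 5, 1)},
    {"customer_id": 1, "amount": 50.0, "ts": date(2020, 6, 1)},
    {"customer_id": 2, "amount": 10.0, "ts": date(2020, 6, 15)},
]

def build_features(customers, transactions, cutoff):
    """Hand-written aggregations: each one is a guess about what predicts
    the target, and the temporal cutoff must be respected manually."""
    by_customer = defaultdict(list)
    for t in transactions:
        if t["ts"] < cutoff:  # enforce the training cutoff by hand
            by_customer[t["customer_id"]].append(t)
    rows = []
    for c in customers:
        txns = by_customer[c["customer_id"]]
        total = sum(t["amount"] for t in txns)
        rows.append({
            "customer_id": c["customer_id"],
            "txn_count": len(txns),
            "total_spend": total,
            "avg_spend": total / len(txns) if txns else 0.0,
        })
    return rows  # this flat table is what a GBM like LightGBM consumes

features = build_features(customers, transactions, cutoff=date(2020, 7, 1))
```

Multiply this by dozens of candidate aggregations, multiple tables, and per-task rework, and the 878-lines-per-task figure becomes plausible.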
- LightGBM (Auto). Automated feature extraction, no human effort, fast setup.
- Data Scientist. 12.3 hours and 878 lines of code per task, manual feature engineering.
- GNN (GraphSAGE). ~30 minutes of setup, 56 lines of code, end-to-end learning on the graph.
The human effort gap
The most striking finding is the efficiency difference. The expert data scientist spent an average of 12.3 hours and wrote 878 lines of code (with a standard deviation of 77 lines) per task. The GNN pipeline required approximately 30 minutes of setup and 56 lines of code. That is a 24x reduction in time and a 15x reduction in code.
Predictive performance
Despite requiring a fraction of the effort, the GNN baseline matched or outperformed the expert data scientist on most tasks. On several tasks, the GNN found patterns in the multi-table structure that the human expert missed entirely.
The automated LightGBM baseline (no human feature engineering) served as the lower bound. It performed reasonably on simple tasks but struggled on tasks requiring multi-hop reasoning across many tables. The gap between automated LightGBM and the GNN demonstrates the value of preserving relational structure rather than flattening it.
| Baseline | Setup Time | Code (lines) | Feature Engineering | Multi-table Learning |
|---|---|---|---|---|
| LightGBM (Auto) | Minutes | ~50 | Automated, shallow | Flattened joins |
| Expert Data Scientist | 12.3 hours/task | 878 per task | Manual, deep domain knowledge | Manual joins and aggregations |
| GNN (GraphSAGE) | ~30 min | 56 | None (learned end-to-end) | Native graph message passing |
How Relational Deep Learning Works
The GNN baseline in RelBench implements a pipeline called Relational Deep Learning (RDL). Understanding how it works clarifies why it can replace manual feature engineering.
Step 1: Database to graph
Every row in every table becomes a node. Every primary-foreign key relationship becomes an edge. A transactions row that references a customer_id creates an edge between the transaction node and the customer node. The result is a heterogeneous graph where different node types (customers, products, transactions) coexist with different edge types (purchased, reviewed, belongs_to).
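A minimal sketch of this conversion, using an invented three-table schema:

```python
# Toy relational snapshot with an invented schema: each row becomes a typed
# node, and each foreign-key reference becomes a typed edge.
tables = {
    "customers": [{"id": 1}, {"id": 2}],
    "products": [{"id": 10}, {"id": 11}],
    "transactions": [
        {"id": 100, "customer_id": 1, "product_id": 10},
        {"id": 101, "customer_id": 2, "product_id": 10},
    ],
}
foreign_keys = {  # (table, column) -> referenced table
    ("transactions", "customer_id"): "customers",
    ("transactions", "product_id"): "products",
}

# Nodes are (table, row_id) pairs, so node types come for free.
nodes = [(table, row["id"]) for table, rows in tables.items() for row in rows]

# Edges are typed by the foreign-key column that produced them.
edges = []
for (table, column), ref_table in foreign_keys.items():
    for row in tables[table]:
        edges.append(((table, row["id"]), (ref_table, row[column]), column))

print(len(nodes), len(edges))  # prints: 6 4
```

No information is discarded: the schema's key structure is preserved as edge types, which is exactly what a flattening join throws away.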
Step 2: Initial node features
Each node's raw column values are converted into an initial feature vector using a deep tabular model. Numerical columns are normalized. Categorical columns are embedded. Text columns can be encoded with pretrained embeddings. Timestamps are encoded to capture both absolute position and relative recency.
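A toy encoder sketch along these lines (the specific scheme below, including the cyclical month encoding and invented column names, is illustrative rather than RelBench's exact implementation):

```python
import math
from datetime import date

def encode_row(row, numeric_stats, categories, reference_date):
    """Toy row encoder: z-score numerics, one-hot categoricals, and encode
    the timestamp as recency plus a cyclical month position."""
    vec = []
    for col, (mean, std) in numeric_stats.items():
        vec.append((row[col] - mean) / std)                       # normalized numeric
    for col, vocab in categories.items():
        vec.extend(1.0 if row[col] == v else 0.0 for v in vocab)  # one-hot categorical
    days_ago = (reference_date - row["ts"]).days
    vec.append(days_ago / 365.0)                                  # recency in years
    angle = 2 * math.pi * (row["ts"].month - 1) / 12
    vec.extend([math.sin(angle), math.cos(angle)])                # cyclical month
    return vec

row = {"price": 20.0, "category": "shoes", "ts": date(2020, 4, 1)}
vec = encode_row(
    row,
    numeric_stats={"price": (15.0, 5.0)},  # (mean, std) from the training window
    categories={"category": ["shoes", "shirts"]},
    reference_date=date(2021, 4, 1),
)
```

In the real pipeline these encoders are learned modules inside a deep tabular model, and text columns would go through pretrained embeddings, but the shape of the job is the same: raw column values in, a fixed-length vector per row out.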
Step 3: Message passing (GraphSAGE)
The GNN runs multiple rounds of message passing. In each round, a node aggregates information from its neighbors, effectively “looking up” related rows in other tables. After two rounds, a customer node has information from its transactions, and from the products in those transactions. After three rounds, it also has information from other customers who bought the same products.
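The aggregation at the heart of this can be sketched without any learned parameters (a toy mean-aggregation round on scalar embeddings; real GraphSAGE applies learned linear transforms and nonlinearities in each round):

```python
def sage_round(embeddings, neighbors, weight_self=0.5, weight_neigh=0.5):
    """One round of GraphSAGE-style mean aggregation (toy, no learned
    weights): each node's new embedding mixes its own state with the mean
    of its neighbors' states."""
    new = {}
    for node, vec in embeddings.items():
        nbrs = neighbors.get(node, [])
        if nbrs:
            mean = [sum(embeddings[n][i] for n in nbrs) / len(nbrs)
                    for i in range(len(vec))]
        else:
            mean = vec
        new[node] = [weight_self * s + weight_neigh * m
                     for s, m in zip(vec, mean)]
    return new

# Toy graph: a customer linked to two transactions, each linked to a product.
neighbors = {
    "cust": ["txn1", "txn2"],
    "txn1": ["cust", "prodA"], "txn2": ["cust", "prodB"],
    "prodA": ["txn1"], "prodB": ["txn2"],
}
emb = {"cust": [0.0], "txn1": [1.0], "txn2": [3.0], "prodA": [5.0], "prodB": [7.0]}

# After one round the customer sees its transactions; after two, the products.
emb = sage_round(emb, neighbors)
emb = sage_round(emb, neighbors)
```

After the first round the customer's embedding reflects only its transactions; after the second, product information has propagated through those transactions into the customer node, which is the "looking up related rows" behavior described above.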
- Rows to Nodes. Each table row becomes a node; columns become raw features.
- Keys to Edges. Primary-foreign key links become edges in a heterogeneous graph.
- Tabular Encoding. A deep tabular model converts raw features into initial embeddings.
- GNN Message Passing. Nodes aggregate neighbor information across multiple hops.
- Prediction Head. A task-specific output layer for classification, regression, or ranking.
Why this works
Manual feature engineering requires the data scientist to decide in advance which cross-table patterns to compute: average order value over 30 days, count of returns in the last quarter, most frequent product category. These are educated guesses. The GNN explores the full relational neighborhood of each entity and learns which patterns are predictive for the target task. It can discover interaction effects across tables that a human would never think to encode.
Temporal handling
RelBench enforces temporal constraints at the graph level. When constructing the training graph, only rows with timestamps before the training cutoff are included as nodes. Edges to future rows are removed. This ensures the GNN cannot learn from data that would not be available at prediction time.
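A toy sketch of that filtering (invented node names and dates; it simplifies by assuming every row carries a timestamp):

```python
from datetime import date

def training_graph(node_timestamps, edges, cutoff):
    """Keep only nodes timestamped before the cutoff, then drop any edge
    touching a removed (future) node."""
    visible = {nid for nid, ts in node_timestamps.items() if ts < cutoff}
    kept_edges = [(u, v) for u, v in edges if u in visible and v in visible]
    return visible, kept_edges

# Toy timestamped rows: one transaction happens "in the future".
node_timestamps = {
    "cust1": date(2020, 1, 1),
    "txn_past": date(2020, 3, 1),
    "txn_future": date(2020, 9, 1),
}
edges = [("txn_past", "cust1"), ("txn_future", "cust1")]

visible, kept = training_graph(node_timestamps, edges, cutoff=date(2020, 6, 1))
# txn_future and its edge are excluded, so no message-passing path can
# carry future information into the training-time embeddings.
```

Because leakage is prevented at the graph level rather than per-feature, the guarantee holds no matter how many message-passing hops the GNN performs.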
Why RelBench Matters for the Field
RelBench is more than a collection of datasets. It establishes the infrastructure for an entire research direction. Here is what it enables.
Reproducible comparison of methods
For the first time, any researcher can download the same databases, run the same tasks with the same temporal splits, and compare results directly. This eliminates the “different dataset, different preprocessing” problem that plagued prior work. When a new method claims to outperform GNNs on relational data, RelBench provides the proving ground.
Quantified cost of manual feature engineering
The user study in RelBench is, to our knowledge, the first rigorous measurement of how much human effort traditional ML requires on relational data. The finding (12.3 hours and 878 lines of code per task by an expert) gives organizations a concrete number to weigh against automated alternatives. For a company with hundreds of prediction tasks, the manual approach is simply not scalable.
A path to foundation models for relational data
ImageNet did not just benchmark CNNs. It enabled transfer learning and eventually foundation models for vision. Similarly, RelBench provides the standardized evaluation needed to develop foundation models for relational data. A model pre-trained on multiple RelBench databases could potentially generalize to unseen databases, just as GPT generalizes to unseen text tasks.
Open challenges
The RelBench baselines leave significant room for improvement. Several tasks remain difficult for all methods. Some specific open problems:
- Scaling to larger databases. The largest database (rel-event, 41.3M rows) is still modest compared to production databases with billions of rows. Can GNN methods scale further?
- Few-shot and zero-shot prediction. Current baselines train from scratch on each task. Can pre-training across databases reduce the data needed for new tasks?
- Complex schema reasoning. rel-trial has 15 tables. Production healthcare databases can have 100+ tables. How do methods handle increasingly complex relational schemas?
- Dynamic graphs. Real databases change continuously. RelBench uses static snapshots with temporal splits, but production systems must handle streaming updates.
Impact on enterprise ML
The practical implication of RelBench is clear. If end-to-end deep learning on relational data can match or exceed manual feature engineering at 1/24th the time cost, the economics of enterprise ML change fundamentally. Teams can address more prediction tasks, iterate faster, and reduce dependency on scarce ML engineering talent.
RelBench provides the evidence base for this shift. Every result is reproducible, every dataset is public, and every baseline is open source. The benchmark is available at relbench.stanford.edu.