
Understanding PluRel

How synthetic relational databases unlock scaling laws for foundation models, why real data alone isn't enough, and what the results mean.

Vignesh Kothapalli, Rishabh Ranjan, Valter Hudovernik, Vijay Prakash Dwivedi, Johannes Hoffart, Carlos Guestrin, Jure Leskovec
01

The Data Problem

Scaling laws are one of the most important discoveries in modern AI. For large language models, the relationship is well-established: more data and more compute predictably produce better models. GPT-3 was trained on 300 billion tokens. GPT-4 on trillions. Each jump in data scale brought measurable, predictable gains.

Relational Foundation Models (RFMs) like KumoRFM should follow the same pattern. They learn from relational databases, the interconnected tables that power enterprise systems: customers, transactions, products, accounts. In theory, training on more diverse databases should produce models that generalize better to new schemas and new tasks.

But there is a hard constraint. Real-world relational databases are almost never publicly available. Enterprise data contains customer records, financial transactions, medical histories, supply chain details. Privacy laws (GDPR, HIPAA), business confidentiality, and regulatory requirements make it nearly impossible to assemble large-scale pretraining corpora from real relational databases.

The RelBench benchmark, the standard for evaluating RFMs, contains only 7 databases. Compare that to the trillions of tokens available for LLM training. This scarcity has a direct consequence: no one has been able to study whether scaling laws even exist for relational foundation models. You can't measure the relationship between data scale and model performance when you don't have enough data to vary the scale.

02

Why Can't Existing Methods Solve This?

Generating synthetic tabular data is not new. But existing methods fall short for relational databases in specific, measurable ways.

Single-table generators

Structural Causal Models (SCMs), which are the standard for synthetic tabular data, generate one table at a time. They can capture column distributions, correlations, and causal relationships within a table. But a relational database is not a collection of independent tables. The primary-foreign key relationships between tables determine how information flows, how rows connect, and what patterns emerge at the multi-table level.

Generating tables independently and then stitching them together with random foreign keys does not work. The connectivity pattern between tables (which rows in a parent table are referenced by which rows in a child table) controls the locality of information at multiple levels: at the table level, at the row level, and across the entire database. Random connectivity destroys these patterns.

GAN and diffusion-based generators

GAN-based methods (like conditional-TGAN) and diffusion models (like ClavaDDPM, Reldiff) can capture characteristics of real databases. But they need an existing real-world database as input. They generate variations of what already exists. They cannot synthesize novel database schemas from scratch, which is exactly what you need to scale up the number and diversity of training databases.

The multi-table challenge

The core difficulty is that a relational database has structure at three levels simultaneously:

  1. Schema level. The number of tables, their relationships (which table references which), and the topology of these connections.
  2. Connectivity level. The specific row-to-row foreign key links between tables. Real databases show hierarchical clustering here: some parent rows are referenced by thousands of children, others by very few.
  3. Feature level. The actual cell values in each table, which exhibit temporal patterns, cross-table correlations, and causal dependencies.

No existing method models all three levels together. That is the gap PluRel fills.

03

How PluRel Works

PluRel generates complete relational databases from scratch in three stages. Each stage addresses one level of the multi-table structure.

  1. Stage 1: Schema. Sample a directed acyclic graph (DAG) to define tables and their relationships.
  2. Stage 2: Connectivity. Populate foreign key columns using hierarchical bipartite graphs.
  3. Stage 3: Features. Generate cell values via Structural Causal Models with temporal patterns.

Stage 1: Schema generation via directed graphs

The database schema is sampled as a random directed acyclic graph (DAG). Nodes represent tables, edges represent primary-foreign key relationships. The graph is drawn from families that model different real-world patterns:

  • Barabasi-Albert graphs model databases with hub tables and preferential connectivity (one central table referenced by many others, like a users table).
  • Reverse Random-Tree graphs model strictly hierarchical schemas (like organizational structures).
  • Watts-Strogatz graphs model databases with table clusters (like a system with separate modules that share a few cross-references).

Tables with outgoing edges (children reference them) are classified as entity tables (e.g., users, products). Tables without outgoing edges are activity tables (e.g., transactions, clicks). Each table gets randomly sampled metadata: number of rows (500-1,000 for entity tables, 2,000-5,000 for activity tables), number of feature columns (3-40), and column types.
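As a concrete sketch, Stage 1 can be reproduced in a few lines of stdlib Python. Everything here (the function name, the edge orientation, the metadata fields) is illustrative rather than PluRel's actual API; only the sampling ranges come from the text above.

```python
import random

def sample_schema_dag(n_tables=8, m=2, seed=0):
    """Illustrative Stage 1 sketch: sample a Barabasi-Albert-style schema DAG.

    Each new table references m existing tables, chosen with probability
    proportional to how often they are already referenced (preferential
    attachment). Edges run parent -> child, so the graph is acyclic by
    construction.
    """
    rng = random.Random(seed)
    edges = []                    # (parent, child): the child holds the foreign key
    weight = [1] * n_tables       # +1 smoothing so every table is reachable
    for child in range(1, n_tables):
        parents = set()
        while len(parents) < min(m, child):
            p = rng.choices(range(child), weights=weight[:child])[0]
            parents.add(p)
        for p in parents:
            edges.append((p, child))
            weight[p] += 1
    out_deg = [0] * n_tables
    for p, _ in edges:
        out_deg[p] += 1
    tables = {}
    for t in range(n_tables):
        entity = out_deg[t] > 0   # some child table references this table
        tables[t] = {
            "kind": "entity" if entity else "activity",
            "rows": rng.randint(500, 1000) if entity else rng.randint(2000, 5000),
            "n_features": rng.randint(3, 40),
        }
    return edges, tables
```

Swapping the attachment rule changes the graph family: attaching each new table to a single random earlier table would yield a Reverse Random-Tree, for instance.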

Stage 2: Foreign key generation via bipartite graphs

This is where PluRel differs most from naive approaches. Real databases show hierarchical clustering in their foreign key connectivity. Some users have thousands of transactions; others have very few. Products in the same category are bought by overlapping sets of customers.

PluRel models this using a Hierarchical Stochastic Block Model (HSBM). For each pair of connected tables, it partitions rows in both tables into hierarchical blocks (clusters), then samples foreign key links with probabilities that depend on block membership. Rows in matching blocks are highly likely to link (probability ~0.9); rows in different blocks rarely link (probability ~0.001-0.002). This creates the realistic clustering pattern where related rows preferentially connect.
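A minimal single-level version of this idea can be sketched as follows; the real HSBM uses a hierarchy of blocks, and the function name and defaults here are illustrative:

```python
import random

def sample_foreign_keys(n_parent, n_child, n_blocks=4, p_in=0.9, seed=0):
    """Single-level block-model sketch of Stage 2 connectivity.

    Each parent and child row is assigned to a block; a child links to a
    parent inside its own block with probability p_in, and to a uniformly
    random parent otherwise (standing in for the rare cross-block links).
    """
    rng = random.Random(seed)
    parent_block = [rng.randrange(n_blocks) for _ in range(n_parent)]
    child_block = [rng.randrange(n_blocks) for _ in range(n_child)]
    by_block = {b: [] for b in range(n_blocks)}  # parent rows indexed by block
    for row, b in enumerate(parent_block):
        by_block[b].append(row)
    fks = []
    for c in range(n_child):
        b = child_block[c]
        if rng.random() < p_in and by_block[b]:
            fks.append(rng.choice(by_block[b]))   # in-block link (~0.9)
        else:
            fks.append(rng.randrange(n_parent))   # rare cross-block link
    return fks, parent_block, child_block
```

The result is exactly the pattern described above: most children of a given parent share block membership with it, so related rows cluster instead of connecting at random.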

Stage 3: Feature generation via Structural Causal Models

Each table gets its own SCM (Structural Causal Model), a directed graph where nodes represent feature columns and edges represent causal relationships. The SCM captures:

  • Temporal patterns. Feature values in activity tables follow trends, cyclical patterns, and bounded fluctuations. The paper formally defines these: trend(r) captures power-law growth, cycle(r) captures periodicity, and fluc(r) captures random noise. This avoids the unrealistic assumption that all rows are independent and identically distributed.
  • Cross-table dependencies. Feature values in child tables depend on feature values in parent tables. A transaction's amount might depend on the product's price and the customer's spending pattern. The SCM propagates information from parent rows to child rows through the foreign key links.
  • Type diversity. Each feature column is randomly assigned as numeric or categorical. Numeric values are generated through continuous functions; categorical values are sampled from softmax distributions over temporal functions.

The causal graphs themselves are sampled from diverse families (Layered, Erdos-Renyi, Barabasi-Albert, Random-Tree, Reverse Random-Tree) to produce a wide range of causal structures. Each node's value is computed via a randomly initialized MLP, which means each synthetic database has a unique, non-trivial data distribution.
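The temporal decomposition can be illustrated with a single numeric column. The exact functional forms below (square-root trend, sinusoidal cycle, Gaussian fluctuation) are stand-ins for the paper's trend/cycle/fluc definitions; categorical columns would instead be sampled from a softmax over such functions.

```python
import math
import random

def temporal_feature(n_rows, alpha=0.5, period=50.0, noise=0.1, seed=0):
    """Sketch of a Stage 3 temporal column: trend(r) + cycle(r) + fluc(r).

    trend(r) grows as a power law in the row index, cycle(r) is periodic,
    and fluc(r) is random noise, so consecutive rows are correlated
    rather than i.i.d.
    """
    rng = random.Random(seed)
    values = []
    for r in range(n_rows):
        trend = (r + 1) ** alpha                    # power-law growth
        cycle = math.sin(2 * math.pi * r / period)  # periodicity
        fluc = rng.gauss(0.0, noise)                # random fluctuation
        values.append(trend + cycle + fluc)
    return values
```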

Computational efficiency

PluRel is CPU-only and lightweight. Generating a synthetic database takes roughly 14-17 seconds per table on a single thread, with peak memory under 1 GB even for databases with 80 tables. This means generating 1,024 synthetic databases is practical, not a supercomputer-scale operation.

Generation time and memory for varying database sizes. CPU-only, single-threaded.
Tables | Latency (sec)  | Peak Memory (GB)
10     | 147.5 ± 66     | 0.45 ± 0.01
20     | 267.0 ± 129    | 0.55 ± 0.04
40     | 584.3 ± 252    | 0.77 ± 0.06
80     | 1,368.6 ± 950  | 0.91 ± 0.11
04

Scaling Laws for RFMs

With PluRel generating unlimited synthetic databases, the paper can finally answer the question: do relational foundation models exhibit scaling laws?

The experiments vary two axes independently:

  • N (diversity): the number of synthetic databases, from 8 to 1,024.
  • S (size): the total pretraining tokens extracted from those databases, from 0.5 billion to 32 billion.

The model being pretrained is a 12-layer Relational Transformer (RT) using masked token prediction (MTP). Each configuration is trained from scratch on a single Blackwell B200 GPU in about 3 hours and evaluated on 100 held-out synthetic databases.

The power laws

The validation loss follows clean power-law relationships on both axes:

  • Diversity scaling: L(N) = 0.07 · N^(-0.38) + 0.36
  • Size scaling: L(S) = 0.025 · S^(-0.48) + 0.36

Both axes matter, and they are not interchangeable. Scaling N at fixed S, or S at fixed N, produces a non-monotonic (U-shaped) loss curve. Intuitively: increasing diversity at a fixed dataset size leads to underfitting (each database is seen too few times), while increasing size at fixed diversity leads to overfitting (the model memorizes the limited set of databases).

The optimal frontier requires scaling both N and S together. The paper notes that unlike LLM scaling laws, a simple joint power-law (as used by Hoffmann et al., 2022) does not apply here because the loss is non-monotonic in each axis individually. The solution: fit separate power laws for each axis at its optimal frontier, then scale both together.
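Plugging into the fitted curves makes the diminishing returns and the shared irreducible floor of 0.36 concrete. N counts databases as in the paper; the unit of S implied by the fitted constant is an assumption of this sketch.

```python
def loss_diversity(n_databases):
    """Fitted diversity scaling law: L(N) = 0.07 * N^(-0.38) + 0.36."""
    return 0.07 * n_databases ** -0.38 + 0.36

def loss_size(size):
    """Fitted size scaling law: L(S) = 0.025 * S^(-0.48) + 0.36."""
    return 0.025 * size ** -0.48 + 0.36

# Both curves decay toward the same irreducible loss of 0.36,
# e.g. loss_diversity(8) ~ 0.39 vs loss_diversity(1024) ~ 0.37.
```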

05

Transfer to Real Databases

Scaling laws on synthetic data are interesting, but the question that matters is: does training on more synthetic data produce models that perform better on real-world tasks?

The paper evaluates zero-shot performance on RelBench (6 real-world databases, 18 tasks) across three pretraining strategies:

  1. Real only: Pretrain on RelBench using leave-one-DB-out (train on 5 databases, evaluate on the held-out 6th).
  2. Synthetic + Real: Pretrain on PluRel synthetic data first, then continue pretraining on the same RelBench data.
  3. Synthetic only: Pretrain on PluRel data alone.

Classification results (AUROC)

Zero-shot classification results. Higher is better. Majority baseline is 50.0.
Dataset    | Task        | Real Only | Synth + Real | Gain | Synth Only
rel-amazon | user-churn  | 64.2      | 65.0         | +0.8 | 64.4
rel-hm     | user-churn  | 67.4      | 66.0         | -1.4 | 63.7
rel-stack  | user-badge  | 80.0      | 82.0         | +2.0 | 81.4
rel-stack  | user-engage | 78.9      | 86.2         | +7.4 | 82.4
rel-amazon | item-churn  | 67.6      | 72.5         | +4.9 | 71.0
rel-avito  | user-visits | 57.2      | 63.4         | +6.2 | 63.5
rel-avito  | user-clicks | 54.7      | 47.9         | -6.8 | 45.9
rel-trial  | study-out   | 54.4      | 51.8         | -2.6 | 53.8
rel-f1     | driver-dnf  | 80.7      | 81.0         | +0.3 | 76.7
rel-f1     | driver-top3 | 86.9      | 88.4         | +1.5 | 82.6
Mean       |             | 69.2      | 70.4         | +1.2 | 68.5

Regression results (R²)

Zero-shot regression results. Higher is better. Mean baseline is 0.0.
Dataset    | Task       | Real Only | Synth + Real | Gain | Synth Only
rel-hm     | item-sales | 16.0      | 20.0         | +4.0 | 4.4
rel-amazon | user-ltv   | 14.5      | 18.5         | +4.0 | 9.8
rel-amazon | item-ltv   | 35.3      | 40.5         | +5.2 | 10.7
rel-stack  | post-votes | 22.3      | 25.5         | +3.2 | 15.7
rel-trial  | site-succ  | 33.7      | 38.6         | +5.0 | 38.3
rel-trial  | study-adv  | 1.9       | 1.6          | -0.3 | -0.8
rel-f1     | driver-pos | 54.3      | 55.5         | +1.2 | 41.3
rel-avito  | ad-ctr     | 3.1       | 4.9          | +1.9 | 2.5
Mean       |            | 22.6      | 25.7         | +3.0 | 15.2

What the results show

Synthetic + Real consistently beats Real only. On average, the combined approach gains +1.2% absolute AUROC on classification and +3.0% absolute R² on regression. On individual tasks, gains reach up to +7.4% AUROC (rel-stack/user-engage) and +5.2% R² (rel-amazon/item-ltv).

Regression tasks benefit the most. Synthetic + Real outperforms Real only on 7 out of 8 regression tasks, suggesting that synthetic relational diversity is especially valuable for learning continuous-valued patterns.

Synthetic only underperforms on most tasks. This is an important finding: synthetic data alone is not enough for robust zero-shot transfer to real databases. The model needs continued pretraining on real data to align with real-world distributions. Synthetic data provides a strong foundation, but real data provides the final calibration.

On a few classification tasks (rel-hm/user-churn, rel-avito/user-clicks, rel-trial/study-out), the combined approach performs slightly worse. The paper hypothesizes this is due to PluRel's current lack of textual and column-semantic information, since cell values are generated without meaningful column names.

06

The Relational Transformer

PluRel is a data generation framework, not a model architecture. It is evaluated using the Relational Transformer (RT), a 12-layer transformer designed for relational data. Understanding the RT helps explain why PluRel's synthetic data is effective.

Cell-level tokenization

The RT treats each cell in a relational database as a single token, represented as a triple: (value, column_name, table_name). Numeric, boolean, and datetime cells get type-specific normalization. Text cells are embedded via a frozen text encoder. Column and table names are embedded via a pretrained sentence encoder, which lets the model leverage schema semantics.
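Type handling can be sketched for a single cell; this is an illustrative simplification, since the real model normalizes numeric values per column and embeds text and schema names with pretrained encoders.

```python
from datetime import datetime

def tokenize_cell(value, column, table):
    """Sketch of RT cell tokenization: each cell becomes a
    (value, column_name, table_name) triple with type-specific handling."""
    if isinstance(value, bool):            # check bool before int/float
        kind, v = "bool", float(value)
    elif isinstance(value, (int, float)):
        kind, v = "numeric", float(value)  # stand-in for per-column normalization
    elif isinstance(value, datetime):
        kind, v = "datetime", value.timestamp()
    else:
        kind, v = "text", str(value)       # would go through a frozen text encoder
    return {"kind": kind, "value": v, "column": column, "table": table}
```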

Relational Attention

Standard self-attention treats all tokens equally. The RT introduces structured attention masks with three layers:

  • Column Attention: tokens attend only within the same column, capturing column-level statistics and cross-row patterns.
  • Feature Attention: tokens attend within the same row and to parent rows linked via foreign keys, aggregating attributes of related entities.
  • Neighbor Attention: tokens attend to child rows linked via primary-foreign keys, analogous to message passing in graph neural networks.
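The three masks can be sketched on toy inputs. The token layout and the single-foreign-key simplification are assumptions of this sketch, not the RT's actual implementation:

```python
def relational_masks(tokens, parent_of):
    """Build the RT's three structured attention masks (simplified).

    tokens: list of (table, row, col) triples, one per cell token.
    parent_of: maps a (table, row) to its FK parent (table, row), or is absent.
    """
    n = len(tokens)
    col_mask = [[False] * n for _ in range(n)]
    feat_mask = [[False] * n for _ in range(n)]
    nbr_mask = [[False] * n for _ in range(n)]
    for i, (ti, ri, ci) in enumerate(tokens):
        for j, (tj, rj, cj) in enumerate(tokens):
            # Column attention: same table and same column (cross-row stats).
            if ti == tj and ci == cj:
                col_mask[i][j] = True
            # Feature attention: same row, or j's row is i's FK parent.
            if (ti, ri) == (tj, rj) or parent_of.get((ti, ri)) == (tj, rj):
                feat_mask[i][j] = True
            # Neighbor attention: j's row is a child of i's row.
            if parent_of.get((tj, rj)) == (ti, ri):
                nbr_mask[i][j] = True
    return col_mask, feat_mask, nbr_mask
```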

The pretraining objective is Masked Token Prediction (MTP): mask random cells and predict their values. For numeric targets, the loss is Huber loss. For boolean targets, CrossEntropy loss. This unified objective naturally covers both property prediction (masking cells in existing tables) and forecasting (masking cells in future-looking task tables).
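The two per-cell losses are standard; minimal scalar versions (not the paper's code) look like:

```python
import math

def huber_loss(pred, target, delta=1.0):
    """Huber loss for masked numeric cells: quadratic near zero,
    linear beyond delta (robust to outlier cell values)."""
    d = abs(pred - target)
    return 0.5 * d * d if d <= delta else delta * (d - 0.5 * delta)

def cross_entropy(logits, label):
    """Cross-entropy for masked boolean/categorical cells
    (log-sum-exp computed stably)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]
```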

Why PluRel's structure matters for the RT

The RT's relational attention layers learn which cross-table patterns are predictive. If the synthetic training data has realistic connectivity (which rows link to which) and realistic temporal correlations (how features evolve over time), the attention patterns learned during pretraining will transfer to real databases. This is exactly what PluRel's HSBM connectivity and temporal SCMs provide. Random connectivity and i.i.d. features would produce attention patterns that do not transfer.

07

Engineering Improvements

The paper also introduces two architectural improvements to the RT that were necessary for stable synthetic pretraining.

Query-Key Normalization

When pretraining on diverse synthetic databases with multi-modal cell types (numeric, boolean, text), the RT showed sensitivity to random initialization. Different random seeds produced AUROC differences as large as 10.5% on the same task (rel-amazon/user-churn).

The fix: apply RMSNorm to the query and key vectors in the attention layer before computing dot products. This is a known technique (Query-Key Normalization) but its application to relational transformers is new. With QK-Norm, the cross-seed AUROC difference drops from 10.5% to 2.2%.
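A scalar sketch of the mechanism (plain lists instead of tensors, and no learned gain, which real RMSNorm layers typically include):

```python
import math

def rms_norm(x, eps=1e-6):
    """RMS-normalize a vector: divide by the root-mean-square of its entries."""
    scale = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / scale for v in x]

def qk_norm_score(q, k):
    """Attention logit with Query-Key Normalization: RMS-normalize q and k
    before the scaled dot product, which bounds the logit magnitude."""
    qn, kn = rms_norm(q), rms_norm(k)
    return sum(a * b for a, b in zip(qn, kn)) / math.sqrt(len(q))
```

Because both vectors have unit RMS after normalization, the logit is bounded by sqrt(d) regardless of how large the raw activations grow, which is what tames the initialization sensitivity.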

Effect on baselines

QK-Norm also improves the Real-only baseline. Without it, RT suffers from early overfitting, especially on binary classification tasks. The paper reports that removing QK-Norm decreases the baseline mean AUROC by 3.1% and mean R² by 3.7% absolute. This means the improved baselines in the paper are already stronger than previously published RT results.

08

Limitations and What Comes Next

The paper is transparent about PluRel's current limitations.

No self-referencing foreign keys

PluRel generates foreign keys between different tables only. Some real databases have self-referencing foreign keys (e.g., a posts table where ParentID references another row in the same posts table). PluRel cannot currently generate these patterns.

No schema semantics

PluRel generates generic column names (feature_1, feature_2) rather than meaningful names like price or age. The RT uses column and table names as part of its token embeddings, so this lack of semantic information may explain the slight performance drops on tasks where column semantics are important. The paper hypothesizes this is why Synthetic + Real underperforms Real only on a few tasks.

Limited data types

The current implementation generates numeric and categorical columns only. Real databases also contain text, images, geospatial data, JSON, and encrypted fields. The framework's SCM mechanism can be extended to these types, but this is not done in the current work.

What comes next

The paper identifies several directions: adding column semantics to synthetic databases, extending to text and other modalities, semi-synthetic data augmentation (starting from real schemas and generating synthetic rows), and exploring joint model-and-data scaling laws. The framework and all generated data are open-source at github.com/snap-stanford/plurel.
