Foundation models have transformed every major data modality. GPT-4 handles text. Stable Diffusion and DALL-E handle images. Whisper handles audio. AlphaFold handles protein structures. Each followed the same pattern: a large model pre-trained on diverse data that generalizes to new tasks without task-specific training.
One modality has been conspicuously absent from this revolution: structured data. The tables, rows, and columns that store an estimated 80% of enterprise data. The relational databases that run every bank, every retailer, every hospital. Until recently, if you wanted predictions from this data, you still had to flatten tables, engineer features, and train a model from scratch. Every single time.
That gap has closed. Relational foundation models now exist. They are pre-trained on billions of rows across thousands of databases and generalize to new relational databases zero-shot. The implications for enterprise ML are as large as GPT's implications for text.
Why the relational data gap existed
Foundation models require two things: a universal representation that works across different datasets, and enough training data in that representation to learn generalizable patterns.
For text, the representation is a token sequence. Every sentence, document, and book can be tokenized the same way. The entire internet provides training data. For images, the representation is a pixel grid. Every photo, painting, and screenshot uses the same format.
Structured data had neither property. Every relational database has a different schema: different tables, different columns, different data types, different relationships. A database of e-commerce transactions looks nothing like a database of clinical trials. There was no universal representation that could absorb both.
And there was no public pool of relational databases equivalent to Common Crawl for text or LAION for images. Enterprise databases are private by definition. You cannot scrape them from the internet.
These two problems, universal representation and data availability, are why structured data was the last frontier for foundation models. Both have now been solved.
The failed approaches
LLMs on tables: serialize and hope
The most obvious approach was to use existing LLMs. Serialize the table as JSON, CSV, or markdown, paste it into the prompt, and ask for predictions. Multiple research groups tried this systematically.
The results were disappointing. On the RelBench benchmark (7 databases, 30 tasks, 103 million rows), Llama 3.2 3B achieved 68.06 AUROC on classification tasks. A supervised graph neural network on the same data achieved 75.83. KumoRFM achieved 76.71 zero-shot.
The failure is not about model size. It is architectural. LLMs process data as a sequence of tokens. When you serialize a table as text, you destroy the relational structure. Foreign key relationships become arbitrary text strings. Multi-table joins become impossible (they would exceed context windows). Temporal ordering becomes ambiguous. Numerical precision degrades through tokenization.
To see this concretely, here is how the e-commerce data below would look when serialized for an LLM versus represented as a graph.
What an LLM receives (serialized text)
| row | serialized_input |
|---|---|
| 1 | customer_id=C-801, name=Elena Vasquez, segment=Premium, ... |
| 2 | order_id=ORD-5001, customer_id=C-801, total=$247, ... |
| 3 | review_id=R-201, customer_id=C-801, product_id=PRD-44, rating=5, ... |
| 4 | review_id=R-204, customer_id=C-803, product_id=PRD-44, rating=1, ... |
The LLM sees 'customer_id=C-801' as a run of text tokens. It has no structural understanding that C-801 links to ORD-5001 via a foreign key, or that R-201 and R-204 both reference PRD-44. The relational graph is flattened into a string.
What a relational foundation model receives (graph)
| node | type | connected_to | edge_type |
|---|---|---|---|
| C-801 (Elena) | Customer | ORD-5001, ORD-5002, R-201 | placed, placed, wrote |
| ORD-5001 | Order | C-801, PRD-44 | placed_by, contains |
| PRD-44 | Product | R-201, R-202, R-204 | reviewed_by, reviewed_by, reviewed_by |
| R-204 | Review | C-803, PRD-44 | written_by, about |
The graph model sees that PRD-44 connects Elena (5-star review) to another customer C-803 (1-star review). The foreign key structure is preserved as edges. Multi-hop paths are traversable.
An LLM reading a serialized table is like a human reading a novel where every chapter is written in a different language, the chapters are shuffled, and the page numbers are missing. The information is technically present but the structure needed to interpret it is gone.
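The contrast between the two representations can be sketched in a few lines of Python. This is an illustration built on the sample rows above, not any model's actual input pipeline:

```python
# Illustrative only: the same three rows, serialized as text (what an LLM
# sees) versus reduced to an explicit edge list (what a graph model sees).

rows = [
    {"table": "customers", "customer_id": "C-801", "name": "Elena Vasquez"},
    {"table": "orders", "order_id": "ORD-5001", "customer_id": "C-801", "total": 247.00},
    {"table": "reviews", "review_id": "R-201", "customer_id": "C-801",
     "product_id": "PRD-44", "rating": 5},
]

# Serialization: foreign keys collapse into plain substrings.
serialized = [", ".join(f"{k}={v}" for k, v in r.items()) for r in rows]

# Graph view: each row is a node, each foreign key is an explicit edge.
edges = []
for r in rows:
    node = r.get("order_id") or r.get("review_id") or r.get("customer_id")
    for fk in ("customer_id", "product_id"):
        if fk in r and node != r[fk]:
            edges.append((node, r[fk]))

print(serialized[1])  # table=orders, order_id=ORD-5001, customer_id=C-801, total=247.0
print(edges)          # [('ORD-5001', 'C-801'), ('R-201', 'C-801'), ('R-201', 'PRD-44')]
```

In the first representation the link from ORD-5001 to C-801 is just a substring; in the second it is a first-class edge the model can traverse.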
Single-table tabular foundation models
Several research groups built foundation models specifically for tabular data: TabPFN, CARTE, TabFM, and others. These models are pre-trained on diverse single-table datasets and can make predictions on new flat tables without training.
They work well within their scope. But their scope is limited to single flat tables. Enterprise data is not a single flat table. It is a relational database with 10 to 50 interconnected tables. Using a tabular foundation model on enterprise data still requires the same feature engineering step: flatten the relational database into a single table, then feed it to the model.
These models automate the modeling step on flat data. They do not address the roughly 80% of the work that goes into converting relational data into flat data.
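That flattening step looks something like the following. The feature choices (order count, total spend, last order date) are illustrative; the point is that a human must pick them by hand before a single-table model can be used:

```python
# A minimal sketch of the manual flattening step that single-table models
# still require: collapsing the sample orders table into per-customer
# features. The chosen aggregations are illustrative, not canonical.
from collections import defaultdict

orders = [
    ("ORD-5001", "C-801", 247.00, "2025-09-15"),
    ("ORD-5002", "C-801", 89.50, "2025-10-03"),
    ("ORD-5003", "C-802", 34.99, "2025-10-28"),
    ("ORD-5004", "C-803", 512.00, "2025-08-20"),
    ("ORD-5005", "C-803", 78.00, "2025-11-01"),
]

features = defaultdict(lambda: {"order_count": 0, "total_spend": 0.0, "last_order": ""})
for _, cust, total, date in orders:
    f = features[cust]
    f["order_count"] += 1
    f["total_spend"] += total
    f["last_order"] = max(f["last_order"], date)  # ISO dates sort lexicographically

print(dict(features)["C-803"])
# {'order_count': 2, 'total_spend': 590.0, 'last_order': '2025-11-01'}
```

A pandas groupby would express the same thing more compactly, but the hand-chosen aggregations, and the information they discard, are the same.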
What multi-table data actually looks like
To see why LLMs fail and relational foundation models succeed, consider a concrete e-commerce database. The signal that predicts customer behavior spans multiple tables simultaneously.
customers
| customer_id | name | segment | signup_date | region |
|---|---|---|---|---|
| C-801 | Elena Vasquez | Premium | 2023-04-12 | West |
| C-802 | Tom Fischer | Standard | 2024-01-08 | Midwest |
| C-803 | Aisha Patel | Premium | 2022-11-20 | Northeast |
orders
| order_id | customer_id | total | date | channel |
|---|---|---|---|---|
| ORD-5001 | C-801 | $247.00 | 2025-09-15 | Mobile App |
| ORD-5002 | C-801 | $89.50 | 2025-10-03 | Website |
| ORD-5003 | C-802 | $34.99 | 2025-10-28 | Mobile App |
| ORD-5004 | C-803 | $512.00 | 2025-08-20 | Website |
| ORD-5005 | C-803 | $78.00 | 2025-11-01 | Mobile App |
reviews
| review_id | customer_id | product_id | rating | date |
|---|---|---|---|---|
| R-201 | C-801 | PRD-44 | 5 | 2025-09-18 |
| R-202 | C-802 | PRD-44 | 2 | 2025-11-02 |
| R-203 | C-803 | PRD-71 | 4 | 2025-08-25 |
| R-204 | C-803 | PRD-44 | 1 | 2025-11-05 |
Note the pattern: two customers (C-802 and C-803) gave product PRD-44 low ratings. A foundation model links these reviews to the purchase patterns of other PRD-44 buyers, propagating the quality signal across the graph.
The breakthrough: relational data as graphs
The key insight came from recognizing that every relational database is a graph. Each row is a node. Each foreign key is an edge. Timestamps create temporal ordering. A database with 15 tables, 100 million rows, and 500 million foreign key relationships is a temporal heterogeneous graph with 100 million nodes and 500 million edges.
This representation is universal. It works regardless of the schema. An e-commerce database (customers, orders, products, reviews) and a clinical trial database (patients, visits, diagnoses, prescriptions) both become temporal heterogeneous graphs with the same mathematical structure. The node types and edge types differ, but the graph operations (message passing, attention, aggregation) are identical.
Relational Deep Learning, published at ICML 2024 by Robinson, Fey, et al. (Stanford and Kumo.ai), formalized this approach. They introduced RelBench as the standard benchmark and showed that graph neural networks trained on the relational graph outperform manual feature engineering across 30 tasks on 7 databases.
This solved the representation problem. Any relational database, any schema, becomes a graph that a single architecture can process.
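A minimal sketch of that conversion, using a subset of the sample schema. The `foreign_keys` and `primary_keys` mappings are what a real system would read from the schema; here they are written out by hand, and the referenced `products` table is omitted for brevity:

```python
# Schematic of "every relational database is a graph": rows become typed
# nodes, foreign keys become typed edges, and timestamps ride along on
# the edges. Illustrative only.

tables = {
    "customers": [{"customer_id": "C-801"}],
    "orders": [{"order_id": "ORD-5001", "customer_id": "C-801", "date": "2025-09-15"}],
    "reviews": [{"review_id": "R-201", "customer_id": "C-801",
                 "product_id": "PRD-44", "date": "2025-09-18"}],
}
foreign_keys = {  # which columns are FKs, and which table they point to
    "orders": [("customer_id", "customers")],
    "reviews": [("customer_id", "customers"), ("product_id", "products")],
}
primary_keys = {"customers": "customer_id", "orders": "order_id", "reviews": "review_id"}

nodes, edges = [], []
for table, table_rows in tables.items():
    for row in table_rows:
        nodes.append((row[primary_keys[table]], table))  # typed node
        for fk_col, target in foreign_keys.get(table, []):
            # typed, timestamped edge along the foreign key
            edges.append((row[primary_keys[table]], row[fk_col],
                          f"{table}.{fk_col}", row.get("date")))

print(nodes)
print(edges)
```

Nothing here depends on this particular schema: any set of tables with declared keys yields the same node/edge structure, which is exactly what makes the representation universal.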
KumoRFM: the first relational foundation model
KumoRFM is a graph transformer pre-trained on billions of rows across thousands of diverse relational databases. It is the first model that generalizes to new relational databases zero-shot, meaning it makes predictions on databases it has never seen during training.
Architecture
KumoRFM converts any input relational database into a temporal heterogeneous graph. It then applies a graph transformer with cross-table attention: each node can attend to nodes in other tables across foreign key edges, weighted by learned relevance. Temporal positional encodings ensure the model respects time ordering, so it does not leak future information into past predictions.
The architecture handles heterogeneous node types (each table has different columns), heterogeneous edge types (different foreign key relationships have different semantics), and temporal dynamics (the same entity at different points in time has different neighborhoods).
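The time-consistency constraint described above can be made concrete with a small sketch: when predicting for an entity as of time t, only events at or before t may enter its neighborhood. `temporal_neighborhood` is a hypothetical helper for illustration, not KumoRFM's actual API:

```python
# Sketch of temporal leakage prevention: filter an entity's neighborhood
# so that events after the prediction time are invisible to the model.

def temporal_neighborhood(edges, anchor, as_of):
    """Return (neighbor, event_time) pairs for `anchor` observable at `as_of`."""
    return [(dst, t) for src, dst, t in edges if src == anchor and t <= as_of]

edges = [
    ("C-801", "ORD-5001", "2025-09-15"),
    ("C-801", "ORD-5002", "2025-10-03"),
    ("C-801", "R-201",    "2025-09-18"),
]

# Predicting as of Oct 1: the Oct 3 order must be invisible.
print(temporal_neighborhood(edges, "C-801", "2025-10-01"))
# [('ORD-5001', '2025-09-15'), ('R-201', '2025-09-18')]
```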
Pre-training
During pre-training, KumoRFM learns to predict masked node attributes and future events across thousands of databases. This is analogous to how GPT learns by predicting the next token. The model discovers universal patterns that recur across relational data: recency effects (recent events predict near-term outcomes), frequency patterns (activity levels correlate with engagement), temporal dynamics (accelerating or decelerating trends), graph topology (cluster structure predicts behavior), and cross-table propagation (attributes propagate through foreign key paths).
These patterns are not hard-coded. They are learned from the data, and they transfer across domains. Recency effects in e-commerce purchases follow the same mathematical pattern as recency effects in clinical trial visits or financial transactions.
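A toy illustration of the masked-attribute objective and of cross-table propagation: hide one review's rating and reconstruct it from other reviews linked to the same product. The neighborhood mean here is a deliberately trivial stand-in for the learned model:

```python
# Toy masked-attribute pre-training step on the sample reviews table:
# mask R-202's rating, predict it from PRD-44's other reviews, and
# score the reconstruction with a squared-error loss.
reviews = [
    {"review_id": "R-201", "product_id": "PRD-44", "rating": 5},
    {"review_id": "R-202", "product_id": "PRD-44", "rating": 2},
    {"review_id": "R-204", "product_id": "PRD-44", "rating": 1},
]

masked = "R-202"  # pretend this rating is unknown
neighbors = [r["rating"] for r in reviews
             if r["product_id"] == "PRD-44" and r["review_id"] != masked]
prediction = sum(neighbors) / len(neighbors)   # propagate via the FK path
target = next(r["rating"] for r in reviews if r["review_id"] == masked)
loss = (prediction - target) ** 2              # reconstruction objective
print(prediction, loss)  # 3.0 1.0
```

The real model replaces the neighborhood mean with learned attention, but the training signal, reconstructing hidden attributes from the surrounding graph, has this shape.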
Zero-shot inference
At inference time, you point KumoRFM at a new relational database and describe your prediction task in Predictive Query Language (PQL). The model converts the database to a graph, applies its pre-trained attention layers, and returns predictions. No training. No feature engineering. No pipeline.
PQL Query
PREDICT SUM(orders.total, 0, 90) > 0 FOR EACH customers.customer_id
The query asks: will this customer make a purchase in the next 90 days, i.e., will the sum of their order totals over that window be positive? The model reads customers, orders, and reviews as a graph. It discovers that C-803's declining review scores and lengthening order gaps signal disengagement.
Output
| customer_id | purchase_probability | top_signal |
|---|---|---|
| C-801 | 0.88 | Consistent order cadence, high review scores |
| C-802 | 0.41 | Single purchase, negative review on key product |
| C-803 | 0.29 | Declining review sentiment, widening order gaps |
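To make the query's semantics concrete, here is the label it defines, computed by hand on the sample orders table. The anchor date of 2025-10-30 is illustrative; in production the model picks the prediction time:

```python
# What SUM(orders.total, 0, 90) > 0 means on the sample data: for each
# customer, is total spend in the 90 days after the anchor date positive?
from datetime import date, timedelta

orders = [
    ("C-801", 247.00, date(2025, 9, 15)),
    ("C-801", 89.50, date(2025, 10, 3)),
    ("C-802", 34.99, date(2025, 10, 28)),
    ("C-803", 512.00, date(2025, 8, 20)),
    ("C-803", 78.00, date(2025, 11, 1)),
]

anchor = date(2025, 10, 30)            # illustrative prediction time
horizon = anchor + timedelta(days=90)

labels = {}
for cust in ("C-801", "C-802", "C-803"):
    spend = sum(t for c, t, d in orders if c == cust and anchor <= d < horizon)
    labels[cust] = spend > 0           # SUM(orders.total, 0, 90) > 0

print(labels)  # {'C-801': False, 'C-802': False, 'C-803': True}
```

The difference is that this snippet computes the *label* after the fact; the foundation model predicts it from the graph as it existed at the anchor date.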
The numbers
On RelBench classification tasks (7 databases, 12 tasks):
| Approach | Avg AUROC | Training required |
|---|---|---|
| LightGBM + manual features | 62.44 | 12.3 hrs feature engineering + training, per task |
| Llama 3.2 3B (serialized tables) | 68.06 | Prompt engineering per task |
| Supervised GNN | 75.83 | Training per task, no feature eng. |
| KumoRFM (zero-shot) | 76.71 | None |
| KumoRFM (fine-tuned) | 81.14 | Minimal fine-tuning |
Two results stand out. First, KumoRFM zero-shot (no task-specific training whatsoever) outperforms a supervised GNN that was trained specifically for each task. Pre-training on diverse relational data produces better representations than training on any single database. Second, the LLM approach (Llama 3.2 3B) is 8.65 AUROC points below KumoRFM, confirming that text-based architectures are structurally wrong for relational data.
Current state: fragmented approaches
- LLMs for text, CNNs for images, nothing for relational data
- Every prediction task requires feature engineering from scratch
- 80% of data science time spent on data preparation
- Models trained from scratch for each database and task
- Signal in multi-hop relationships goes undiscovered
Foundation model era for structured data
- Single pre-trained model for any relational database
- Zero-shot predictions without feature engineering
- Multi-hop, temporal, and graph patterns captured automatically
- PQL replaces months of pipeline work with one query
- Performance improves predictably with model scale
PluRel scaling laws: why this gets better
One of the most important findings in the GPT research trajectory was scaling laws: the observation that language model performance improves predictably with model size, following a power law. This meant that investing in larger models was a reliable path to better performance, which justified the enormous compute investments of GPT-3 and GPT-4.
Kumo.ai researchers published PluRel (Power Laws for Unified Relational Learning), demonstrating that relational foundation models exhibit the same scaling behavior. The loss follows:
L(N) = 0.07 * N^(-0.38) + 0.36
Where N is the number of model parameters. The scaling exponent of -0.38 is comparable to the -0.34 observed for GPT-3 (Kaplan et al., 2020). This is not a coincidence. It suggests that relational data contains the same kind of deep, multi-scale structure that makes language data amenable to foundation model approaches.
The practical implication: doubling the model size yields a predictable reduction in loss. This means the current results are a lower bound. As compute and data scale increase, relational foundation models will continue improving along a known trajectory.
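One way to see what "predictable" means: because the reducible term scales as N^(-0.38), each doubling of parameter count removes a fixed fraction of the reducible loss, independent of where you are on the curve:

```python
# What the quoted scaling exponent implies about doubling model size.

exponent = -0.38  # from L(N) = 0.07 * N**exponent + 0.36

# Reducible loss scales as N**exponent, so doubling N multiplies it by 2**exponent.
retained = 2 ** exponent
print(f"Doubling N keeps {retained:.1%} of the reducible loss "
      f"(removes {1 - retained:.1%}).")
```

With an exponent of -0.38, each doubling removes roughly 23% of the remaining reducible loss, down toward the irreducible floor of 0.36.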
What this means for enterprise ML
The shift from task-specific models to foundation models in NLP took about 5 years (2018 GPT-1 to 2023 GPT-4 enterprise adoption). The structured data shift is following a compressed timeline because the playbook already exists.
For data science teams
The role of the data scientist shifts from pipeline builder to decision architect. Instead of spending months engineering features for one model, you spend days evaluating which prediction tasks create the most business value. PQL makes it possible to test a new prediction hypothesis in minutes. The bottleneck moves from "can we build this" to "should we build this."
For ML infrastructure
The feature store, training pipeline, model registry, and serving infrastructure that enterprise ML teams have built over the past decade were designed for the train-from-scratch paradigm. Foundation models collapse this stack. The database connects directly to the model. Predictions are served via API. The infrastructure overhead drops dramatically.
For business impact
DoorDash used this approach and saw a 1.8% engagement lift across 30 million users. Databricks saw a 5.4x conversion lift in lead scoring. Snowflake saw a 3.2x expansion revenue lift. These results came not from better models on the same data, but from models that access relational patterns invisible to flat-table approaches.
For competitive advantage
The organizations that adopt relational foundation models first gain a compounding advantage. While competitors spend months building one predictive model through traditional feature engineering, a foundation model approach lets you evaluate dozens of prediction tasks per week. You discover which predictions create the most value faster, deploy them faster, and iterate faster. Over time, this speed advantage compounds into a data moat: more predictions deployed means more feedback data, which means better fine-tuned models, which means more business impact.
The category is forming now
Foundation models for text went from "interesting research" (GPT-1, 2018) to "every enterprise needs this" (ChatGPT, 2022) in four years. Foundation models for structured data are at the beginning of this curve. The research is published (Relational Deep Learning at ICML 2024, RelBench at NeurIPS 2024, PluRel scaling laws). The benchmarks exist. The production deployments are generating real business results.
The companies that built GPT into their workflows early (not in 2024, but in 2020-2021 when it was still GPT-3) gained years of compound advantage. The same window exists now for relational foundation models. The data that runs your business is stored in relational databases. A model that reads those databases natively, without feature engineering, without training, is not an incremental improvement. It is a category shift.
Every prediction you want to make on enterprise data, from churn to fraud to demand to lifetime value, is a question about patterns in a relational graph. Foundation models are the first technology that can answer those questions directly.