Why Flattening Relational Data Kills ML Accuracy

Enterprise data lives in 5-50 connected tables. Every ML tool except Kumo requires you to flatten this into one table first. This flattening destroys multi-hop relationships, temporal sequences, and graph structure - losing 14-19+ AUROC points of predictive signal.

TL;DR

  • On the SAP SALT enterprise benchmark, KumoRFM scores 91% accuracy vs 75% for PhD data scientists with XGBoost and 63% for LLM+AutoML - with zero feature engineering and zero training time.
  • Enterprise databases are relational: 5-50 interconnected tables with foreign keys, temporal sequences, and graph structure. Every traditional ML tool - XGBoost, LightGBM, TabPFN, AutoML - requires flattening this into a single table before training.
  • Flattening destroys at least 6 categories of predictive signal: multi-hop relationships, temporal sequences across tables, graph topology, entity-level aggregation context, cross-table interaction effects, and cardinality information.
  • On RelBench, LightGBM on flattened features scores 62.44 AUROC. KumoRFM on the original relational structure scores 76.71 zero-shot and 81.14 fine-tuned - a 14-19 point gap. On some tasks the gap exceeds 27 AUROC points.
  • The feature space math explains why: 5 tables with 50 columns each produce 1,200+ first-order features, 719,400+ pairwise interactions, and ~8,000 multi-hop features. Human engineers build 50-200 features, covering 4-17% of the space.

Every enterprise database is relational. Customers link to orders. Orders link to products. Products link to reviews. Reviews link back to other customers. The data lives in 5, 10, sometimes 50 interconnected tables with foreign keys, timestamps, and hierarchical relationships.

And every ML tool - XGBoost, LightGBM, random forests, neural networks, AutoML platforms, even the newest tabular foundation models - requires you to collapse all of that structure into a single flat table before it can make a prediction.

This flattening step is so ubiquitous that most data scientists do not question it. It is just "how ML works." You write SQL joins, compute aggregations, build a feature table with one row per entity, and feed it to a model. The entire field of feature engineering exists to make this flattening step less lossy.
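
The flatten step described above can be sketched in a few lines of pandas. The tables, column names, and aggregations here are invented for illustration; any real pipeline would have dozens of such joins.

```python
import pandas as pd

# Toy stand-ins for two tables of a relational database
# (illustrative only; names and values are hypothetical).
customers = pd.DataFrame({"customer_id": [1, 2], "segment": ["smb", "ent"]})
orders = pd.DataFrame({
    "order_id": [10, 11, 12],
    "customer_id": [1, 1, 2],
    "amount": [50.0, 70.0, 200.0],
})

# The classic flattening step: collapse each customer's orders into scalars.
order_aggs = orders.groupby("customer_id")["amount"].agg(
    num_orders="count", avg_order_value="mean"
).reset_index()
flat = customers.merge(order_aggs, on="customer_id", how="left")
# `flat` now has one row per customer; the order-level sequence,
# the order-to-product links, and the timestamps are all gone.
```

Everything downstream of this cell sees only `num_orders` and `avg_order_value` - which is exactly the loss the rest of this article quantifies.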

But flattening is inherently lossy. And the signal it destroys is precisely the signal that separates good predictions from great ones.

The headline result: SAP SALT benchmark

The SAP SALT benchmark is an enterprise-grade evaluation where real business analysts and data scientists attempt prediction tasks on SAP enterprise data. It measures how accurately different approaches predict real business outcomes on production-quality enterprise databases with multiple related tables.

SAP SALT enterprise benchmark

| Approach | Accuracy | What it means |
| --- | --- | --- |
| LLM + AutoML | 63% | Language model generates features, AutoML selects model |
| PhD Data Scientist + XGBoost | 75% | Expert spends weeks hand-crafting features, tunes XGBoost |
| KumoRFM (zero-shot) | 91% | No feature engineering, no training, reads relational tables directly |

SAP SALT benchmark: KumoRFM outperforms expert data scientists by 16 percentage points and LLM+AutoML by 28 percentage points on real enterprise prediction tasks.

KumoRFM scores 91% where PhD-level data scientists with weeks of feature engineering and hand-tuned XGBoost score 75%. The 16 percentage point gap is the value of reading relational data natively instead of flattening it into a single table.

What flattening destroys

When you collapse a relational database into a flat table, you lose at least six categories of predictive signal that cannot be recovered by any downstream model, no matter how sophisticated.

Signal types destroyed by flattening

| Signal type | What it captures | Example | Flat-table substitute |
| --- | --- | --- | --- |
| Multi-hop relationships | Patterns across 3+ connected tables | customer → orders → products → reviews → similar customers | None. Joins typically stop at 1-2 hops. |
| Temporal sequences across tables | Activity progression patterns over time | Login → Browse → Add to cart → Abandon → Support ticket (in order) | Scalar aggregates: pages_viewed=22, cart_abandons=3 |
| Graph topology | Structural patterns like rings, clusters, hubs | A → B → C → D → A (fraud ring), social clusters | Invisible. Single-row features cannot represent cycles. |
| Entity-level aggregation context | How an entity relates to its full neighborhood | A customer's merchant diversity (50 unique merchants vs. 3) | A single count: num_merchants=50. Context lost. |
| Cross-table interaction effects | Correlations between events in different tables | Product returns × support tickets × review sentiment | Requires pre-computed interaction features. Rarely built. |
| Cardinality information | How many related entities exist and their distribution | Lead has 4 contacts from 3 departments (multi-threaded) | contact_count=4. Department spread gone. |
| Temporal decay patterns | Recency-weighted importance of related events | Recent orders matter more than old ones for churn | avg_order_value (treats all orders equally) |
| Heterogeneous relationship types | Different edge types carry different meaning | purchased vs. returned vs. reviewed vs. wishlisted | All collapsed into generic aggregates |

Highlighted: the top three signal types - multi-hop relationships, temporal sequences, and graph topology - are the most common sources of large accuracy gaps between flat and relational approaches.

Concrete example: Lead scoring

Consider Lead L-302 in a B2B CRM. The relational database contains rich, multi-table context about this lead:

  • 4 contacts from 3 departments are active on the account - a multi-threaded buying committee, which is the strongest predictor of enterprise deal closure.
  • Content progression: Blog → Case study → API docs → Demo request. This is a textbook buying journey from awareness to evaluation.
  • Similar account closed $210K last quarter. The account-similarity signal comes from matching company attributes and engagement patterns across the opportunities table.
  • Company raised Series B 30 days ago. Firmographic momentum from the accounts table indicates budget availability.

Lead L-302: relational vs. flat

| Data source | Relational signal | Flat-table value |
| --- | --- | --- |
| contacts table | 4 contacts from 3 departments (multi-threaded) | emails_opened=4 |
| activities table | Blog → Case study → API docs → Demo (buying progression) | pages_viewed=22 |
| opportunities table | Similar account closed $210K last quarter | Not captured |
| accounts table | Company raised Series B 30 days ago | company_size=200 |

Every relational signal that makes L-302 a strong lead is destroyed or reduced to a meaningless scalar. A flat-table model sees emails_opened=4, pages_viewed=22, company_size=200. It has no way to know that those 4 emails came from 3 different departments, that the 22 page views followed a specific buying-stage progression, or that a similar account just closed a $210K deal. All of the signal that makes this a high-value lead is invisible.

Concrete example: Fraud detection

Account A sends $500 to Account B. Account B sends $480 to Account C. Account C sends $460 to Account D. Account D sends $440 back to Account A. Each individual transaction looks perfectly normal - a modest transfer between two accounts.

But the pattern is a fraud ring: A → B → C → D → A. Money is cycling through four accounts, with small amounts skimmed at each step. This circular flow is a classic money laundering pattern, and it is only visible when you can see the graph structure of transactions.

When you flatten the transaction data into a single row per transaction, each row contains: sender_id, receiver_id, amount, timestamp. There is no column for "this transaction is part of a four-node cycle." The ring is invisible. No amount of feature engineering on a single transaction row can recover the circular topology - because the pattern exists across four rows, not within one.
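
A minimal sketch makes the contrast concrete. The same four transfers pass any per-row check, yet a simple walk over sender → receiver edges exposes the ring. (In this toy data each account has exactly one outgoing edge; real transaction graphs would need a full DFS over all outgoing edges.)

```python
# The four transfers from the example, first as flat rows, then as a graph.
transactions = [
    ("A", "B", 500), ("B", "C", 480), ("C", "D", 460), ("D", "A", 440),
]

# Row view: each transfer in isolation clears a per-transaction threshold.
assert all(amount < 1000 for _, _, amount in transactions)

# Graph view: follow sender -> receiver edges and the ring appears.
edges = {sender: receiver for sender, receiver, _ in transactions}

def find_cycle(start, max_hops=6):
    """Walk outgoing edges from `start`; return the path if it loops back."""
    path, node = [start], edges.get(start)
    while node is not None and len(path) <= max_hops:
        path.append(node)
        if node == start:
            return path  # e.g. ['A', 'B', 'C', 'D', 'A']
        node = edges.get(node)
    return None
```

The cycle only exists across four rows; no feature computed within one row can represent it.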

Concrete example: Churn prediction

Member Bob visits his gym 4 times per week on average. Over the last 6 weeks, his visit frequency dropped 68% - from 4 visits to 1.3. That alone is a churn signal. But the relational data reveals more:

  • Bob's workout buddies are also churning. Two of his three regular workout partners have cancelled in the last month. Social churn is contagious - when your peers leave, you are far more likely to leave.
  • Bob downgraded his plan from Premium to Basic last billing cycle, reducing his monthly payment from $79 to $29. This is a leading indicator of full cancellation.
  • Bob stopped attending group classes. His class attendance went from 3 per week to 0. Group class members have 2.3x higher retention, so losing this engagement channel is significant.

When flattened: visit_frequency=1.3, plan_type=Basic, monthly_spend=$29. The social churn signal - the fact that Bob's friends are leaving - disappears entirely. The peer behavior pattern exists in the relationships between members, not in any single member's row.
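
The peer signal is trivial to compute once you have the member-to-member edges - the point is that those edges never survive flattening. A sketch with invented names:

```python
# Illustrative only: the peer-churn signal lives on edges between members,
# not in any single member's row.
workout_partners = {"bob": ["ann", "carl", "dee"]}
cancelled = {"ann", "carl"}

peer_churn_rate = (
    sum(p in cancelled for p in workout_partners["bob"])
    / len(workout_partners["bob"])
)
# 2 of Bob's 3 regular partners have cancelled -> peer_churn_rate ~ 0.67,
# a feature that never appears in Bob's flattened row.
```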

The accuracy gap: RelBench results

The destruction is not theoretical. The RelBench benchmark (7 databases, 30 tasks, 103 million rows) measures exactly how much signal is lost when you flatten relational data versus processing it in its native structure.

AUROC (Area Under the Receiver Operating Characteristic curve) measures how well a model distinguishes between positive and negative outcomes. An AUROC of 50 means random guessing, 100 means perfect prediction. Moving from 65 to 77 AUROC means the model correctly ranks a true positive above a true negative 77% of the time instead of 65%.
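
That ranking interpretation can be computed directly: AUROC is the fraction of (positive, negative) pairs where the positive outranks the negative, with ties counted as half. The scores below are made up for illustration.

```python
def auroc(pos_scores, neg_scores):
    """AUROC as a pairwise ranking probability (0-1 scale)."""
    pairs = [(p, n) for p in pos_scores for n in neg_scores]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs)
    return wins / len(pairs)

score = auroc([0.9, 0.8, 0.4], [0.7, 0.3])  # 5 of 6 pairs ranked correctly
```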

RelBench accuracy comparison

| Approach | AUROC (classification) | Gap vs. flat | What it processes |
| --- | --- | --- | --- |
| LightGBM on flattened features | 62.44 | Baseline | Flat table (manual features) |
| XGBoost on flattened features | ~63-64 | +1-2 pts | Flat table (manual features) |
| KumoRFM zero-shot | 76.71 | +14.27 pts | Full relational structure |
| KumoRFM fine-tuned | 81.14 | +18.70 pts | Full relational structure + task adaptation |

Highlighted: the 14-19 AUROC point gap between flat-table and relational approaches. This gap represents the predictive signal destroyed by flattening.

Tasks with the largest accuracy gaps

| Task | LightGBM (flat) | KumoRFM (fine-tuned) | Absolute gap | Relative improvement |
| --- | --- | --- | --- | --- |
| rel-stack user-engagement | 63.39 | 90.59 | +27.20 pts | 43% |
| rel-hm item-sales | 57.12 | 78.84 | +21.72 pts | 38% |
| rel-avito ad-click | 59.21 | 77.93 | +18.72 pts | 32% |
| rel-f1 driver-position | 64.88 | 81.02 | +16.14 pts | 25% |
| rel-event user-attendance | 61.45 | 76.71 | +15.26 pts | 25% |

Highlighted: rel-stack user-engagement shows a 27+ AUROC point gap - the largest in the benchmark. User engagement patterns are deeply relational (users → posts → comments → votes → tags), and flattening destroys the interaction graph.

The gap is not uniform. Tasks that depend heavily on multi-hop relationships, temporal sequences, and graph structure show the largest gaps. Tasks that are well-served by simple aggregations (count, sum, mean) show smaller gaps. But in no case does the flat approach match the relational approach.

The feature space math

The accuracy gap has a mathematical explanation. Consider a modest enterprise database with 5 tables and 50 columns per table.

Feature space coverage

| Feature type | Count | Human engineers build | Coverage |
| --- | --- | --- | --- |
| First-order features (single-column aggregations) | 1,200+ | 40-80 | 3-7% |
| Pairwise interaction features | 719,400+ | 10-50 | ~0.01% |
| Multi-hop features (2+ relationship hops) | ~8,000+ | 0-20 | 0-0.25% |
| Temporal window variants (7d, 30d, 90d) | 3x multiplier on all above | 20-50 windows | ~1% |
| Total explorable feature space | ~2.2 million+ | 50-200 features | 4-17% |

Highlighted: human data scientists explore 4-17% of the possible feature space. The remaining 83-96% is unexamined signal that a relational foundation model can access automatically.

This is not a criticism of data scientists. No human can enumerate 2.2 million features and test them for predictive value. The combinatorial space is too large. Data scientists use domain knowledge to build the 50-200 features they believe matter most. But domain knowledge is biased toward obvious signals and misses subtle multi-hop interactions.
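
The counts above can be reproduced with back-of-the-envelope arithmetic. The assumption of five aggregation functions (count, sum, mean, min, max) is mine; the pairwise count is exact combinatorics over 1,200 first-order features.

```python
from math import comb

tables, cols_per_table = 5, 50
aggregations = 5                                  # assumed: count/sum/mean/min/max

first_order = tables * cols_per_table * aggregations  # ~1,250 -> "1,200+"
pairwise = comb(1200, 2)                              # exactly 719,400 interactions
human_built = 200                                     # upper end of 50-200
coverage = human_built / first_order                  # ~16% of first-order alone
```

Even the generous 200-feature budget covers a sixth of the first-order features and a vanishing fraction of the pairwise space.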

A foundation model that reads the relational structure directly does not enumerate features at all. It learns a continuous representation of the entire relational neighborhood around each entity, implicitly capturing all of the patterns that exist in the data - including the 83-96% of the feature space that human engineers never explore.

Why TabPFN and Fundamental do not solve this

TabPFN (from the University of Freiburg) and Fundamental are tabular foundation models - pre-trained models designed for tabular data. They represent genuine advances in model architecture. On single-table benchmarks, they often match or beat well-tuned XGBoost and LightGBM ensembles.

But they are still tabular models. Their input is a flat table with one row per entity and a fixed number of columns. The flattening step - the SQL joins, the aggregations, the lossy collapse of relational structure into scalar features - happens before TabPFN or Fundamental ever sees the data.

Think of it this way: TabPFN is a better lens for looking at a photograph. KumoRFM is a better camera that captures more of the scene. No amount of lens improvement can recover detail that was never captured in the photograph.

Tabular FM vs. relational FM

| Capability | TabPFN / Fundamental | KumoRFM |
| --- | --- | --- |
| Input format | Single flat table | Multiple relational tables |
| Handles multi-table joins | No (requires pre-flattening) | Yes (reads foreign keys directly) |
| Multi-hop pattern discovery | No | Yes (graph message passing) |
| Temporal sequence preservation | No (static features only) | Yes (timestamps on nodes and edges) |
| Graph topology awareness | No | Yes (heterogeneous graph transformer) |
| Pre-training data | Single tables from OpenML | Thousands of relational databases |
| The flattening problem | Not addressed | Eliminated entirely |

Highlighted: tabular foundation models do not address the flattening problem. They improve what happens after flattening. Relational foundation models eliminate the need to flatten in the first place.

How KumoRFM avoids flattening

KumoRFM takes a fundamentally different approach. Instead of requiring a flat table, it represents the entire relational database as a temporal heterogeneous graph:

  • Each row in each table becomes a node. A customer row is a customer node. An order row is an order node. A product row is a product node.
  • Each foreign key becomes an edge. The customer_id foreign key in the orders table creates an edge from each order node to its customer node. The product_id foreign key creates an edge from each order to its product.
  • Timestamps are preserved. Every node and edge carries its temporal information. The model can distinguish between a customer who placed 10 orders last week and a customer who placed 10 orders over 3 years.
  • A graph transformer processes the full structure. Information propagates through the graph via message passing. After 3 layers, each node has aggregated context from entities up to 3 hops away - capturing exactly the multi-hop patterns that flattening destroys.
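
The rows-to-nodes, foreign-keys-to-edges mapping can be sketched in plain Python. This is an illustration of the idea only, not Kumo's actual API or internal representation; all names are invented.

```python
# Illustrative sketch: mapping two relational rows to typed graph nodes
# and foreign-key edges, with timestamps preserved as node attributes.
orders = [
    {"order_id": 10, "customer_id": 1, "product_id": 7, "ts": "2024-05-01"},
    {"order_id": 11, "customer_id": 1, "product_id": 9, "ts": "2024-05-03"},
]

nodes, edges = {}, []
for row in orders:
    order_node = ("order", row["order_id"])
    nodes[order_node] = {"ts": row["ts"]}  # each row becomes a typed node
    # each foreign key becomes a typed, directed edge to the referenced row
    edges.append((order_node, ("customer", row["customer_id"]), "placed_by"))
    edges.append((order_node, ("product", row["product_id"]), "contains"))
```

A message-passing layer over `edges` would then let each customer node aggregate its orders, each order its products, and so on - one hop per layer.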

PQL Query

```
PREDICT churn_90d
FOR EACH members.member_id
WHERE members.status = 'active'
```

One query replaces the entire flatten-and-model pipeline. KumoRFM reads the members, visits, classes, payments, and social connections tables directly. It discovers that Bob's workout buddies are churning, his visit frequency is declining, and he downgraded his plan - without any feature engineering.

Output

| Member | Churn prob (relational) | Churn prob (flat) | Key signal |
| --- | --- | --- | --- |
| M-4401 (Bob) | 0.89 | 0.54 | Peer churn + frequency drop |
| M-4402 | 0.12 | 0.18 | Stable peers, increasing visits |
| M-4403 | 0.71 | 0.41 | Class dropout + plan downgrade |
| M-4404 | 0.06 | 0.09 | High engagement, no risk signals |

The bottom line: flattening is the bottleneck

The ML industry has spent a decade building better models for flat tables: XGBoost, LightGBM, CatBoost, TabPFN, Fundamental, AutoML ensembles. Each new model squeezes another 1-3 AUROC points out of the same flat feature table. Meanwhile, the gap between flat features and full relational data is 14-19 points.

The bottleneck was never the model. It was always the data representation. Flattening a relational database into a single table destroys the very patterns that differentiate accurate predictions from mediocre ones: multi-hop relationships, temporal sequences, graph topology, and cross-table interactions.

The solution is not better feature engineering. The solution is not trying harder to flatten without losing signal. The solution is eliminating the flattening step entirely - reading relational data in its native structure, the way it was designed to be stored.

That is what relational foundation models do. And the 14-19 AUROC point improvement is the signal that flattening was destroying all along.

Frequently asked questions

What does it mean to flatten relational data?

Flattening relational data means joining multiple database tables into a single flat table with one row per entity. For example, to predict customer churn, you join the customers, orders, products, support tickets, and payments tables into one wide table where each row represents a customer. This requires writing SQL joins and aggregations (like avg_order_value or total_support_tickets) to collapse related rows into scalar features. The process discards graph topology, temporal sequences, multi-hop relationships, and cardinality information that existed in the original relational structure.

Why do most ML tools require flat tables?

Most ML algorithms - XGBoost, LightGBM, random forests, logistic regression, neural networks, and even newer tabular foundation models like TabPFN - are designed to process fixed-width feature vectors. Each training example must be a single row with a fixed number of columns. Relational databases have variable-length relationships (a customer may have 3 orders or 300), hierarchical structure, and temporal dynamics that do not fit into a fixed-width row. Flattening is the workaround that forces relational data into the format these models require.

How much accuracy is lost by flattening?

On the RelBench benchmark (7 databases, 30 tasks, 103 million rows), LightGBM on manually engineered flat features scores 62.44 AUROC. KumoRFM on the original relational structure scores 76.71 zero-shot and 81.14 fine-tuned. That is a 14-19 AUROC point gap. On specific tasks, the gap is even larger: rel-stack user-engagement goes from 63.39 (flat) to 90.59 (relational), a 43% relative improvement. The lost signal comes from multi-hop patterns, temporal sequences, and graph topology that flattening destroys.

Can better feature engineering close the gap?

Only partially. A database with 5 tables and 50 columns per table has 1,200+ first-order features, 719,400+ pairwise interactions, and ~8,000 multi-hop features. Human data scientists typically build 50-200 features, covering 4-17% of the feature space. Even the most experienced team will miss the majority of predictive patterns. The combinatorial explosion makes exhaustive manual feature engineering practically impossible. A relational foundation model explores the full feature space automatically.

Do tabular foundation models like TabPFN solve this problem?

No. TabPFN, Fundamental, and other tabular foundation models are designed for flat tables. They accept a single table as input and learn patterns within that table. The flattening step happens before these models ever see the data. They may be better than XGBoost at learning from the features you give them, but they cannot recover signals that were destroyed during flattening. The problem is not the model architecture - it is the data representation.

How does KumoRFM avoid flattening?

KumoRFM represents the entire relational database as a temporal heterogeneous graph. Each row in each table becomes a node. Each foreign key relationship becomes an edge. Timestamps are preserved as temporal attributes. A graph transformer processes this structure by passing messages along edges, learning which cross-table patterns are predictive. Multi-hop patterns (customer to orders to products to returns) are captured naturally because information propagates through the graph layer by layer, without any flattening or manual feature engineering.

See it in action

KumoRFM delivers predictions on relational data in seconds. No feature engineering, no ML pipelines. Try it free.