Why LLMs Fail on Structured Data (And What Works Instead)

Everyone's first instinct: throw GPT at the spreadsheet. Llama 3.2 3B scored 68.06 AUROC on relational tasks. KumoRFM scored 76.71 on the same tasks, zero-shot. The training objective is the problem.

TL;DR

  • On RelBench classification tasks, Llama 3.2 3B scores 68.06 AUROC; KumoRFM scores 76.71 zero-shot (81.14 fine-tuned). The 8.65-point gap is an architectural mismatch, not a scaling problem.
  • LLMs are trained for next-token prediction. Tabular data is unordered rows, heterogeneous types (int, float, categorical, datetime), and statistical cross-row patterns. These are fundamentally different problems.
  • Serializing tables as text destroys type information. The number 15,847.32 becomes four tokens, and enterprise tables with 10M+ rows cannot fit in any context window: the LLM sees well under 0.05% of the data.
  • Graph transformers match the data structure: rows are nodes, foreign keys are edges, types are natively encoded, and millions of rows are processed as graph structure rather than token sequences.
  • LLMs excel at adjacent data tasks (natural language interfaces, documentation, result interpretation, code generation) but should not be used for the prediction task itself on structured relational data.

The idea is natural. LLMs can do everything else. They write code, summarize documents, answer questions, generate images. Why not point them at a database and ask "which customers will churn?"

People have tried. Research teams at Google, Meta, and dozens of startups have spent the last two years exploring LLMs on tabular data. The results are consistent: LLMs underperform purpose-built approaches, often by a wide margin.

On the RelBench benchmark (7 databases, 30 tasks, 103 million rows), Llama 3.2 3B scored 68.06 AUROC on classification tasks. KumoRFM, a foundation model designed for relational data, scored 76.71 AUROC zero-shot. A supervised graph neural network scored 75.83. LightGBM with manual features trailed at 62.44, capped by the quality of the hand-engineered features.

This is not a scaling problem. It is an architectural mismatch. Here is why.

Serialization example

| customer_id | revenue | plan | created_at | is_active |
|---|---|---|---|---|
| 48291 | $15,847.32 | Enterprise | 2024-03-15 | true |
| 72104 | $2,340.00 | Basic | 2025-01-08 | true |
| 55893 | $891.50 | Pro | 2024-11-22 | false |

An LLM sees this as: "48291 | $15,847.32 | Enterprise | 2024-03-15 | true". The number 15847.32 becomes tokens ["15", "847", ".", "32"]. The model cannot natively understand that $15,847 is close to $16,000 but far from $1.58.
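The effect is easy to demonstrate. The snippet below uses a crude regex as a stand-in for a real BPE tokenizer (actual vocabularies split differently, but long numbers shatter into short digit chunks in the same way):

```python
import re

def naive_subword_tokens(text):
    # Crude stand-in for an LLM tokenizer: real BPE vocabularies differ,
    # but long numbers similarly shatter into short digit chunks.
    return re.findall(r"\d{1,3}|\.|[A-Za-z]+|\S", text)

print(naive_subword_tokens("15847.32"))  # ['158', '47', '.', '32']
print(naive_subword_tokens("15900"))     # ['159', '00']
```

Note that 15847.32 and 15900 are numerically close but share no tokens at all, while 15847.32 and 1.58 share digit chunks despite being four orders of magnitude apart. The model gets no signal that distance in token space tracks distance in value.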

Architecture mismatch

| Property | Text (LLM native) | Tabular data (actual) | Consequence |
|---|---|---|---|
| Order | Sequential (word order matters) | Unordered (row shuffle = same data) | LLM assigns meaning to row position |
| Types | Uniform tokens from vocabulary | Mixed: int, float, categorical, datetime | Numerical reasoning brittle on tokens |
| Scale | Context window: 128K-1M tokens | Enterprise table: 10M+ rows | LLM sees well under 0.05% of data |
| Patterns | Sequential dependencies | Statistical relationships across rows | Wrong optimization objective |
| Structure | Flat sequence | Multi-table with foreign keys | No native relational representation |

Five architectural mismatches between LLMs and tabular data. Each one independently degrades performance. Together, they explain the 8.65 AUROC gap between Llama 3.2 3B (68.06) and KumoRFM (76.71).

The training objective mismatch

LLMs are trained to predict the next token in a sequence. This objective is brilliant for language. Language is sequential, each word depends on the words before it, and the training signal is rich (every token provides a gradient).

Tabular data is not sequential. It has three properties that make next-token prediction fundamentally wrong:

1. Row order is meaningless

In text, the sentence "the cat sat on the mat" means something different from "mat the on sat cat the." Word order carries semantic information, and LLMs learn to exploit it.

In a table, rows are unordered. Shuffling the rows of a customer table does not change any prediction. The model needs to be permutation-invariant with respect to rows. LLMs are not. They have positional embeddings that assign meaning to position, which means the same data presented in a different row order can produce different predictions.
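A minimal sketch of the problem, with hypothetical customer rows: reordering leaves the table unchanged, but changes the token sequence the LLM actually consumes.

```python
rows = [
    {"customer_id": 48291, "revenue": 15847.32, "plan": "Enterprise"},
    {"customer_id": 72104, "revenue": 2340.00,  "plan": "Basic"},
    {"customer_id": 55893, "revenue": 891.50,   "plan": "Pro"},
]

def serialize(table):
    # Row-by-row text serialization, roughly as an LLM would receive it.
    return "\n".join(" | ".join(str(v) for v in r.values()) for r in table)

reordered = rows[::-1]  # any reordering: row order carries no information

# The data is identical...
assert sorted(r["customer_id"] for r in rows) == \
       sorted(r["customer_id"] for r in reordered)
# ...but the input the LLM sees is not.
print(serialize(rows) == serialize(reordered))  # False
```

A permutation-invariant model would, by construction, produce identical outputs for both orderings; a model with positional embeddings has to learn to ignore the difference, and in practice does not fully.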

2. Column types are heterogeneous

Text is a sequence of tokens from a fixed vocabulary. A table row might contain an integer (customer_id: 48291), a float (revenue: 15,847.32), a categorical (plan: enterprise), a timestamp (created_at: 2024-03-15 09:42:11), and a boolean (is_active: true).

Serializing these into text tokens destroys the type information. The number 15847.32 becomes the token sequence ["15", "847", ".", "32"], which the LLM treats as four separate pieces. It cannot natively understand that 15847.32 is close to 15900 but far from 1.58. Numerical reasoning on serialized text is brittle and unreliable.

3. Predictive patterns are statistical, not sequential

In text, the pattern is "given this sequence of words, what comes next?" In tabular data, the pattern is "given the statistical relationships across rows and columns, what is the value of this target variable?"

Predicting churn requires understanding that customers with declining login frequency AND increasing support tickets AND approaching contract renewal have a high churn probability. This is a multivariate statistical pattern across multiple tables. It is not a sequential completion task.
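Written out by hand, the pattern looks like the sketch below. The thresholds and feature names are illustrative, not from any benchmark; the point is that the target is a conjunction of statistics computed across rows and tables, not a sequence completion.

```python
from datetime import date

def churn_risk(logins_last_30d, logins_prev_30d, tickets_last_30d,
               tickets_prev_30d, renewal_date, today=date(2025, 3, 1)):
    """Hand-written stand-in for the multivariate pattern a model must learn.
    All thresholds are illustrative only."""
    declining_logins = logins_last_30d < 0.5 * logins_prev_30d
    rising_tickets = tickets_last_30d > tickets_prev_30d
    renewal_soon = (renewal_date - today).days <= 60
    return declining_logins and rising_tickets and renewal_soon

print(churn_risk(4, 20, 5, 1, date(2025, 4, 10)))  # True: all three signals fire
```

A real model learns thousands of such interactions from data instead of hand-coding them, but the shape of the problem is the same: aggregate, compare, combine.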

What the benchmarks show

The RelBench benchmark provides the most rigorous comparison available. Here is how different approaches perform on the same classification tasks:

| Approach | AUROC | Architecture |
|---|---|---|
| LightGBM + manual features | 62.44 | Gradient boosted trees on flat table |
| Llama 3.2 3B (text serialization) | 68.06 | LLM with tables serialized as text |
| Supervised GNN (RDL) | 75.83 | Graph neural network on relational graph |
| KumoRFM zero-shot | 76.71 | Pre-trained graph transformer |
| KumoRFM fine-tuned | 81.14 | Fine-tuned graph transformer |

The LLM (68.06) outperforms LightGBM with manual features (62.44), but this says more about the limitations of manual feature engineering than about the strength of LLMs. The LLM sees the raw data and captures some cross-column patterns that manual features miss. But it still falls 8.65 points short of KumoRFM zero-shot, which sees the same raw data through a graph-native architecture.

The serialization problem

To feed tabular data to an LLM, you have to serialize it as text. There are several approaches, and none of them work well.

Row-by-row serialization

Convert each row to a text string: "Customer 48291 has plan enterprise, revenue 15847.32, created on 2024-03-15, is active." The LLM processes each row as a text passage.

Problem: the model sees one row at a time. It cannot compare across rows (is 15847.32 high or low for this segment?) without seeing all rows simultaneously. Context windows cap at 128K to 1M tokens, but enterprise tables have millions of rows. You physically cannot fit the data into the context.

Table-as-markdown

Format the table as markdown with headers and pipes. This preserves column alignment and lets the model see multiple rows.

Problem: at roughly 50 tokens per row, a 128K context holds about 2,500 rows. For a table with 10 million rows, you are showing the model about 0.025% of the data. Any statistical pattern that requires seeing the full distribution is invisible.

JSON serialization

Represent each row as a JSON object. This preserves column names and types better than markdown.

Problem: even more verbose than markdown. Fewer rows fit in the context window. And the fundamental issues (row order sensitivity, numerical reasoning brittleness) remain.

The multi-table problem

Everything above applies to a single table. Enterprise databases have 10 to 50 tables connected by foreign keys. The relational structure (customers → orders → products → reviews) carries critical predictive information.

Multi-hop pattern (3 tables, invisible to the LLM)

| Hop | Table | Data | Signal |
|---|---|---|---|
| 1 | customers | Customer 48291 placed 8 orders | Moderate activity |
| 2 | order_items | 6 of 8 orders included Product P-442 | Strong product affinity |
| 3 | reviews (by other customers) | P-442 avg rating dropped from 4.5 to 2.1 in 60 days | Product quality collapse |
| Signal | — | Customer 48291 is loyal to a product that is failing | Churn risk: HIGH |

The churn signal sits at hop 3, where OTHER customers' reviews of the SAME product reveal a quality collapse. Reaching it requires traversing customers to orders to products to reviews. An LLM processing serialized rows from the customers table cannot reach this signal.

What the LLM actually sees (serialized text)

| Input format | Text sent to LLM | Can it detect the signal? |
|---|---|---|
| Customer row | "48291 \| Enterprise \| $15,847.32 \| 2024-03-15" | No (no order or product data) |
| + Order rows | "O-7823 \| 48291 \| P-442 \| $89.00 \| 2025-01-15" | Partial (sees product ID, not reviews) |
| + Review rows (all 4M) | Cannot fit: 4M reviews × ~80 tokens = 320M tokens | Impossible (exceeds any context window) |

To detect the multi-hop churn signal, the LLM would need the customer row, their 8 order rows, and the review history of product P-442 from OTHER customers. At roughly 320M tokens, the 4M-row review table alone exceeds even a 1M-token context window by more than 300x.

LLMs have no mechanism for representing relational structure. You can serialize multiple tables as text, but the foreign key relationships become implicit references ("order 7823 references customer 48291") rather than structural connections. The model has to parse text to reconstruct relationships that a graph model represents natively.

Multi-hop patterns (a customer's churn depends on the return rates of products they bought, which depends on the manufacturer's quality metrics) require traversing 3-4 tables through foreign keys. In a graph, this is 3-4 hops of message passing. In serialized text, this requires the LLM to piece together scattered text references across multiple serialized tables. In practice, LLMs fail at this.
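In a graph representation, that traversal is just edge-following and aggregation. A toy sketch with rows shaped like the example above (illustrative data; real GNNs learn the aggregation rather than hard-coding an average):

```python
from collections import defaultdict

# Toy rows shaped like the three-table example above (illustrative data only).
orders  = [("O-7823", 48291, "P-442"), ("O-7901", 48291, "P-442")]
reviews = [("P-442", 2.1), ("P-442", 1.8), ("P-977", 4.6)]  # by OTHER customers

# Foreign keys become edges: customer -> product via order rows.
bought = defaultdict(set)
for _order_id, customer_id, product_id in orders:
    bought[customer_id].add(product_id)

# Reviews attach to product nodes; aggregation passes them back along the edges.
ratings = defaultdict(list)
for product_id, rating in reviews:
    ratings[product_id].append(rating)

# The hop-3 signal for customer 48291: recent ratings of the products they buy.
signal = {p: round(sum(ratings[p]) / len(ratings[p]), 2) for p in bought[48291]}
print(signal)  # {'P-442': 1.95} -- the quality collapse reaches the customer node
```

Two dictionary lookups replace what, in serialized text, would require the model to re-parse foreign key references scattered across millions of tokens.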

LLM on tabular data

  • Next-token prediction objective
  • Row-order dependent (positional embeddings)
  • Serializes numbers as token sequences
  • Cannot fit large tables in context window
  • No native multi-table representation

Graph transformer (KumoRFM)

  • Relational pattern learning objective
  • Permutation-invariant over rows
  • Native numerical and categorical encoding
  • Processes millions of rows as graph structure
  • Multi-table relationships as edges in the graph

Serialization methods compared

| Method | Tokens per row | Rows in 128K context | % of 10M-row table | Preserves types |
|---|---|---|---|---|
| Row-by-row text | ~80 | ~1,600 | 0.016% | No |
| Markdown table | ~50 | ~2,500 | 0.025% | Partial |
| JSON objects | ~120 | ~1,000 | 0.010% | Better |
| CSV format | ~30 | ~4,200 | 0.042% | No |
| Graph representation | N/A | All 10M rows | 100% | Yes (native encoding) |

The key row is the last one: graph-based representations process all rows because they do not serialize data into text tokens. They encode numerical values, categorical types, and relationships natively.
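The coverage figures are simple arithmetic on the tokens-per-row estimates (the loop below reproduces the table's numbers to within rounding):

```python
CONTEXT_TOKENS = 128_000
TABLE_ROWS = 10_000_000

# Tokens-per-row figures mirror the rough estimates in the table above.
tokens_per_row = {"row_by_row": 80, "markdown": 50, "json": 120, "csv": 30}

coverage = {}
for method, tpr in tokens_per_row.items():
    rows_fit = CONTEXT_TOKENS // tpr          # rows that fit in one context
    coverage[method] = 100 * rows_fit / TABLE_ROWS
    print(f"{method:10s} {rows_fit:>6,} rows  {coverage[method]:.3f}% of table")
```

Even the most compact serialization (CSV) covers under 0.05% of a 10M-row table, and a 1M-token window only moves that to about 0.3%.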

PQL Query

PREDICT churn
FOR EACH customers.customer_id
WHERE customers.is_active = true

Compare what an LLM and a graph transformer see for the same prediction. The LLM serializes 0.025% of rows as text. The graph transformer sees all rows, all tables, all foreign key relationships, with native type encoding.

Output

| customer_id | LLM prediction | KumoRFM prediction | Actual | Outcome |
|---|---|---|---|---|
| 48291 | 0.35 | 0.12 | No churn | FM correct |
| 72104 | 0.51 | 0.84 | Churned | FM correct |
| 55893 | 0.44 | 0.91 | Churned | FM correct, LLM missed |
| 63017 | 0.62 | 0.07 | No churn | FM correct, LLM false alarm |

What works instead

The approaches that work well on structured data share a common property: they match the model architecture to the data structure.

Gradient boosted trees (single-table)

For flat, single-table data with pre-engineered features, XGBoost and LightGBM remain strong. They handle heterogeneous column types natively, are invariant to feature scaling, and learn non-linear relationships through decision splits. On Kaggle tabular benchmarks, they consistently outperform LLMs.
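The core operation is a threshold split on raw values, which is why trees need no scaling and no tokenization of numbers. A minimal sketch with toy data (illustrative values, not benchmark results):

```python
def best_stump(xs, ys):
    """Best single threshold split by misclassification count -- the basic
    operation gradient boosted trees repeat across thousands of nodes."""
    best = None
    for t in sorted(set(xs)):
        errs = sum((x <= t) != y for x, y in zip(xs, ys))
        errs = min(errs, len(ys) - errs)  # either side may be the positive leaf
        if best is None or errs < best[1]:
            best = (t, errs)
    return best

# Toy data: in this tiny sample, low revenue goes with churn.
revenue = [891.50, 2340.00, 15847.32, 430.00]
churned = [True, True, False, True]
print(best_stump(revenue, churned))  # (2340.0, 0): a perfect split on raw dollars
```

The split at 2340.0 separates the classes exactly because the tree compares numbers as numbers, where an LLM would be comparing digit-chunk tokens.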

Limitation: they require a flat feature table, so the feature engineering bottleneck remains for multi-table data.

Graph neural networks (multi-table)

GNNs represent the relational database as a graph and learn patterns through message passing. This is architecturally correct: the model structure matches the data structure. On RelBench, supervised GNNs score 75.83 AUROC, outperforming both LightGBM (62.44) and Llama 3.2 3B (68.06).

Limitation: you need to train a new GNN for each prediction task.

Relational foundation models

KumoRFM combines the graph-native architecture of GNNs with the pre-training paradigm of foundation models. It is trained on thousands of diverse relational databases, learning universal patterns that transfer to new data. At inference time, it delivers predictions from raw relational data without task-specific training.

This achieves 76.71 AUROC zero-shot, outperforming both LLMs and supervised GNNs. Fine-tuning pushes accuracy to 81.14.

When LLMs do help with data

LLMs are not useless in the data ecosystem. They are just wrong for the specific task of making predictions on structured data. They excel at adjacent tasks:

  • Natural language interfaces. Translating business questions ("which customers are at risk of churning?") into structured queries or PQL statements.
  • Data documentation. Generating descriptions of tables, columns, and data dictionaries from schema inspection.
  • Result interpretation. Explaining predictions in natural language for business stakeholders who do not read AUROC scores.
  • Code generation. Writing SQL, Python, or PQL queries from natural language descriptions.

The right architecture for predictions on structured data is one that was designed for structured data. LLMs were designed for language. Use each where it fits.

The takeaway

The instinct to throw LLMs at every problem is understandable. They are the most capable general-purpose AI tools ever built. But "general-purpose" does not mean "optimal for every purpose."

Structured relational data has specific properties (unordered rows, heterogeneous types, multi-table relationships, temporal dynamics) that require specific architectural choices. Graph transformers, pre-trained on relational data, match these properties. LLMs do not.

The 8.65 AUROC gap between Llama 3.2 3B (68.06) and KumoRFM (76.71) is not a tuning problem. It is a structural mismatch between the training objective and the task. Scaling the LLM larger will narrow the gap, but it will not close it, because the architecture is solving the wrong problem.

If your data lives in relational tables and you want accurate predictions, use a model that was built to read relational tables.

Frequently asked questions

Can GPT-4 or other LLMs make predictions on tabular data?

They can try, but the results are poor compared to purpose-built approaches. LLMs trained on text lack the inductive biases needed for tabular data: they cannot natively handle numerical distributions, cross-row patterns, multi-table relationships, or temporal sequences. On the RelBench benchmark, Llama 3.2 3B scored 68.06 AUROC on classification tasks, while KumoRFM scored 76.71 zero-shot.

Why is next-token prediction wrong for tabular data?

Next-token prediction trains a model to predict the next word in a sequence. Tabular data is not sequential text. Rows are unordered, columns have heterogeneous types (numerical, categorical, timestamps), and the predictive patterns are statistical relationships across rows and tables, not sequential dependencies. Serializing a table as text and predicting the next token forces a text-shaped objective onto a fundamentally non-text problem.

What about fine-tuning an LLM specifically on tabular data?

Fine-tuning helps, but it cannot overcome the architectural mismatch. LLM architectures (decoder-only transformers with positional embeddings for sequential text) are not designed for the permutation-invariant, multi-type, multi-table structure of relational data. Research from multiple groups shows that even fine-tuned LLMs underperform gradient boosted trees on most tabular benchmarks.

What model architecture actually works for structured data?

Graph neural networks and graph transformers, which represent relational databases as temporal heterogeneous graphs. Each row becomes a node, each foreign key becomes an edge, and the model learns patterns by passing messages along the relational structure. KumoRFM uses this approach, pre-trained on thousands of databases, and achieves 76.71 AUROC zero-shot on RelBench classification tasks.

Could future LLMs get better at tabular data?

Possibly, but the fundamental training objective mismatch would need to change. Some research explores hybrid approaches that combine language understanding with structured data processing. However, for pure predictive tasks on relational data, purpose-built architectures like graph transformers have a structural advantage because they match the data's native topology.

See it in action

KumoRFM delivers predictions on relational data in seconds. No feature engineering, no ML pipelines. Try it free.