Kumo.ai Research · April 28, 2025

Relational Graph Transformers: A New Frontier in AI for Relational Data

How graph-aware attention overcomes the fundamental limitations of message-passing GNNs, and why this architecture is the backbone for foundation models on relational databases.

Federico Lopez, Matthias Fey, Jure Leskovec
01

From GNNs to Graph Transformers

Graph Neural Networks (GNNs) brought a powerful idea to relational data: treat rows as nodes, foreign keys as edges, and let the model learn representations by passing messages along those connections. For years, message-passing GNNs were the standard approach to learning from relational databases converted into graph form.

The process is straightforward. An enterprise database with tables like customers, transactions, and products becomes a heterogeneous graph. Each row is a node. Each foreign key relationship (a transaction referencing a customer, a product linked to a category) becomes an edge. The GNN then propagates information along these edges, layer by layer, allowing each node to gather context from its neighbors.
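As a concrete sketch, the row-to-node, foreign-key-to-edge mapping can be written in a few lines of plain Python. The table and column names below are illustrative only, not part of any real schema:

```python
# Minimal sketch: convert relational rows + foreign keys into a
# heterogeneous graph. Tables are dicts of rows; each row has an "id"
# primary key. Names and data are invented for illustration.

tables = {
    "customers":    [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Bo"}],
    "transactions": [
        {"id": 10, "customer_id": 1, "amount": 30.0},
        {"id": 11, "customer_id": 1, "amount": 12.5},
        {"id": 12, "customer_id": 2, "amount": 99.0},
    ],
}
# Foreign-key spec: (child_table, fk_column) -> parent_table
foreign_keys = {("transactions", "customer_id"): "customers"}

def to_graph(tables, foreign_keys):
    nodes = []   # one typed node per row: (table, primary_key)
    edges = []   # one typed edge per FK value: (child_node, parent_node)
    for table, rows in tables.items():
        for row in rows:
            nodes.append((table, row["id"]))
    for (child, fk_col), parent in foreign_keys.items():
        for row in tables[child]:
            edges.append(((child, row["id"]), (parent, row[fk_col])))
    return nodes, edges

nodes, edges = to_graph(tables, foreign_keys)
print(len(nodes), len(edges))  # 5 nodes, 3 edges
```

Note how the two transactions of customer 1 are never directly connected: any path between them runs through the customer node, which is exactly the relay-point structure discussed below.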

This approach works. It captures relational structure directly from the data without manual feature engineering. But as researchers pushed GNNs to handle larger, more complex relational databases with deeper multi-hop dependencies, fundamental limitations emerged. These limitations are not implementation bugs. They are structural constraints of the message-passing paradigm itself.

The evolution from GNNs to graph transformers follows a pattern familiar from natural language processing. Recurrent neural networks processed text sequentially, word by word, just as GNNs process graphs hop by hop. Transformers replaced that sequential bottleneck with global attention, allowing every token to attend to every other token directly. Relational Graph Transformers apply the same principle to relational data: every node can directly interact with any other node, regardless of graph distance.

02

Why Message-Passing GNNs Hit a Wall

To understand why Relational Graph Transformers are necessary, you need to understand the specific failure modes of message-passing GNNs on relational data. There are three core problems: the multi-hop bottleneck, over-squashing, and limited expressiveness.

The multi-hop bottleneck

In a typical relational database converted to a graph, related entities are often separated by multiple hops. Consider a simple e-commerce schema: a customer's transactions are always two hops away from each other, connected only through the shared customer node. In a two-layer GNN, each transaction node can only receive information from its direct neighbors (one hop out per layer). For two transactions to “see” each other, information must travel: transaction A → customer → transaction B. The customer node becomes a mandatory relay point for all transaction-to-transaction communication.

Adding more GNN layers extends the receptive field, but at a cost. Each additional layer increases computation, adds parameters, and makes training less stable. In practice, most GNNs use 2-4 layers, which limits the effective range of information flow to 2-4 hops. For relational databases where important patterns span 5+ hops across tables, this is a hard ceiling.

Over-squashing

Even when a GNN has enough layers to theoretically reach distant nodes, the information from those nodes gets compressed through bottleneck nodes. This is over-squashing: the phenomenon where information from an exponentially growing number of nodes must be compressed into a fixed-dimensional vector at each intermediate node.

In relational data, this problem is severe. A customer node might connect to hundreds of transactions, each connected to products, categories, and reviews. After two hops, the customer node must somehow compress information from thousands of nodes into a single fixed-size vector. Inevitably, information is lost. Subtle but important patterns (a specific product-category combination that predicts churn) get washed out in the aggregation.

Limited expressiveness

The Weisfeiler-Leman (WL) test is a classical heuristic for graph isomorphism testing, and standard message-passing GNNs are provably bounded by the 1-WL test in their ability to distinguish graph structures. This means there are pairs of non-isomorphic graphs that message-passing GNNs will always map to the same representation. They literally cannot tell them apart.

For relational data, this limitation means that certain structural patterns in the database graph (specific configurations of how entities are connected) are invisible to the model. Graph transformers, by attending globally rather than aggregating locally, can exceed this expressiveness bound.
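The 1-WL bound is easy to demonstrate. The sketch below runs color refinement, the procedure that upper-bounds message-passing GNNs, on a hexagon versus two disjoint triangles: the two graphs are non-isomorphic, but both are 2-regular, so refinement never tells them apart:

```python
# 1-WL color refinement: iteratively re-color each node by hashing
# (own color, multiset of neighbor colors). Two graphs that end with
# identical color histograms are indistinguishable to any standard
# message-passing GNN.

def wl_colors(adj, rounds=3):
    colors = {v: 0 for v in adj}          # uniform initial coloring
    for _ in range(rounds):
        colors = {
            v: hash((colors[v], tuple(sorted(colors[u] for u in adj[v]))))
            for v in adj
        }
    return sorted(colors.values())        # order-free color histogram

# Hexagon C6 vs. two disjoint triangles: non-isomorphic, both 2-regular.
hexagon = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1],
                 3: [4, 5], 4: [3, 5], 5: [3, 4]}

print(wl_colors(hexagon) == wl_colors(two_triangles))  # True
```

Every node in both graphs always sees two neighbors of its own color, so the refinement is stuck at one color class and the histograms match, even though one graph is connected and the other is not.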

Message-Passing GNN

Structural limitations

  • +Efficient local computation
  • +Scales to large graphs per layer
  • +Well-studied, mature tooling
  • −Multi-hop bottleneck (2-4 hop limit)
  • −Over-squashing loses distant signals
  • −Bounded by 1-WL expressiveness
  • −Transactions can't directly communicate

Standard Transformer

Missing graph structure

  • +Global attention (no hop limit)
  • +No over-squashing
  • +Proven on text and images
  • −Ignores graph topology entirely
  • −Treats all nodes as a flat sequence
  • −No awareness of edge types or schema
  • −O(n^2) attention on full graph is infeasible

Relational Graph Transformer

Best of both worlds

  • +Global attention with graph-aware bias
  • +Respects relational schema and edge types
  • +Exceeds 1-WL expressiveness
  • +Scalable via subgraph sampling
  • −More complex architecture
  • −Requires positional encoding design

03

From Database to Graph: The Conversion Pipeline

Before a Relational Graph Transformer can operate on enterprise data, the relational database must be converted into a graph representation. This conversion is not a loose analogy. It is a precise, schema-preserving transformation that proceeds in three distinct steps.

1

Entity Representation

Each database row becomes a structured entity with primary key, foreign keys, descriptive attributes, and optional timestamps.

2

Schema Graph

Tables become node types; foreign key relationships define edge types. This blueprint captures the topology of the database.

3

Relational Entity Graph

Individual rows become nodes connected through foreign key edges, forming a heterogeneous typed network.

Step 1: Entity representation

Each row in every table is treated as an entity with four components: a primary key (unique identifier), foreign keys (links to other tables), descriptive attributes (the actual data columns), and optional timestamps for temporal events. A transaction row, for example, has its own ID, a foreign key to the customer who made it, attributes like amount and payment method, and a timestamp for when it occurred.

Step 2: Schema graph

The database schema itself becomes a graph. Each table is a node type. Each foreign key relationship is an edge type. An e-commerce database with customers, transactions, and products tables produces a schema graph with three node types and two edge types (customer-to-transaction, transaction-to-product). This schema graph is the blueprint: it defines what kinds of nodes and edges will appear in the full entity graph.

Step 3: Relational entity graph

The actual data populates the schema blueprint. Every row becomes a concrete node of its table's type. Every foreign key value creates an edge to the referenced row. The result is a heterogeneous graph where “Customer” nodes are fundamentally different from “Product” nodes, and “purchased” edges are different from “reviewed” edges. This type information is preserved throughout the entire transformer pipeline.

Handling multi-modal attributes

Real database columns contain diverse data types: numbers, categories, free text, images, timestamps. The architecture handles each modality with a specialized encoder before fusing them into unified node features:

  • Numerical features pass through MLPs after normalization.
  • Categorical variables use learnable embedding tables.
  • Text fields employ pre-trained language models like Sentence-BERT.
  • Image data leverages pre-trained CNNs or vision transformers.
  • Timestamps are treated as categorical, continuous, or cyclic features depending on context.

All modality embeddings are fused into single row vectors using PyTorch Frame, a modular framework for multi-modal tabular learning. These fused vectors serve as the initial node features for the transformer.
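PyTorch Frame's real encoders are learned neural modules; the toy stand-ins below only illustrate the encode-then-fuse pattern, with every function name, dimension, and formula invented for the example:

```python
# Toy per-modality encoders (stand-ins for learned modules) fused by
# concatenation into one node-feature vector. Illustrative only.
import math

def encode_numerical(x, dim=4):
    # stand-in for a normalization + MLP encoder
    return [x / (i + 1) for i in range(dim)]

def encode_categorical(cat, vocab, dim=4):
    # stand-in for an embedding-table lookup (one-hot here)
    idx = vocab.index(cat)
    return [float(idx == i) for i in range(dim)]

def encode_timestamp(t, period=7, dim=4):
    # cyclic encoding, e.g. day-of-week on a circle
    angle = 2 * math.pi * (t % period) / period
    return [math.sin(angle), math.cos(angle), 0.0, 0.0]

# Fuse one transaction row into a single feature vector.
row = {"amount": 42.0, "method": "card", "day": 3}
vec = (encode_numerical(row["amount"])
       + encode_categorical(row["method"], ["card", "cash", "wire"])
       + encode_timestamp(row["day"]))
print(len(vec))  # 12-dimensional fused node feature
```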

04

The Relational Graph Transformer Architecture

The core architectural innovation is straightforward in principle: replace message-passing with attention, but do it in a way that respects the relational graph's structure. A naive approach would apply standard transformer attention across all nodes in the graph. This fails for two reasons: it ignores the graph topology entirely, and it requires O(n^2) computation across millions of nodes, which is infeasible at enterprise scale.

Relational Graph Transformers solve both problems through three key design decisions: graph-aware attention, relation-aware edge encoding, and scalable subgraph sampling.

Graph-aware attention

Unlike fully connected generic transformers, the Relational Graph Transformer incorporates the relational graph's topology directly into the attention mechanism. Nodes primarily focus on their local neighbors (preserving the useful inductive bias of GNNs) while still being able to capture long-range dependencies through attention. The attention weights are biased by graph structure: nearby nodes receive higher baseline attention, but the model can learn to override this when distant nodes carry important signal.

Relation-aware edge encoding

In a relational database, not all edges are the same. A “purchased” edge between a customer and a product carries different semantics than a “reports-to” edge between two employees. The architecture incorporates relation-aware attention mechanisms with specialized weights for different edge types. This respects the database schema constraints while letting the model learn the semantic meaning of each relationship type.

Scalable subgraph sampling

Enterprise databases produce graphs with millions of nodes and billions of edges. Running global attention over the entire graph is impossible. Instead, the architecture breaks down the graph into subcomponents based on the schema, focusing on smaller, manageable subgraphs at a time. The experimental setup uses neighbor sampling with two hops of 15 neighbors, creating focused local subgraphs where the transformer can apply its full attention mechanism efficiently.

1

Multi-Modal Encoding

PyTorch Frame encodes numerical, categorical, text, image, and timestamp features into unified node vectors.

2

Subgraph Sampling

2-hop neighbor sampling (15 neighbors per hop) creates focused local subgraphs from the full entity graph.

3

Relational Attention

4 transformer layers with 8 attention heads apply graph-aware, relation-typed attention within each subgraph.

4

Prediction

512-dimensional FFN produces task-specific outputs: classification, regression, or ranking.

05

Positional Encodings for Relational Structure

In standard transformers for text, positional encodings tell the model where each token sits in the sequence. Without them, the transformer would treat “the cat sat on the mat” identically to “mat the on sat cat the.” For graphs, the challenge is harder: there is no natural linear ordering. Nodes exist in a topological structure, and the model needs to understand that structure to reason about relational data.

Relational Graph Transformers address this with four composable local positional encodings. The key word is “local”: these encodings are computed within each sampled subgraph, avoiding the expense of precomputing global positional information across the entire database graph.

1. Hop encoding

The simplest structural signal: how far apart are two nodes within the subgraph? Hop encoding assigns each node a distance value relative to the target node. Nodes that are one hop away (direct neighbors) get a different positional signal than nodes two hops away. This gives the transformer an explicit notion of graph proximity, similar to how sinusoidal encodings give text transformers a notion of sequence proximity.

2. Tree encoding

Relational databases often have hierarchical structure: a company has departments, departments have teams, teams have employees. Tree encoding captures parent-child relationships within the subgraph, following the approaches of Shiv and Quirk (2019) and Peng et al. (2022). This allows the transformer to distinguish between a node's parent, its sibling, and its child, even when all three are equidistant in raw hop count.

3. Message passing encoding

Nodes receive random embeddings that are then refined through a lightweight Graph Neural Network. This approximates node2vec-style structural embeddings without the precomputation cost. The GNN distills local neighborhood structure into each node's positional encoding, giving the transformer access to structural features like degree, clustering coefficient, and local connectivity patterns.

4. Time encoding

For temporal relational data (transactions, events, interactions), time encoding ensures that nodes only receive information from entities with earlier timestamps. This prevents data leakage: the model cannot use future information to predict past events. Time encoding also captures temporal proximity, so the model can distinguish between a transaction that happened yesterday and one from six months ago.

Positional Encoding Strategies Compared

| Encoding | What It Captures | Computation | Key Benefit |
| --- | --- | --- | --- |
| Hop Encoding | Graph distance between nodes | Local (per subgraph) | Explicit proximity signal |
| Tree Encoding | Parent-child hierarchy | Local (per subgraph) | Distinguishes siblings from ancestors |
| MP Encoding | Local topology / structural role | Light GNN pass | Approximates node2vec cheaply |
| Time Encoding | Temporal ordering of events | Local (per subgraph) | Prevents data leakage, captures recency |

06

Benchmark Results on RelBench

The empirical evaluation uses RelBench, a public benchmark designed for predictive tasks over relational databases using graph-based models. RelBench provides standardized datasets, train/test splits, and evaluation metrics, making it possible to compare approaches on equal footing.

Experimental setup

The configuration is precise and reproducible. Both GNN and transformer models use the same neighbor sampling strategy: two hops of 15 neighbors. Node dimension is set to 128 across all models. The GNN baseline evaluates four hyperparameter configurations and reports the best result. The transformer evaluates three hyperparameter sets, reporting the best average over two runs. The transformer architecture uses four layers, eight attention heads, and a 512-dimensional feed-forward network. As an additional baseline, LightGBM is trained on raw entity table features without any graph-based feature engineering.

Head-to-head results

The results are clear. Across RelBench datasets, Graph Transformers outperform the GNN baseline by around 10% and the LightGBM baseline by over 40%. The improvement over LightGBM confirms that relational structure carries substantial predictive signal that flat-table approaches cannot access. The improvement over GNNs confirms that the transformer attention mechanism captures relational patterns that message passing misses.

Performance Comparison Across RelBench (Approximate Improvements)

| Model | Architecture | vs. LightGBM | vs. GNN Baseline |
| --- | --- | --- | --- |
| LightGBM | Gradient-boosted trees on flat features | Baseline | ~40% below |
| GNN (best of 4 configs) | Message-passing, 2-hop, dim 128 | ~40% above | Baseline |
| Relational Graph Transformer | 4 layers, 8 heads, dim 128, FFN 512 | over 40% above | ~10% above |

A critical detail: both the GNN and transformer models use the same 2-hop neighbor sampling. They see the same local neighborhood. The difference is how they process that neighborhood. The GNN aggregates information strictly along edges, one hop at a time. The transformer allows every node in the subgraph to attend to every other node directly. With identical input data, the transformer extracts more signal because it can naturally aggregate information from distant or even unconnected nodes within the sampled subgraph.

Where transformers particularly shine

The advantage is most pronounced on tasks that require reasoning across multiple relationship types and long-range dependencies. When a prediction depends on how a customer's transactions relate to product categories, and how those categories relate to seasonal trends, the transformer can capture these multi-hop, cross-table patterns directly. The GNN must compress all of that through intermediate bottleneck nodes, losing signal at each hop.

Some tasks show marginal differences between the two architectures. These tend to be tasks where the predictive signal is concentrated in immediate neighbors, where the GNN's local aggregation is sufficient. The transformer still matches GNN performance on these tasks. It does not sacrifice local reasoning ability for long-range capability.

07

From Benchmark to Production

Academic benchmarks and production systems operate under different constraints. RelBench datasets fit in memory. Enterprise graphs contain millions of nodes and billions of edges. Benchmark evaluation happens offline, once. Production systems must deliver predictions with low latency, handle cold-start entities, and remain resilient to data sparsity.

Scalability through schema-guided sampling

The subgraph sampling strategy scales naturally. Rather than computing attention over the entire graph, the system samples a local subgraph around each target entity. The schema graph guides this sampling: it determines which tables and relationship types are relevant for a given prediction task. A churn prediction focuses on customer-transaction-product subgraphs. A fraud detection task might emphasize transaction-account-device subgraphs. The schema provides the blueprint for efficient, task-relevant sampling.

Prediction speed

The 2-hop, 15-neighbor sampling creates subgraphs of bounded size. Each prediction requires a forward pass through the transformer on a subgraph of at most 241 nodes: the target node, up to 15 hop-1 neighbors, and up to 15 × 15 = 225 hop-2 neighbors. This is a small, fixed-size computation regardless of the total graph size. Latency is predictable and scales with the sampling budget, not the database size.

Cold-start generalization

New entities (a customer who just signed up, a product just added to the catalog) have no interaction history. Message-passing GNNs struggle here because there are no edges to pass messages along. The transformer's multi-modal feature encoding provides a baseline: even without graph neighbors, the entity's own attributes (demographics, product description, price) are encoded through PyTorch Frame and can contribute to predictions. As interactions accumulate, the graph structure enriches the representation incrementally.

Quantified production benefits

Kumo.ai reports three quantified benefits from deploying Relational Graph Transformers in production environments:

  • 20x faster time-to-value compared to traditional feature engineering and model training pipelines.
  • 30-50% accuracy improvements through deeper understanding of relational context that manual feature engineering cannot capture.
  • 95% reduction in data preparation effort because the model operates directly on the relational database structure without requiring flattening or manual feature engineering.
08

Implications for Foundation Models

Relational Graph Transformers are not just a better GNN. They represent the architectural backbone for building foundation models over relational data. The connection is direct: foundation models require architectures that can generalize across diverse data distributions and task types. Message-passing GNNs, with their structural limitations, cannot serve as that backbone. Transformers can.

Why transformers enable transfer and generalization

Three properties of the Relational Graph Transformer architecture make it suitable for foundation model pre-training:

  • Schema-agnostic attention. The attention mechanism operates on node features and positional encodings, not on hard-coded schema structure. A model pre-trained on e-commerce data (customers, products, transactions) can transfer its learned attention patterns to healthcare data (patients, visits, doctors) because the underlying relational patterns (entity-event-entity structure, temporal sequencing, hierarchical grouping) are analogous.
  • Multi-modal feature encoding. PyTorch Frame handles any combination of column types. A foundation model pre-trained on databases with numerical and categorical features can process a new database with text and image columns without architectural changes.
  • Composable positional encodings. The four positional encoding strategies (hop, tree, message passing, time) provide structural context that generalizes across schemas. A “two hops away, same hierarchy level, recent interaction” pattern means something structurally similar whether the data is financial transactions or social network interactions.

The research trajectory

This work builds on and extends several research threads: Relational Deep Learning (arXiv:2312.04615) established the framework for converting relational databases to graphs. PyTorch Frame (arXiv:2404.00776) provided the multi-modal tabular encoding layer. RelBench (arXiv:2407.20060) created standardized benchmarks for evaluation. Relational Graph Transformers bring these pieces together: the relational data representation, the multi-modal encoding, and the scalable attention mechanism that overcomes the limitations of message-passing GNNs.

The benchmark results confirm the direction. A ~10% improvement over GNNs and a ~40% improvement over flat-table approaches on standardized benchmarks, using the same input data and sampling strategy, validates the architectural hypothesis. The remaining research questions are about scale: how much does performance improve with larger pre-training datasets, more diverse schemas, and bigger models? The transformer architecture makes these scaling experiments possible in a way that the GNN architecture does not.

Traditional ML Pipeline

One model per task

  • +Well-understood, interpretable
  • +Fast inference for simple tasks
  • Manual feature engineering per task
  • Cannot capture multi-hop relational patterns
  • 12+ hours per task for a senior data scientist
  • No transfer between tasks or schemas

GNN on Relational Data

Better, but constrained

  • +Automatic feature learning from graph structure
  • +Captures local relational patterns
  • Over-squashing limits long-range reasoning
  • 1-WL expressiveness ceiling
  • Not suitable as foundation model backbone

Relational Graph Transformer

Foundation model ready

  • +Schema-agnostic, transfers across databases
  • +Multi-modal, handles any column type
  • +Composable positional encodings generalize
  • +Scalable through subgraph sampling
  • +~10% over GNNs, ~40% over flat-table baselines
