Graph Transformers in Kumo
April 3, 2025
Federico Lopez

In this blog post, we explore why Graph Transformers are particularly well-suited for relational data and how Kumo leverages them to enhance predictive modeling. We’ll dive into their unique advantages, our approach to integrating them into the Kumo platform, and the impact they have on performance across various tasks.
Relational Deep Learning generalizes deep learning to relational data (data spread across multiple tables, e.g., customers-transactions-products). Kumo is a cutting-edge AI platform that enables businesses to build highly accurate predictive models directly on relational data. By leveraging advanced Graph Neural Networks (GNNs) and Graph Transformers, Kumo eliminates the need for feature engineering and complex ML pipelines. This breakthrough approach unlocks the full potential of business data, enabling personalized recommendations, fraud detection, and forecasting with unmatched speed and accuracy.
What is a Graph Transformer?
Let’s start by defining what we actually mean by a Graph Transformer. While traditional Graph Neural Networks rely on message passing between connected nodes, Graph Transformers take a different approach by leveraging self-attention to capture long-range dependencies across the entire graph.
A Graph Transformer extends the traditional Transformer architecture to process graph-structured data by incorporating relational inductive biases into the attention mechanism. This adaptation enables the model to capture both local and global dependencies among nodes without relying on explicit message passing. For a detailed explanation of Transformers, please refer to the "Attention Is All You Need" paper.
By removing the reliance on sequential message passing, Graph Transformers enable nodes to attend to any other node within the subgraph at every layer. This property allows them to effectively capture long-range dependencies and complex relational patterns, making them particularly powerful for graph-structured data.
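To make this concrete, here is a minimal sketch of a single Graph Transformer layer in PyTorch: multi-head self-attention applied over all node embeddings of a sampled subgraph, with no adjacency mask, so every node can attend to every other node. This is an illustrative simplification, not Kumo's production implementation; the module structure and tensor shapes are assumptions.

```python
import torch
import torch.nn as nn

class GraphTransformerLayer(nn.Module):
    """Simplified layer: full self-attention over all nodes in a subgraph."""

    def __init__(self, dim: int = 128, heads: int = 8, ffn_dim: int = 512):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, num_nodes, dim] node embeddings of a sampled subgraph.
        # Every node attends to every other node; no edge mask is applied.
        h, _ = self.attn(x, x, x)
        x = self.norm1(x + h)
        return self.norm2(x + self.ffn(x))

# Example: one subgraph with 20 nodes and 128-dimensional features.
layer = GraphTransformerLayer()
out = layer(torch.randn(1, 20, 128))
```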
Why Graph Transformers for Relational Data?
Graph Transformers have gained traction in research, particularly for applications like molecular modeling, where graphs are relatively small and structured. However, their adoption in large-scale, real-world industrial settings remains limited. Deploying standard Graph Neural Networks at scale is already a significant challenge, and Graph Transformers introduce additional complexity, both in terms of computational cost and scalability. At Kumo, we have engineered both GNNs and Graph Transformers to work seamlessly on massive relational datasets—where graphs contain millions of nodes and billions of edges—unlocking their potential beyond small molecule graphs and into enterprise-scale AI applications.
Among the many AI models that Kumo offers, Graph Transformers stand out for their ability to capture complex relationships in relational data. Unlike traditional models that struggle with multi-table structures, Graph Transformers excel at understanding intricate connections between entities, even when they are not directly linked. By leveraging attention mechanisms, they can efficiently model long-range dependencies, making them particularly powerful for predictions that require information from multiple, interconnected sources.

For example, imagine an e-commerce database with three tables: Products, Customers, and Transactions. In a traditional relational setup, a customer's interactions are mostly limited to their direct transactions, forming a structured but somewhat constrained view of relationships. If we use a standard message-passing GNN, transactions are always two hops away from each other, connected only through a shared customer or product. This means that transaction-to-transaction interactions are indirect and require multiple layers of message passing to propagate information. Furthermore, products would never directly interact with each other in a two-layer GNN, since their messages would have to pass through both a transaction and a customer, making long-range dependencies difficult to capture. This structural limitation can hinder the model’s ability to recognize complex relationships that extend beyond immediate connections.
A Graph Transformer, on the other hand, sees the data as a fully connected graph, allowing every node to directly interact with any other. This means that transactions can now exchange information without needing to pass through the shared customer, overcoming the two-hop limitation of standard message passing. Similarly, products are no longer isolated from each other, since they can directly attend to other products, transactions, or customers, capturing more complex relationships. By leveraging attention mechanisms, the model dynamically weighs the importance of these distant connections, uncovering insights that would be harder to capture with sequential message-passing approaches. This holistic perspective enables richer, more accurate predictions, making Graph Transformers particularly powerful for relational data.
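As a rough illustration of this schema (using PyTorch Geometric's HeteroData purely as an example container; the table sizes, feature dimensions, and edge names below are made up, not Kumo's internal representation), the three tables map onto a heterogeneous graph like this:

```python
import torch
from torch_geometric.data import HeteroData

# Toy version of the e-commerce schema from the example above.
data = HeteroData()
data["customer"].x = torch.randn(100, 16)      # 100 customers, 16 features each
data["product"].x = torch.randn(50, 16)        # 50 products
data["transaction"].x = torch.randn(500, 16)   # 500 transactions

# Foreign keys become edges: each transaction links one customer and one product.
data["customer", "makes", "transaction"].edge_index = torch.stack(
    [torch.randint(0, 100, (500,)), torch.arange(500)]
)
data["transaction", "contains", "product"].edge_index = torch.stack(
    [torch.arange(500), torch.randint(0, 50, (500,))]
)

# In a 2-layer message-passing GNN, information flows only along these edges,
# so product-to-product influence never materializes. A Graph Transformer
# instead lets any node in the sampled subgraph attend to any other node.
```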

Positional Encoding
In order for a Graph Transformer to make sense of relational data, we first need to talk about positional encodings. Since transformers naturally treat each subgraph as fully connected, where every node can directly interact with any other, positional information helps the model understand the structure of the data. Without it, the model would have no way of distinguishing between nodes based on their relative positions, leading to a loss of important relational context.
At Kumo, we work with graphs containing millions of nodes and billions of edges, making global positional encodings, such as Laplacian or Random Walk-based methods, very expensive to precompute. Additionally, we primarily handle temporal graphs that evolve over time, rendering static global encodings quickly outdated and potentially misleading. Instead, we focus on local positional encodings that can be efficiently derived during training, ensuring scalability without sacrificing accuracy while adapting to the dynamic nature of real-world graph data.
We introduce the following positional encodings, designed to combine and compose seamlessly with one another. By incorporating these encodings, we provide the transformer with a sense of "where" each node is located within the subgraph, enabling more structured and meaningful predictions.
Using Time as Positional Encoding
Many real-world applications involve sequential events—customer transactions, sensor readings, or user interactions—that unfold over time. To preserve this natural progression, Kumo’s model ensures that nodes only receive information from entities with earlier timestamps, preventing data leakage and “time travel” issues while improving generalization across different time periods. To reinforce this temporal structure, we introduce a time embedding that serves as a positional encoding for the Graph Transformer, allowing it to understand when each node appeared in the graph and incorporate the time dimension into its predictions.
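A minimal sketch of what such a time embedding can look like, assuming a functional (sinusoidal) encoding of the gap between each node's timestamp and the prediction seed time, in the spirit of TGAT-style time encoders; the exact formulation inside Kumo may differ:

```python
import torch
import torch.nn as nn

class TimeEncoding(nn.Module):
    """Sinusoidal encoding of the gap between a node's timestamp and the seed time."""

    def __init__(self, dim: int = 128):
        super().__init__()
        # Learnable frequency and phase per output channel (assumed design choice).
        self.freq = nn.Parameter(torch.randn(dim))
        self.phase = nn.Parameter(torch.zeros(dim))

    def forward(self, node_time: torch.Tensor, seed_time: torch.Tensor) -> torch.Tensor:
        # node_time, seed_time: [num_nodes] timestamps in any consistent unit.
        delta = (seed_time - node_time).unsqueeze(-1).float()   # [num_nodes, 1]
        return torch.cos(delta * self.freq + self.phase)        # [num_nodes, dim]

# The result is added to (or concatenated with) node embeddings before attention.
time_pe = TimeEncoding(dim=128)
emb = time_pe(torch.tensor([1_000.0, 2_000.0]), torch.tensor([3_000.0, 3_000.0]))
```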
Graph Structural Encoding
Effectively capturing the structural relationships between nodes is crucial for making accurate predictions on relational data. A well-designed structural encoding should ensure that nodes with similar roles in the graph have similar embeddings while preserving hierarchical relationships and distinguishing distant nodes. By incorporating these structural encodings, Graph Transformers gain a richer understanding of the data, allowing them to make more precise and context-aware predictions. We introduce different methods that achieve these goals, each designed to efficiently encode graph topology while maintaining scalability.
Hop Encoding
Hop embeddings encode the distance between each node and the target entity within the subgraph, helping the transformer understand proximity relationships. This is particularly useful in relational datasets, where entities may be connected through multiple intermediate steps. By explicitly encoding how far each node is from the central entity, the model can better capture hierarchical structures and differentiate between direct interactions and more distant influences.
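A hop embedding can be as simple as a learnable lookup table indexed by hop distance and added to the node features. The sketch below assumes the 2-hop subgraphs used later in our experiments; the module and argument names are illustrative:

```python
import torch
import torch.nn as nn

class HopEncoding(nn.Module):
    """Learnable embedding per hop distance from the seed (target) entity."""

    def __init__(self, max_hops: int = 2, dim: int = 128):
        super().__init__()
        # Hop 0 is the seed entity itself.
        self.emb = nn.Embedding(max_hops + 1, dim)

    def forward(self, x: torch.Tensor, hop: torch.Tensor) -> torch.Tensor:
        # x:   [num_nodes, dim] node embeddings
        # hop: [num_nodes] integer hop distance to the seed entity (0, 1, 2, ...)
        return x + self.emb(hop)

# Example: 4 nodes in a 2-hop subgraph around the seed node (hop 0).
enc = HopEncoding(max_hops=2, dim=128)
out = enc(torch.randn(4, 128), torch.tensor([0, 1, 1, 2]))
```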
Tree Encoding
Trees are a specific type of graph characterized by hierarchical parent-child relationships. Recent research has introduced novel positional encoding methods tailored for tree structures, effectively capturing these hierarchical dependencies within Transformer models [Shiv and Quirk (2019), Peng et al. (2022)]. These tree-based positional encoding techniques can be adapted and extended to general graphs by developing encoding schemes that preserve the inherent parent-child relationships and overall graph topology. By doing so, we can enhance the ability of Transformer models to capture complex dependencies and structures present in various graph-based data representations.
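As a rough sketch of how such a scheme can be adapted to sampled subgraphs, the snippet below follows the spirit of Shiv and Quirk (2019): each node is described by the sequence of child indices along its path from the seed node, turned into stacked one-hot vectors. The depth and branching values are illustrative assumptions, not Kumo's exact parameters:

```python
import torch

def tree_positional_encoding(paths, max_depth=2, branching=15):
    """Simplified tree encoding in the spirit of Shiv and Quirk (2019).

    Each node's path from the seed (root) node is a list of child indices;
    the path becomes stacked one-hot vectors, zero-padded to max_depth.
    The branching factor of 15 mirrors the neighbor-sampling fan-out used
    later in the experiments; both values are illustrative.
    """
    dim = max_depth * branching
    enc = torch.zeros(len(paths), dim)
    for row, path in enumerate(paths):
        for depth, child_idx in enumerate(path[:max_depth]):
            enc[row, depth * branching + child_idx] = 1.0
    return enc

# Seed node has the empty path; its first sampled neighbor is [0];
# that neighbor's third sampled neighbor is [0, 2], and so on.
pe = tree_positional_encoding([[], [0], [1], [0, 2]])
print(pe.shape)  # torch.Size([4, 30])
```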
Relative Encoding
To emphasize local information derived from the graph, we can adopt the method proposed in GRPE (Graph Relative Positional Encoding), which encodes graphs without linearization and considers both node-topology and node-edge interactions. This method introduces two sets of learnable positional encodings: topology encoding, representing the topological relations between nodes, and edge encoding, capturing the types of edges connecting nodes. By incorporating these encodings, the transformer gains a nuanced understanding of each node's position within the subgraph, facilitating more structured and meaningful predictions. However, a potential drawback of this approach is the computational overhead associated with calculating shortest-path distances on the fly, which can be resource-intensive, especially for large-scale graphs.
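The snippet below is a deliberately pared-down, single-head variant of this idea: two learnable bias tables, one indexed by hop (shortest-path) distance and one by edge type, are added to the attention logits. The full GRPE formulation uses query- and key-dependent interactions, so treat this as a sketch of the mechanism rather than a faithful reimplementation:

```python
import torch
import torch.nn as nn

class RelativeBiasAttention(nn.Module):
    """Single-head attention with learnable relative biases (simplified GRPE-style)."""

    def __init__(self, dim=128, max_dist=4, num_edge_types=3):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))
        self.dist_bias = nn.Embedding(max_dist + 1, 1)        # topology encoding
        self.edge_bias = nn.Embedding(num_edge_types + 1, 1)  # edge-type encoding
        self.scale = dim ** -0.5

    def forward(self, x, dist, edge_type):
        # x: [n, dim] node embeddings; dist, edge_type: [n, n] integer matrices.
        logits = (self.q(x) @ self.k(x).T) * self.scale
        logits = logits + self.dist_bias(dist).squeeze(-1) + self.edge_bias(edge_type).squeeze(-1)
        return torch.softmax(logits, dim=-1) @ self.v(x)

n = 6
attn = RelativeBiasAttention()
out = attn(torch.randn(n, 128), torch.randint(0, 5, (n, n)), torch.randint(0, 4, (n, n)))
```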
Random GNN Encoding
In many research settings, global node embeddings are often computed using methods like node2vec. However, as we have discussed, precomputing such embeddings is impractical at Kumo’s scale. Instead, we approximate this idea by assigning each node a random embedding and refining it through a GNN that performs message passing over the original graph structure. This allows nodes to develop positional representations influenced by their neighbors, effectively capturing local topology without requiring expensive pre-computations.
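A minimal sketch of this encoding, assuming a GraphSAGE-style convolution for the message passing (the layer choice and depth are assumptions, not Kumo's exact configuration):

```python
import torch
import torch.nn as nn
from torch_geometric.nn import SAGEConv

class RandomGNNEncoding(nn.Module):
    """Positional encoding from random features refined by message passing."""

    def __init__(self, dim=128, num_layers=2):
        super().__init__()
        self.convs = nn.ModuleList(SAGEConv(dim, dim) for _ in range(num_layers))

    def forward(self, num_nodes, edge_index):
        # Fresh random features per subgraph; no expensive precomputation needed.
        h = torch.randn(num_nodes, self.convs[0].in_channels)
        for conv in self.convs:
            # Message passing lets each random vector absorb its neighborhood,
            # so the result reflects local topology.
            h = conv(h, edge_index).relu()
        return h  # added to node embeddings as a positional signal

edge_index = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 0]])  # tiny 4-node cycle
pe = RandomGNNEncoding()(4, edge_index)
```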

Experiments
To evaluate the effectiveness of Graph Transformers with positional encodings, we conducted experiments comparing their performance against GNNs. By testing both models on relational datasets, we aimed to understand how well Graph Transformers leverage positional information and whether they offer advantages in capturing complex relationships for predictive tasks.
Data. For our experiments, we chose RelBench, a public benchmark designed for predictive tasks over relational databases using graph-based models. RelBench provides a diverse set of databases and tasks across multiple domains, making it an ideal testbed for evaluating the effectiveness of Graph Transformers. By leveraging its structured relational data and well-defined predictive challenges, we ensure that our comparison between Graph Transformers and GNNs is both rigorous and representative of real-world applications.
Setup. To ensure a fair comparison between Graph Transformers and GNNs, we use neighbor sampling with two hops of 15 neighbors and set the node dimension to 128 for all models. For the GNN baseline, we evaluate four different hyper-parameter configurations and report the best performance. For transformers, we experiment with three different hyper-parameter sets and report the best average over two runs. In all transformer setups, we use four layers, eight attention heads, and set the feed-forward network dimension to 512 (four times the node dimension).
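For reference, the transformer setup above corresponds roughly to the following hyper-parameters, written out as a plain Python dict; the key names are illustrative and do not reflect Kumo's configuration API:

```python
# Hyper-parameters from the setup above; key names are illustrative only.
transformer_config = {
    "num_neighbors": [15, 15],  # 2-hop neighbor sampling, 15 neighbors per hop
    "node_dim": 128,            # shared with the GNN baseline
    "num_layers": 4,
    "num_heads": 8,
    "ffn_dim": 512,             # 4 x node_dim
}
```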
Results
Entity Classification
The task involves predicting binary labels for a given entity at a specific seed time. Following RelBench, we use the ROC-AUC metric for evaluation, where higher values indicate better performance.

The results reveal that Graph Transformers tend to outperform the GNN baseline on several tasks, particularly in scenarios where capturing long-range dependencies is beneficial. For instance, on the rel-event dataset, Graph Transformers achieve improvements of up to 4.2% (user-repeat) and 3.2% (user-ignore) in ROC-AUC, and on the rel-f1 dataset, the driver-dnf task shows a 4.5% gain. Similarly, the rel-trial task reports a 3.7% improvement with Graph Transformers. Although the performance differences on some datasets—such as rel-avito and rel-stack—are marginal, these results suggest that Graph Transformers have a competitive edge in many cases.
To ensure a fair comparison, we use 2-hop neighbor sampling for both models, so they operate on the same local neighborhood. Even in this constrained setup, Graph Transformers already match or exceed GNN performance across most datasets and tasks. Unlike GNNs, which require connectivity in the graph for message passing, Graph Transformers can naturally aggregate information from distant or even unconnected nodes. This capability to leverage information beyond connected entities suggests that combining Graph Transformers with diverse graph sampling techniques could further enhance their performance, presenting a promising direction for future research.
Entity Regression
Entity-level regression tasks involve predicting numerical labels of an entity at a given seed time. We use Mean Absolute Error (MAE) as our metric, where lower values indicate better performance.

The results for entity regression show that Graph Transformers perform competitively across diverse tasks, particularly excelling in rel-amazon (user-ltv), rel-event (user-attendance) and rel-trial (site-success), where they improve performance by 5.6%, 2.9% and 3.1%, respectively. While some tasks show marginal differences, the consistency of Graph Transformers across datasets reinforces their ability to effectively model relational structures. Given their flexibility in capturing long-range dependencies, their full potential may be even greater when allowed to leverage more expansive connectivity beyond the 2-hop constraint.
Conclusion
Kumo’s platform already delivers state-of-the-art performance using GNNs for predictive modeling on relational data. Furthermore, our experiments demonstrate that Graph Transformers bring additional advantages, often matching or outperforming GNNs while offering unique strengths—such as the ability to aggregate information beyond local neighborhoods.
A key advantage of Kumo’s implementation of Graph Transformers is the flexible and composable positional encodings, which seamlessly integrate into the model. By incorporating time embeddings, hop embeddings, tree embeddings and more, we ensure that transformers can effectively capture relational structures while preserving key inductive biases from GNNs. The strong results across diverse tasks further validate this approach.
Looking ahead, there are exciting opportunities to push performance even further. Smarter neighborhood sampling strategies, enhanced global context through cross-attention, and more expressive positional encodings all represent promising research directions. Through these innovations, Kumo isn't just following the frontier of relational deep learning—we're defining it, empowering you to extract unprecedented insights from your data!
Experience the Power of Graph Transformers Today—Free!
Ready to see the difference? Kumo makes it easy to get started with Graph Transformers at no cost. Simply sign up for your free trial, connect your data, and watch as your first Graph Transformer model comes to life.

Not sure which architecture is right for your needs? Let our AutoML do the heavy lifting by automatically selecting the optimal model for your dataset and predictive queries. Start transforming your graph data insights today!