
An Introduction to Graph Transformers

How attention mechanisms adapt to graph-structured data, why standard transformers fall short, and the key innovations driving a new class of graph-aware architectures.

Federico Lopez, Matthias Fey, Jure Leskovec
01

Graphs Are Everywhere

Graphs are one of the most general and expressive data structures in computer science. A graph consists of nodes (entities) and edges (relationships between entities). This simple abstraction captures an enormous range of real-world systems.

Social networks model people as nodes and friendships as edges. Molecular graphs represent atoms as nodes and chemical bonds as edges. Financial transaction networks connect accounts through payment flows. Knowledge graphs link concepts through typed relationships. Relational databases, when viewed through their foreign-key structure, form heterogeneous graphs connecting rows across tables.

What makes graphs powerful is that the structure of connections carries information. In a social network, a person's position within a community (bridge node, hub, peripheral member) is as predictive as their profile attributes. In a molecular graph, the arrangement of bonds determines a compound's properties. In a fraud network, tightly connected clusters of accounts transacting with each other reveal suspicious patterns that no individual transaction would expose.

The challenge is that graphs are inherently non-Euclidean. Unlike images (regular grids of pixels) or text (ordered sequences of tokens), graphs have no fixed ordering of nodes, variable neighborhood sizes, and complex topological structure. This makes standard deep learning architectures (CNNs, RNNs, and vanilla transformers) a poor fit without significant adaptation.

02

How Transformers Work: The Attention Mechanism

Before understanding Graph Transformers, it helps to understand the standard transformer architecture that they adapt. The transformer was originally designed for sequential data (text), and its core innovation is the self-attention mechanism: a way for each element in the input to dynamically weigh the importance of every other element.

Self-attention in four steps

1. Linear Projections. Each input token is projected into three vectors: Query (Q), Key (K), and Value (V) via learned weight matrices.

2. Attention Scores. Compute Attention(Q,K,V) = softmax(QK^T / sqrt(d_k))V. The dot product measures how relevant each key is to each query.

3. Multi-Head Attention. Run multiple attention heads in parallel, each learning different relationship patterns. Concatenate and project the results.

4. FFN + Residuals. Apply a position-wise feedforward network with residual connections and layer normalization for training stability.

The Query represents what each token is “looking for.” The Key represents what each token “offers.” The Value is the actual information that gets passed along. The softmax normalization ensures attention weights sum to 1, creating a proper probability distribution over the input.

Multi-head attention is critical because a single attention head can only capture one type of relationship. With multiple heads running in parallel, the model learns to attend to different aspects simultaneously: one head might focus on syntactic structure, another on semantic similarity, a third on positional proximity.
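
The four steps above can be sketched in a few lines. This is a minimal numpy illustration of a single attention head with random stand-ins for the learned weight matrices, not a production implementation; multi-head attention would run several of these in parallel and concatenate the results.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # relevance of each key to each query
    scores -= scores.max(axis=-1, keepdims=True)  # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_model, d_k = 5, 8, 4
X = rng.normal(size=(n, d_model))                 # 5 tokens, 8-dim embeddings

# Learned projection matrices (random stand-ins for illustration)
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)

assert out.shape == (n, d_k)
assert np.allclose(weights.sum(axis=-1), 1.0)     # proper distribution per query
```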

Why transformers excel on sequences

The key advantage over recurrent architectures is that attention operates over all positions simultaneously. There is no sequential bottleneck: token 1 can directly attend to token 500 in a single step, rather than propagating information through 499 intermediate states. This enables parallel computation and captures long-range dependencies directly.

For sequences, positional information is injected through sinusoidal or learned positional encodings that tell the model where each token sits in the sequence. This is straightforward because sequences have a natural linear ordering.
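
As a concrete reference point before the graph case, the classic sinusoidal encoding can be sketched as follows (PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos of the same angle):

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    """Sinusoidal positional encodings for a linear sequence."""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]      # even dimension indices
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dims: sine
    pe[:, 1::2] = np.cos(angles)               # odd dims: cosine
    return pe

pe = sinusoidal_pe(seq_len=50, d_model=16)
assert pe.shape == (50, 16)
assert np.isclose(pe[0, 0], 0.0)  # sin(0) = 0 at position 0
assert np.isclose(pe[0, 1], 1.0)  # cos(0) = 1 at position 0
```

The encoding depends only on the integer position, which is exactly the assumption that breaks down for graphs.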

03

Why Standard Transformers Struggle with Graphs

Standard transformers treat their input as a fully connected set of tokens: every token can directly attend to every other token. This works well for text, where the context window is the entire sentence or document. But when applied directly to graph data, three fundamental problems arise.

1. No notion of graph connectivity

In a standard transformer, the attention mechanism does not know which nodes are connected by edges and which are not. Every node attends to every other node with equal structural opportunity. This means the model must learn graph structure entirely from the data, with no architectural inductive bias. For sparse graphs (where most node pairs are not connected), this wastes enormous capacity learning that most pairs should have low attention.

2. No meaningful positional encoding

Sequential positional encodings (sinusoidal or learned indices) assume a linear ordering. Graphs have no canonical ordering of nodes. You could arbitrarily number the nodes 1 through N, but this numbering carries no structural meaning. Two nodes that are adjacent in the graph might receive positions 1 and 500, while two disconnected nodes might receive positions 1 and 2.

3. No edge feature incorporation

Standard transformers have no mechanism for edge features. In a molecular graph, bond types (single, double, aromatic) are critical. In a knowledge graph, relationship types (is-a, part-of, located-in) carry essential semantics. In a relational database graph, foreign-key types define the nature of connections. The standard transformer architecture has no place to inject this information.

Standard Transformer vs. Graph Transformer: Structural Comparison

Aspect | Standard Transformer | Graph Transformer
Data Structure | Ordered sequences | Arbitrary graphs
Attention Scope | Fully connected (all-to-all) | Topology-aware (structure-guided)
Positional Encoding | Absolute (sinusoidal/learned) | Graph-aware (Laplacian, random walk)
Edge Awareness | No explicit edge modeling | Explicit edge features in attention
Complexity | O(N²) over all tokens | Reducible via sparse/local attention

The computational cost is also a concern. Full self-attention has O(N²) complexity, where N is the number of nodes. For small graphs (molecules with tens of atoms), this is manageable. For large-scale graphs (social networks with millions of nodes, relational databases with billions of rows), quadratic complexity becomes prohibitive.

04

What Are Graph Transformers?

Graph Transformers adapt the transformer architecture to handle graph-structured data. The core idea is to integrate self-attention mechanisms with the graph's topology, so that nodes can attend to information across the graph while respecting (or being informed by) its structural properties.

The architecture retains the transformer's strengths (parallel computation, direct long-range dependencies, multi-head attention) while adding graph-specific inductive biases through three main mechanisms.

1. Graph-Aware Attention. Incorporate graph topology into the attention mechanism. Nodes attend preferentially to neighbors, or attention weights are biased by structural distance.

2. Positional Encodings. Replace sequential positions with graph-aware encodings (Laplacian eigenvectors, random walk probabilities) that capture a node's structural role.

3. Edge Feature Integration. Incorporate edge features (bond types, relationship types, foreign-key semantics) directly into the attention computation.

How attention becomes graph-aware

The simplest approach restricts attention to a node's local neighborhood: each node only attends to nodes it is directly connected to, rather than all nodes in the graph. This introduces a strong locality bias and reduces computation from O(N²) to O(E), where E is the number of edges.

A more flexible approach uses attention bias: the model computes full attention but adds a bias term based on the graph distance between node pairs. Connected nodes receive a positive bias, distant nodes receive a negative bias. This lets the model balance local structure with global context.

Some architectures combine both: local attention within neighborhoods plus global attention with selected anchor nodes, giving each node access to both nearby detail and distant context.
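
Both mechanisms can be sketched on a toy path graph. This is a minimal numpy illustration (uniform raw scores, a hand-written distance matrix, and a hypothetical per-hop penalty), not any particular published architecture:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def graph_attention_logits(scores, adj, dist, bias_per_hop=-1.0):
    """Two ways to make raw attention scores graph-aware:
    (a) hard mask: only direct neighbors (and self) are attendable;
    (b) soft bias: subtract a penalty proportional to graph distance."""
    mask = adj + np.eye(adj.shape[0])               # neighbors plus self
    masked = np.where(mask > 0, scores, -np.inf)    # (a) local attention
    biased = scores + bias_per_hop * dist           # (b) distance-biased attention
    return masked, biased

# Path graph 0-1-2-3 and its shortest-path distances
adj = np.array([[0,1,0,0],[1,0,1,0],[0,1,0,1],[0,0,1,0]], dtype=float)
dist = np.array([[0,1,2,3],[1,0,1,2],[2,1,0,1],[3,2,1,0]], dtype=float)
scores = np.zeros((4, 4))                           # uniform raw scores for illustration

masked, biased = graph_attention_logits(scores, adj, dist)
w_local = softmax(masked)
assert w_local[0, 2] == 0.0             # node 0 cannot attend to non-neighbor 2
w_global = softmax(biased)
assert w_global[0, 1] > w_global[0, 3]  # nearer nodes get more attention
```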

Edge features in attention

Graph Transformers can incorporate edge features directly into the attention weight computation. Instead of computing attention purely from node representations (Q and K), the model also conditions on the features of the edge connecting the query and key nodes. This is essential for heterogeneous graphs where the type and properties of relationships vary. In a molecular graph, whether two atoms share a single bond or a double bond changes how information should flow. In a relational database graph, a “purchased” edge carries different semantics than a “reviewed” edge.
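
One simple way to condition attention on edges is to add a learned per-edge-type bias to each logit. The sketch below assumes a hypothetical integer edge-type matrix and a bias table that a real model would learn; it is an illustration of the idea, not a specific published architecture:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def edge_conditioned_attention(Q, K, V, edge_type, type_bias):
    """logit(i, j) = q_i . k_j / sqrt(d_k) + b[edge_type(i, j)]."""
    d_k = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d_k)
    logits += type_bias[edge_type]        # (n, n) lookup of per-type biases
    return softmax(logits) @ V

rng = np.random.default_rng(1)
n, d = 4, 8
Q = K = V = rng.normal(size=(n, d))
# edge_type[i, j]: 0 = no edge, 1 = "purchased", 2 = "reviewed" (hypothetical types)
edge_type = rng.integers(0, 3, size=(n, n))
type_bias = np.array([-2.0, 1.0, 0.5])    # learned in a real model

out = edge_conditioned_attention(Q, K, V, edge_type, type_bias)
assert out.shape == (n, d)
```

The "no edge" bias is strongly negative, so unconnected pairs are down-weighted without being hard-masked.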

05

Positional and Structural Encodings

One of the most active research areas in Graph Transformers is how to encode a node's position and structural context. Unlike sequences, graphs have no inherent ordering. Two different encodings answer two distinct questions.

Positional encodings: “Where am I in the graph?”

Positional encodings capture a node's location relative to other nodes. They answer questions like: how far am I from other nodes? Am I in the center of the graph or on the periphery? Am I in a dense cluster or a bridge between clusters?

Three categories of positional encodings exist.

  • Local (node-level). Distance to cluster centroid, reachability via m-step random walks. These capture a node's immediate context.
  • Global (node-level). Eigenvectors of the graph Laplacian or adjacency matrix, distance to the graph centroid, connected component identifiers. The Laplacian eigenvectors are particularly important: they provide a spectral decomposition of the graph that captures its global geometry, analogous to how Fourier basis functions capture the structure of signals.
  • Relative (edge-level). Pairwise distances computed via heat kernels, random walk metrics, or gradients of eigenvectors. These encode how two specific nodes relate to each other structurally.
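
The global Laplacian-eigenvector encoding from the list above can be computed directly. This is a minimal numpy sketch on a 4-cycle, using the normalized Laplacian L = I − D^(−1/2) A D^(−1/2):

```python
import numpy as np

def laplacian_pe(adj, k):
    """Global positional encoding: the k eigenvectors of the normalized graph
    Laplacian with the smallest nonzero eigenvalues."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    L = np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L)  # symmetric matrix, ascending eigenvalues
    # Skip the trivial constant eigenvector; note the sign of each eigenvector
    # is arbitrary (+v and -v are both valid), as discussed below.
    return eigvecs[:, 1:k + 1]

# 4-cycle: 0-1-2-3-0
adj = np.array([[0,1,0,1],[1,0,1,0],[0,1,0,1],[1,0,1,0]], dtype=float)
pe = laplacian_pe(adj, k=2)
assert pe.shape == (4, 2)
assert np.allclose(np.linalg.norm(pe, axis=0), 1.0)  # orthonormal eigenvectors
```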

Structural encodings: “What does my neighborhood look like?”

Structural encodings capture the local topology around a node, independent of its global position. Two nodes in different parts of the graph can have similar structural roles (both are hubs, both are leaf nodes, both sit at the center of triangles) even though their positions differ.

  • Local (node-level). Node degree, diagonal entries of m-step random walk matrices (self-return probabilities), counts of local substructures (triangles, rings, cliques). A node participating in many triangles is part of a tightly connected community.
  • Global (graph-level). Eigenvalues of the adjacency or Laplacian matrix, graph diameter, number of connected components, average degree. These summarize the overall structure of the graph.
  • Relative (edge-level). Gradients of local structural encodings, shared substructure indicators between node pairs. These capture how two nodes' neighborhoods overlap or interact.
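
The random-walk self-return probabilities mentioned above make a good concrete example. A minimal numpy sketch, using a triangle with a pendant node:

```python
import numpy as np

def rwse(adj, m):
    """Random-walk structural encoding: for each node, the probability of
    returning to itself after 1..m steps of a uniform random walk."""
    P = adj / adj.sum(axis=1, keepdims=True)   # row-stochastic transition matrix
    feats = []
    Pk = np.eye(len(adj))
    for _ in range(m):
        Pk = Pk @ P
        feats.append(np.diag(Pk))              # self-return probabilities
    return np.stack(feats, axis=1)             # (num_nodes, m)

# Triangle 0-1-2 plus a pendant node 3 attached to node 0
adj = np.array([[0,1,1,1],[1,0,1,0],[1,1,0,0],[1,0,0,0]], dtype=float)
enc = rwse(adj, m=3)
assert enc.shape == (4, 3)
assert np.all(enc[:, 0] == 0)        # no self-loops: cannot return in one step
assert np.isclose(enc[3, 1], 1/3)    # pendant node: 3 -> 0 -> back with prob 1/3
```

Nodes inside the triangle return to themselves more often than the pendant node, so the encoding distinguishes their structural roles without any global position information.
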

Positional vs. Structural Encodings

Property | Positional Encoding | Structural Encoding
Question answered | Where am I in the graph? | What does my neighborhood look like?
Node-level examples | Laplacian eigenvectors, random walk distances | Node degree, triangle count, self-return probability
Edge-level examples | Heat kernel distances, eigenvector gradients | Shared substructures, neighborhood overlap
Graph-level examples | Distance to centroid, component ID | Eigenvalue spectrum, diameter, avg. degree
Invariance | Position-dependent | Position-independent (role-based)

Designing effective positional encodings for graphs remains a challenging and active area of research. Unlike sequences, where position 1 always means “first token,” graph positions are not unique: the Laplacian eigenvectors have sign ambiguity (both v and −v are valid eigenvectors), and there is no canonical way to break this symmetry. Recent approaches address this by using sign-invariant neural networks or learning to align eigenvectors across graphs.

06

Graph Transformers vs. GNNs

Graph Neural Networks (GNNs) were the dominant architecture for learning on graphs before Graph Transformers emerged. Both operate on graph-structured data, but they differ fundamentally in how information flows between nodes.

Message passing in GNNs

GNNs use a message passing framework. In each layer, every node aggregates information from its direct neighbors, updates its representation, and passes this updated representation to the next layer. To capture information from nodes k hops away, you need k layers of message passing.

This sequential propagation creates two well-documented problems.

  • Over-smoothing. As the number of layers increases, node representations converge to similar values. After many rounds of neighborhood aggregation, every node has averaged information from a large fraction of the graph, losing its individual characteristics. In practice, most GNNs are limited to 2-5 layers.
  • Over-squashing. Information from distant nodes must pass through intermediate nodes, creating bottlenecks. When a single intermediate node must relay information from many upstream nodes to many downstream nodes, the fixed-size representation cannot preserve all the information. Critical signals get compressed and lost.
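
The k-layers-for-k-hops constraint is easy to see concretely. A minimal numpy sketch of mean-aggregation message passing (the transform/update steps are omitted for clarity) on a path graph:

```python
import numpy as np

def mean_message_passing(H, adj):
    """One GNN layer: each node's new state is the mean of its neighbors' states."""
    deg = adj.sum(axis=1, keepdims=True)
    return (adj @ H) / deg

# Path graph 0-1-2-3; a "signal" starts at node 0 only
adj = np.array([[0,1,0,0],[1,0,1,0],[0,1,0,1],[0,0,1,0]], dtype=float)
H = np.array([[1.0],[0.0],[0.0],[0.0]])

h1 = mean_message_passing(H, adj)
assert h1[3, 0] == 0.0   # after 1 layer, node 3 has seen nothing from node 0
h3 = mean_message_passing(mean_message_passing(h1, adj), adj)
assert h3[3, 0] > 0.0    # only after 3 layers does the signal reach 3 hops away
```

A transformer layer with global attention would let node 3 see node 0's signal in a single step, which is exactly the architectural difference described next.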

Self-attention in Graph Transformers

Graph Transformers replace message passing with self-attention. Each node can directly attend to any other node in the graph (or within its attention scope), without requiring information to propagate through intermediate nodes. This architectural difference has direct consequences.

GNNs (Message Passing)

Local propagation

  • +Strong locality bias matches graph structure
  • +Linear complexity in edges O(E)
  • +Well-suited for homophilic graphs
  • +Extensive theoretical analysis
  • −Over-smoothing limits depth
  • −Over-squashing loses distant signals
  • −Long-range dependencies require many layers
  • −Fixed aggregation scheme per layer

Graph Transformers

Global attention

  • +Direct long-range dependencies
  • +Less susceptible to over-smoothing
  • +Mitigates over-squashing via attention
  • +Flexible, learned aggregation
  • +Natural edge feature integration
  • −O(N²) complexity without sparse attention
  • −Loss of locality bias without careful design
  • −Requires graph-aware positional encodings
  • −More training data needed

In practice, the two approaches are complementary. Many state-of-the-art architectures combine GNN-style local message passing with transformer-style global attention. The GNN layers capture fine-grained local structure efficiently, while the transformer layers capture long-range dependencies and global context.

Handling heterogeneous graphs

Real-world graphs are often heterogeneous: they contain multiple node types and multiple edge types. Relational databases are a canonical example, with different tables representing different entity types and foreign keys representing typed relationships. Both GNNs (RGCN, HAN, HGT) and Graph Transformers (HGT, Graphormer, HEAT) have been extended to handle heterogeneous graphs, learning type-specific attention patterns and aggregation schemes.

07

Applications: From Molecules to Relational Databases

Graph Transformers have demonstrated strong results across a wide range of domains where data is naturally graph-structured.

Molecular property prediction and drug discovery

Molecules are small graphs where atoms are nodes and chemical bonds are edges. Predicting molecular properties (toxicity, solubility, binding affinity) requires understanding both local chemical motifs and global molecular shape. Graph Transformers excel here because they can capture long-range atomic interactions that message-passing GNNs miss when limited to a few layers. Protein folding, where the 3D structure of a protein depends on interactions between distant amino acids in the sequence, is another area where direct long-range attention provides clear benefits.

Fraud detection in financial networks

Financial transaction graphs connect accounts, merchants, and payment events. Fraudulent activity often involves coordinated patterns across multiple accounts: rings of accounts transacting with each other, rapid sequences of transfers, or accounts that serve as bridges between suspicious clusters. These patterns are inherently graph-structural and span multiple hops. Graph Transformers can detect these patterns by attending to distant but structurally relevant nodes, without the information loss that GNN over-squashing introduces.

Social network analysis and recommendations

Social networks are large, dynamic graphs where influence propagates through community structures. Recommending content or connections requires understanding both a user's local network (who they interact with directly) and their position in the broader community (which clusters they bridge, what communities they overlap with). Graph Transformers provide a natural framework for combining these local and global signals.

Knowledge graph reasoning

Knowledge graphs represent facts as (entity, relation, entity) triples. Answering complex queries requires multi-hop reasoning: traversing chains of relations to infer new facts. Graph Transformers can attend across multiple hops in a single layer, making them well-suited for knowledge graph completion and question answering.

Relational deep learning

Perhaps the most impactful application domain is relational data: the interconnected tables stored in enterprise databases. When viewed as a graph (rows as nodes, foreign keys as edges), a relational database becomes a large, heterogeneous, temporal graph. Learning predictive models directly on this graph structure (rather than flattening tables into feature vectors) preserves multi-hop relationships, temporal patterns, and cross-table context.

The Relational Deep Learning (RDL) framework formalized this approach, and benchmarks on RelBench have shown consistent improvements over manual feature engineering for tasks like churn prediction, fraud detection, and recommendation across real-world enterprise datasets.

Graph Transformer Application Domains

Domain | Graph Structure | Why Graph Transformers Help
Molecular / Drug Discovery | Atoms = nodes, bonds = edges | Long-range atomic interactions beyond local neighborhoods
Financial Fraud | Accounts + transactions as graph | Detecting multi-hop coordinated fraud patterns
Social Networks | Users = nodes, interactions = edges | Combining local network and global community signals
Knowledge Graphs | (entity, relation, entity) triples | Multi-hop reasoning in a single attention layer
Relational Databases | Rows = nodes, foreign keys = edges | Preserving cross-table relationships and temporal patterns

08

Scaling Challenges and the Road to Relational Graph Transformers

Graph Transformers are powerful, but applying them to large-scale, real-world graphs requires solving the quadratic complexity problem. Full self-attention over a graph with N nodes requires O(N²) computation. For a molecular graph with 50 atoms this is fine. For a social network with 10 million users or a relational database with billions of rows, it is infeasible.

Sparse attention mechanisms

Several strategies reduce the computational cost while preserving the benefits of attention.

  • Local attention. Restrict each node to attending only to its k-hop neighborhood. This reduces complexity to O(E) and preserves the locality bias that makes GNNs effective.
  • Global anchor nodes. Select a small set of representative nodes and let all nodes attend to them. This provides global context without full O(N²) cost.
  • Low-rank approximations. Methods like Linformer and Performer approximate the full attention matrix with low-rank decompositions, reducing complexity to O(N).
  • Routing transformers. Learn sparse attention patterns dynamically, routing each node's attention to the most relevant subset of other nodes.
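
The anchor-node strategy is the easiest to sketch. In this hedged numpy illustration, anchors are chosen at random (a real system might pick high-degree hubs or learn the selection); each node attends to k anchors instead of all N nodes, giving an (N, k) score matrix rather than (N, N):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def anchor_attention(Q, K, V, anchors):
    """Global context at O(N * k) cost: every node attends only to k anchors."""
    K_a, V_a = K[anchors], V[anchors]            # (k, d) slices of keys/values
    logits = Q @ K_a.T / np.sqrt(Q.shape[-1])    # (N, k) instead of (N, N)
    return softmax(logits) @ V_a

rng = np.random.default_rng(2)
N, d, k = 1000, 16, 8
Q = K = V = rng.normal(size=(N, d))
anchors = rng.choice(N, size=k, replace=False)   # random pick, for illustration only

out = anchor_attention(Q, K, V, anchors)
assert out.shape == (N, d)                       # full output, 8 keys per query
```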

Subgraph sampling

For very large graphs, computing attention over the full graph is impractical even with sparse methods. Subgraph sampling trains the model on manageable subsets of the graph.

  • Node sampling: randomly select a subset of nodes and their neighborhoods
  • Layer-wise sampling: sample neighbors independently at each layer (as in GraphSAGE)
  • Graph partitioning: divide the graph into manageable partitions and process each partition with cross-partition communication
  • Cluster-based training: group related nodes into clusters and train on cluster-level subgraphs
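
A GraphSAGE-style neighbor sampler can be sketched in plain Python. This is a simplified illustration on a hypothetical toy adjacency list, keeping at most `fanout` randomly chosen neighbors per node at each hop:

```python
import random

def sample_neighborhood(adj_list, seed_node, fanout, num_hops, rng):
    """Layer-wise neighbor sampling: expand from a seed node, keeping at most
    `fanout` randomly chosen neighbors per frontier node at each hop."""
    frontier, sampled = {seed_node}, {seed_node}
    for _ in range(num_hops):
        next_frontier = set()
        for node in frontier:
            neighbors = adj_list.get(node, [])
            picked = rng.sample(neighbors, min(fanout, len(neighbors)))
            next_frontier.update(picked)
        sampled |= next_frontier
        frontier = next_frontier
    return sampled

# Tiny toy graph as an adjacency list
adj_list = {0: [1, 2, 3, 4], 1: [0, 5], 2: [0, 6], 3: [0], 4: [0], 5: [1], 6: [2]}
rng = random.Random(42)
sub = sample_neighborhood(adj_list, seed_node=0, fanout=2, num_hops=2, rng=rng)

assert 0 in sub
assert len(sub) <= 1 + 2 + 4   # at most fanout^hop new nodes per hop
```

The model then trains on the induced subgraph, so memory per step is bounded by the fanout rather than the full graph size.
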

Scaling Strategies: Trade-offs

Strategy | Complexity | Pros | Cons
Full attention | O(N²) | Complete global context | Infeasible for large graphs
Local attention | O(E) | Efficient, strong locality | Limited long-range access
Anchor nodes | O(N · k) | Global context at reduced cost | Anchor selection matters
Low-rank approx. | O(N) | Linear scaling | Approximation quality varies
Subgraph sampling | O(batch) | Scales to any graph size | Variance from sampling noise

The road to relational graph transformers

The convergence of Graph Transformers with relational data represents a particularly promising direction. Enterprise databases are large, heterogeneous, temporal graphs. They contain multiple node types (customers, orders, products), multiple edge types (purchased, reviewed, returned), rich node and edge features, and temporal dynamics that evolve over time.

Building effective Graph Transformers for relational data requires combining several of the innovations discussed in this guide: graph-aware attention that handles heterogeneous types, positional and structural encodings that capture the semantics of relational schemas, edge feature integration for typed foreign-key relationships, and scalable architectures that handle enterprise-scale data volumes.

This is the path that leads from Graph Transformers as a research architecture to Relational Graph Transformers as the backbone of foundation models for structured data. The KumoRFM research guide covers the next step in this progression: how a relational graph transformer, combined with schema-agnostic encoding and in-context learning, becomes the first foundation model for relational data.

Standard Transformers

Sequences only

  • +Proven on text, images, code
  • +Parallel computation
  • +Strong long-range modeling
  • −No graph structure awareness
  • −No edge features
  • −Positional encodings assume ordering

GNNs

Local graph learning

  • +Native graph structure handling
  • +Efficient O(E) complexity
  • +Strong locality inductive bias
  • −Over-smoothing and over-squashing
  • −Limited long-range dependencies
  • −Fixed aggregation per layer

Graph Transformers

Best of both worlds

  • +Graph-aware attention
  • +Direct long-range dependencies
  • +Edge feature integration
  • +Flexible positional encodings
  • +Foundation for relational models
  • −O(N²) complexity without sparse attention
  • −Requires careful positional encoding design
  • −Loss of locality bias without graph-aware masks
