GNNs and transformers represent two fundamentally different approaches to processing structured data. GNNs know the graph structure and exploit it: each node aggregates information only from its direct neighbors. Transformers do not assume any structure: each node attends to every other node, learning which connections matter from data. Graph transformers combine both approaches.
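The contrast can be made concrete in a few lines. The sketch below (numpy only; the 4-node path graph and feature width are illustrative, not from any particular model) runs one GNN-style neighbor average and one self-attention update over the same features:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 4, 8
X = rng.normal(size=(N, d))            # node features

# Path graph 0-1-2-3 as an adjacency matrix
A = np.zeros((N, N))
for i, j in [(0, 1), (1, 2), (2, 3)]:
    A[i, j] = A[j, i] = 1

# One GNN layer: each node averages its direct neighbors' features.
# Only edges present in A participate -- sparse, structure-aware.
deg = A.sum(axis=1, keepdims=True)
H_gnn = (A @ X) / deg

# One self-attention layer: every node scores every other node.
# The graph is never consulted -- dense, structure-agnostic.
scores = X @ X.T / np.sqrt(d)
attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
H_attn = attn @ X
```

After one GNN layer, node 0's representation depends only on node 1 (its sole neighbor); after one attention layer, row 0 of `attn` puts nonzero weight on all four nodes.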
Architecture comparison
GNN: local, sparse, structure-aware
- Receptive field: k-hop neighborhood (k = number of layers)
- Computation: sparse, scales with number of edges O(|E|)
- Inductive bias: nearby nodes are more relevant than distant nodes
- Depth limit: 2-3 layers before over-smoothing degrades performance
- New nodes: handled naturally (connect them to the graph and aggregate as usual)
Transformer: global, dense, structure-agnostic
- Receptive field: all nodes in one layer (global attention)
- Computation: dense, scales quadratically O(N^2)
- Inductive bias: none; attention patterns learned from data
- Depth limit: none (can stack 12-96 layers without over-smoothing)
- New nodes: need positional encodings before attention can use them; without structural information, the model sees only an unordered set of nodes
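The receptive-field difference in the two lists above can be checked directly: after k message-passing layers, a node's representation depends only on its k-hop neighborhood (the sparsity pattern of (A+I)^k), while one attention layer is dense. A small numpy sketch (the 6-node path graph is illustrative):

```python
import numpy as np

N = 6
A = np.zeros((N, N), dtype=int)
for i in range(N - 1):                 # path graph 0-1-2-3-4-5
    A[i, i + 1] = A[i + 1, i] = 1

M = A + np.eye(N, dtype=int)           # self-loop: a node keeps its own state

def receptive_field(k):
    """Return the set of nodes that can influence node 0 after k GNN layers."""
    reach = np.linalg.matrix_power(M, k)
    return set(np.nonzero(reach[0])[0].tolist())

print(receptive_field(1))   # {0, 1}: 1-hop neighborhood
print(receptive_field(2))   # {0, 1, 2}: 2-hop neighborhood
print(receptive_field(5))   # all 6 nodes: 5 layers to span the path

# One dense attention layer connects every pair of nodes immediately.
attn_reach = np.ones((N, N), dtype=bool)
print(attn_reach[0].sum())  # 6
```

Spanning the 6-node path takes five GNN layers, well past the 2-3 layer over-smoothing limit noted above, but a single attention layer.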
When GNNs win
GNNs outperform transformers when:
- Graph structure is meaningful: molecular bonds determine chemical properties. Social connections determine influence. Foreign keys define relational semantics. When structure carries signal, exploiting it is better than learning it from scratch.
- Local patterns dominate: 2-3 hop patterns capture most of the prediction signal. Fraud rings (2-hop), functional groups in molecules (3-hop), customer purchase patterns (2-hop).
- Graphs are large: millions of nodes make global attention infeasible. A social network with 100M nodes requires neighbor sampling; full attention would mean on the order of 10^16 pairwise scores per layer.
- Data is limited: GNNs need less training data because the graph structure provides a strong prior. Transformers need more data to learn the relevant attention patterns.
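The neighbor sampling mentioned above can be sketched with a fixed fan-out per hop, in the style of GraphSAGE-like systems. This is a minimal stdlib-only illustration; the fan-out values, toy graph, and function name are made up for the example:

```python
import random

def sample_neighborhood(adj, seed_node, fanouts, rng):
    """Sample a multi-hop neighborhood with a fixed fan-out per hop.

    adj: dict mapping node -> list of neighbors.
    fanouts: e.g. [10, 5] samples at most 10 neighbors at hop 1, then
    at most 5 neighbors of each of those at hop 2. The cost is bounded
    by the product of the fan-outs, independent of graph size.
    """
    frontier, visited = [seed_node], {seed_node}
    for fanout in fanouts:
        next_frontier = []
        for node in frontier:
            nbrs = adj.get(node, [])
            k = min(fanout, len(nbrs))
            next_frontier.extend(rng.sample(nbrs, k))
        frontier = next_frontier
        visited.update(frontier)
    return visited

# Toy graph: hub node 0 connected to 100 leaf nodes
adj = {0: list(range(1, 101))}
for i in range(1, 101):
    adj[i] = [0]

rng = random.Random(0)
sub = sample_neighborhood(adj, 0, fanouts=[10, 5], rng=rng)
print(len(sub))   # bounded by 1 + 10 + 50, regardless of node 0's degree
```

The same bound holds whether node 0 has 100 neighbors or 100 million, which is what makes GNN training tractable on large sparse graphs.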
When transformers win
Transformers outperform GNNs when:
- Long-range dependencies matter: if nodes 5-10 hops apart need to interact, a GNN needs 5-10 layers, deep enough to trigger over-smoothing. A transformer connects them in a single attention layer.
- Graph structure is noisy: if the given edges are incomplete or unreliable, GNNs propagate along wrong connections. Transformers can learn to ignore noisy edges.
- Graphs are small: for graphs with hundreds to low thousands of nodes, global attention is computationally feasible and captures richer interactions.
Graph transformers: the best of both
Graph transformers combine structural inductive bias with global attention:
- Structural positional encodings: Laplacian eigenvectors, random walk features, or degree encodings give each node a position-aware representation based on graph structure.
- Structure-biased attention: attention scores are modified by graph distance or edge features. Neighbors get higher base attention, but the model can still attend to distant nodes.
- Local + global layers: GPS alternates MPNN layers (local structure) with transformer layers (global attention). Each type captures different patterns.
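The first two components can be sketched with numpy. The bias scale (-1.0 per hop) and the number of eigenvectors are illustrative choices for the example, not values from any particular paper; in a real graph transformer the bias term would be learned:

```python
import numpy as np

# 5-node path graph
N = 5
A = np.zeros((N, N))
for i in range(N - 1):
    A[i, i + 1] = A[i + 1, i] = 1

# 1) Structural positional encodings: eigenvectors of the graph
# Laplacian L = D - A. Low-frequency eigenvectors give each node a
# structure-aware coordinate (nearby nodes get similar values).
D = np.diag(A.sum(axis=1))
L = D - A
eigvals, eigvecs = np.linalg.eigh(L)
pos_enc = eigvecs[:, 1:3]              # skip the constant eigenvector

# 2) Structure-biased attention: shortest-path distances in hops,
# computed from powers of (A + I).
dist = np.full((N, N), np.inf)
np.fill_diagonal(dist, 0)
reach = np.eye(N)
for k in range(1, N):
    reach = reach @ (A + np.eye(N))
    newly = (reach > 0) & np.isinf(dist)
    dist[newly] = k

# Dense scores plus a distance-decaying bias: neighbors get a higher
# base score, but distant nodes remain reachable in one layer.
rng = np.random.default_rng(0)
X = rng.normal(size=(N, 4))
scores = X @ X.T / 2.0 - 1.0 * dist
attn = np.exp(scores)
attn /= attn.sum(axis=1, keepdims=True)
```

Every entry of `attn` stays strictly positive, so the layer keeps the transformer's global receptive field while the bias encodes the graph's locality prior.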
On molecular benchmarks (ZINC, PCQM4Mv2), graph transformers outperform both pure GNNs and pure transformers. On enterprise relational data, KumoRFM's Relational Graph Transformer achieves 81.14 AUROC on RelBench, the best published result.
Practical guidance
- Large sparse graphs (100K+ nodes): start with GNNs (GraphSAGE, GAT). Transformers are too expensive.
- Small graphs (hundreds of nodes, e.g., molecules): try graph transformers (GPS, Graphormer). Best of both worlds.
- Enterprise relational data: use a relational graph transformer (KumoRFM) that combines heterogeneous type handling with transformer attention.
- Unknown structure: start with a transformer that can learn attention patterns. Add graph structure later if it helps.