
GNN vs Transformer: Graph Neural Networks vs Standard Transformers Compared

GNNs exploit local graph structure through sparse neighbor aggregation. Transformers learn global attention patterns through dense all-to-all computation. Graph transformers combine both, and they are the architecture powering the next generation of graph ML.

PyTorch Geometric

TL;DR

  • GNNs aggregate from local graph neighbors (1 hop per layer, sparse computation). Transformers attend to all nodes simultaneously (global attention, dense computation). The tradeoff: structural inductive bias vs computational flexibility.
  • GNNs excel on large sparse graphs with meaningful structure (molecular bonds, social networks, database foreign keys). Transformers excel on small graphs with long-range dependencies or noisy structure.
  • GNNs hit a depth ceiling: over-smoothing limits practical depth to 2-3 layers, restricting the receptive field. Transformers have no depth ceiling but scale quadratically with graph size.
  • Graph transformers combine both: structural positional encodings from the graph plus global attention. GPS (General Powerful Scalable) and Graphormer achieve state-of-the-art results by using graph structure to bias transformer attention.
  • KumoRFM uses a Relational Graph Transformer: transformer attention biased by relational graph structure, combining the expressiveness of transformers with the structural inductive bias of GNNs.

GNNs and transformers represent two fundamentally different approaches to processing structured data. GNNs know the graph structure and exploit it: each node aggregates information only from its direct neighbors. Transformers do not assume any structure: each node attends to every other node, learning which connections matter from data. Graph transformers combine both approaches.

Architecture comparison

GNN: local, sparse, structure-aware

  • Receptive field: k-hop neighborhood (k = number of layers)
  • Computation: sparse, scales with number of edges O(|E|)
  • Inductive bias: nearby nodes are more relevant than distant nodes
  • Depth limit: 2-3 layers before over-smoothing degrades performance
  • New nodes: handled naturally (just connect them to the graph)
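
The locality of a GNN layer can be sketched in a few lines of plain Python (a toy 4-node path graph, scalar features, and mean aggregation are illustrative assumptions, not the PyG API):

```python
# Minimal sketch of one GNN layer: mean aggregation over direct neighbors.
# One layer touches each edge a constant number of times: O(|E|) work.

edges = [(0, 1), (1, 2), (2, 3)]          # undirected path 0-1-2-3
neighbors = {i: [] for i in range(4)}
for u, v in edges:
    neighbors[u].append(v)
    neighbors[v].append(u)

def gnn_layer(x):
    # Each node averages its own feature with its direct neighbors':
    # information moves exactly one hop per layer.
    return [sum([x[i]] + [x[j] for j in neighbors[i]]) / (1 + len(neighbors[i]))
            for i in range(len(x))]

x = [1.0, 0.0, 0.0, 0.0]                  # signal starts at node 0
h1 = gnn_layer(x)                         # after 1 layer: reaches node 1
h2 = gnn_layer(h1)                        # after 2 layers: reaches node 2
print(h1[2] == 0.0, h2[2] > 0.0)          # → True True
```

Node 2 sees nothing of node 0's signal after one layer and only sees it after two: the receptive field is exactly the k-hop neighborhood.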

Transformer: global, dense, structure-agnostic

  • Receptive field: all nodes in one layer (global attention)
  • Computation: dense, scales quadratically O(N^2)
  • Inductive bias: none; attention patterns learned from data
  • Depth limit: none (can stack 12-96 layers without over-smoothing)
  • New nodes: require positional encodings before they can be processed
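
By contrast, a single dense attention layer reaches every node at once. A minimal sketch, assuming scalar features and no learned projection weights (both simplifications for illustration):

```python
# Minimal sketch of one global attention layer: every node scores every
# other node, so one layer costs O(N^2) regardless of graph structure.
import math

def attention_layer(x):
    out = []
    for q in x:                            # every node attends to...
        scores = [q * k for k in x]        # ...every other node: N^2 scores
        m = max(scores)
        w = [math.exp(s - m) for s in scores]   # numerically stable softmax
        z = sum(w)
        out.append(sum(wi * v for wi, v in zip(w, x)) / z)
    return out

x = [1.0, 0.0, 0.0, 0.0]                   # same signal as the GNN sketch
h = attention_layer(x)
# Node 3 receives node 0's signal after a single layer,
# regardless of graph distance.
print(h[3] > 0.0)                          # → True
```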

When GNNs win

GNNs outperform transformers when:

  • Graph structure is meaningful: molecular bonds determine chemical properties. Social connections determine influence. Foreign keys define relational semantics. When structure carries signal, exploiting it is better than learning it from scratch.
  • Local patterns dominate: 2-3 hop patterns capture most of the prediction signal. Fraud rings (2-hop), functional groups in molecules (3-hop), customer purchase patterns (2-hop).
  • Graphs are large: millions of nodes make global attention infeasible. A social network with 100M nodes requires neighbor sampling; global attention is impossible.
  • Data is limited: GNNs need less training data because the graph structure provides a strong prior. Transformers need more data to learn the relevant attention patterns.
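
Neighbor sampling, the standard technique for making GNNs tractable on huge graphs, can be sketched as follows (the adjacency dict and fanout values are illustrative assumptions; in practice PyG's loaders do this over mini-batches):

```python
# Minimal sketch of GraphSAGE-style neighbor sampling: instead of touching
# every neighbor, each layer samples at most `fanout` neighbors per node,
# capping the cost per seed node independently of graph size.
import random

random.seed(0)

def sample_neighborhood(neighbors, seed_node, fanouts):
    """Return the nodes visited at each hop for one seed node."""
    frontier = [seed_node]
    visited = [frontier]
    for fanout in fanouts:                 # one fanout entry per GNN layer
        nxt = []
        for node in frontier:
            nbrs = neighbors[node]
            k = min(fanout, len(nbrs))
            nxt.extend(random.sample(nbrs, k))
        frontier = nxt
        visited.append(frontier)
    return visited

# Hub node 0 has 6 neighbors; sampling caps the work at fanout=2 per hop.
neighbors = {0: [1, 2, 3, 4, 5, 6], **{i: [0] for i in range(1, 7)}}
hops = sample_neighborhood(neighbors, 0, fanouts=[2, 2])
print(len(hops[1]))                        # → 2 (not 6)
```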

When transformers win

Transformers outperform GNNs when:

  • Long-range dependencies matter: if nodes 5-10 hops apart need to interact, GNNs would need 5-10 layers, deep into over-smoothing territory. Transformers connect them in one layer.
  • Graph structure is noisy: if the given edges are incomplete or unreliable, GNNs propagate along wrong connections. Transformers can learn to ignore noisy edges.
  • Graphs are small: for graphs with hundreds to low thousands of nodes, global attention is computationally feasible and captures richer interactions.
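
A quick back-of-envelope check shows why graph size decides feasibility (assuming 4 bytes per attention score, a simplification that ignores activations and multiple heads):

```python
# Memory needed just to store the N x N attention score matrix.
def attention_scores_gib(n):
    return n * n * 4 / 2**30               # 4-byte floats, in GiB

# A 2,000-node molecule-scale graph: trivially cheap.
print(round(attention_scores_gib(2_000), 3))        # → 0.015
# A 100M-node social graph: tens of millions of GiB, per layer.
print(attention_scores_gib(100_000_000) > 1e7)      # → True
```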

Graph transformers: the best of both

Graph transformers combine structural inductive bias with global attention:

  • Structural positional encodings: Laplacian eigenvectors, random walk features, or degree encodings give each node a position-aware representation based on graph structure.
  • Structure-biased attention: attention scores are modified by graph distance or edge features. Neighbors get higher base attention, but the model can still attend to distant nodes.
  • Local + global layers: GPS alternates MPNN layers (local structure) with transformer layers (global attention). Each type captures different patterns.
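
Structure-biased attention can be sketched as follows (a hand-set per-hop bias stands in for Graphormer's learned per-distance bias, and content scores are zeroed so the structural term is visible on its own — both illustrative assumptions):

```python
# Minimal sketch of distance-biased attention: attention logit =
# content score + b(shortest_path_distance(i, j)), Graphormer-style.
import math
from collections import deque

def bfs_distances(neighbors, src):
    """Shortest-path (hop) distance from src to every reachable node."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in neighbors[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def biased_attention_weights(neighbors, node, n, bias_per_hop=-1.0):
    # Content scores are set to 0 here, so only the structural bias acts.
    dist = bfs_distances(neighbors, node)
    logits = [bias_per_hop * dist[j] for j in range(n)]
    m = max(logits)
    w = [math.exp(s - m) for s in logits]
    z = sum(w)
    return [wi / z for wi in w]

# Path graph 0-1-2-3: near nodes get more weight, but distant nodes still
# receive non-zero attention (unlike a 1-layer GNN, which gives them none).
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
w = biased_attention_weights(neighbors, 0, 4)
print(w[1] > w[3] > 0.0)                   # → True
```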

On molecular benchmarks (ZINC, PCQM4Mv2), graph transformers outperform both pure GNNs and pure transformers. On enterprise relational data, KumoRFM's Relational Graph Transformer achieves 81.14 AUROC on RelBench, the best published result.

Practical guidance

  • Large sparse graphs (100K+ nodes): start with GNNs (GraphSAGE, GAT). Transformers are too expensive.
  • Small graphs (hundreds of nodes, e.g., molecules): try graph transformers (GPS, Graphormer). Best of both worlds.
  • Enterprise relational data: use a relational graph transformer (KumoRFM) that combines heterogeneous type handling with transformer attention.
  • Unknown structure: start with a transformer that can learn attention patterns. Add graph structure later if it helps.

Frequently asked questions

What is the fundamental difference between GNNs and transformers?

GNNs aggregate information from local graph neighbors (1-hop per layer). Transformers attend to all tokens/nodes simultaneously (global attention). GNNs exploit known graph structure; transformers learn attention patterns from data. GNNs are sparse (computation scales with edges); transformers are dense (computation scales quadratically with nodes).

When should you use a GNN over a transformer?

Use a GNN when: (1) the graph structure is known and meaningful (molecular bonds, social connections, database foreign keys), (2) local neighborhood patterns are the primary signal (2-3 hop patterns), (3) the graph is large (millions of nodes) and sparse, making global attention infeasible. GNNs are more parameter-efficient and scale better on large sparse graphs.

When should you use a transformer over a GNN?

Use a transformer when: (1) long-range dependencies matter (nodes 5+ hops apart need to interact), (2) the graph structure is noisy or incomplete (the true dependencies are not captured by the given edges), (3) the graph is small enough for global attention (thousands of nodes, not millions). Transformers can learn to attend to distant nodes that GNNs cannot reach without many layers.

What is a graph transformer?

A graph transformer combines GNN-style structural encoding with transformer-style global attention. Nodes attend to all other nodes (like a transformer) but attention scores are biased by graph structure (like a GNN). This captures both local structural patterns and long-range dependencies. Examples: GPS (General Powerful Scalable), GraphGPS, Graphormer. KumoRFM uses a Relational Graph Transformer.

Learn more about graph ML

PyTorch Geometric is the open-source foundation for graph neural networks. Explore more layers, concepts, and production patterns.