
Over-Squashing: The Information Bottleneck in Multi-Hop Message Passing

Over-squashing occurs when information from distant parts of the graph must flow through bottleneck edges and fixed-size vectors. The signal degrades exponentially with distance, making long-range dependencies nearly impossible to capture with standard message passing.


TL;DR

  1. Over-squashing compresses information from exponentially many distant nodes into fixed-size vectors at intermediate nodes. Signal from k-hop neighbors degrades exponentially with k.
  2. It is caused by graph topology, not by aggregation choice. Bottleneck edges (bridges connecting large subgraphs) force too much information through too narrow a channel.
  3. Over-squashing is distinct from over-smoothing. Over-smoothing loses diversity (all nodes converge). Over-squashing loses signal (distant information becomes undetectable).
  4. Graph rewiring adds shortcut edges to bypass bottlenecks. Spectral-based rewiring (adding edges between nodes with high commute time) is the most effective approach.
  5. Graph transformers bypass over-squashing entirely with global attention. Every node attends directly to every other node, eliminating the hop-by-hop information flow.

Over-squashing is the information bottleneck in graph neural networks where messages from distant nodes are compressed into fixed-size vectors as they pass through intermediate nodes, causing exponential signal degradation with distance. A node 3 hops away from the target contributes its information through a chain of aggregations. At each hop, the signal is mixed with other messages and squeezed into a fixed-dimensional vector. By the time it arrives, the distant signal is a negligible fraction of the vector.

Why it matters for enterprise data

Enterprise relational databases contain long-range dependencies that span multiple tables and multiple hops. A customer's fraud risk depends on the behavior of other customers who share merchants with them (a 4-hop path: customer → transaction → merchant → transaction → other customer). A product's demand signal flows through supply chains with 4-5 intermediary hops.

Over-squashing makes these long-range signals undetectable. The fraud pattern in the multi-hop neighborhood must flow through the transaction and merchant nodes, where it competes with thousands of other messages for space in a 128-dimensional vector. The signal gets squashed to numerical insignificance.

How over-squashing works

Consider a target node v and a distant node u that is k hops away. The information from u must flow through a path of intermediate nodes. At each intermediate node, the message from u is aggregated with messages from all other neighbors.

The exponential growth problem

If each node has average degree d, then:

  • At 1 hop: ~d neighbors contribute messages
  • At 2 hops: ~d^2 nodes contribute (through 1-hop neighbors)
  • At 3 hops: ~d^3 nodes contribute
  • At k hops: ~d^k nodes contribute

All d^k contributions must be compressed into the same fixed-size vector (128 or 256 dimensions). Each individual contribution shrinks by a factor of roughly 1/d per hop. After 3 hops with average degree 20, each source node's contribution is approximately 1/8,000 of the vector.
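The arithmetic above can be sketched directly. This is a back-of-envelope illustration, assuming mean aggregation dilutes each message by roughly 1/d per hop; `per_source_fraction` is a hypothetical helper, not a library function:

```python
def per_source_fraction(d: int, k: int) -> float:
    """Approximate fraction of a node's vector attributable to one k-hop source,
    assuming each hop of mean aggregation dilutes a message by ~1/d."""
    return 1 / (d ** k)

for k in range(1, 4):
    sources = 20 ** k                     # ~d^k contributing nodes at hop k
    frac = per_source_fraction(20, k)
    print(f"{k} hops: ~{sources:,} sources, per-source fraction ~{frac:.2e}")
```

With average degree 20, the 3-hop fraction comes out to 1/8,000, matching the estimate above.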

Bottleneck edges amplify the problem

Over-squashing is worst at bottleneck edges: edges that serve as the only path connecting large subgraphs. If 5,000 nodes on one side of a bridge edge need to send information to the other side, all 5,000 messages must flow through a single intermediate node. The fixed-size vector at that node cannot faithfully represent 5,000 distinct messages.
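One simple way to locate such bottlenecks is a bridge check: an edge is a bridge if removing it disconnects its endpoints. A minimal pure-Python sketch on a hypothetical toy graph (two triangles joined by a single edge):

```python
from collections import defaultdict

def is_bridge(adj, u, v):
    """Return True if removing edge (u, v) disconnects u from v."""
    seen, stack = {u}, [u]
    while stack:
        x = stack.pop()
        for y in adj[x]:
            if (x, y) in {(u, v), (v, u)}:
                continue  # skip the candidate edge itself
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return v not in seen

# Toy graph: two triangles {0,1,2} and {3,4,5} joined by the edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
adj = defaultdict(list)
for a, b in edges:
    adj[a].append(b)
    adj[b].append(a)

bridges = [e for e in edges if is_bridge(adj, *e)]
print(bridges)  # [(2, 3)] — the only edge all cross-triangle messages must traverse
```

Every message between the two triangles must pass through (2, 3), so the representations at nodes 2 and 3 carry the entire cross-cluster information load.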

Concrete example: fraud signal lost in a payment graph

Consider a payments graph:

  • Account A (the target) is linked to Merchant M through 1 transaction
  • Merchant M has 10,000 transactions from 8,000 other accounts
  • Account B (a known fraudster) is 4 hops from Account A through Merchant M

The fraud signal from Account B must flow: B → B's transaction → Merchant M → A's transaction → A. At Merchant M, the signal from B is averaged with 9,999 other transaction messages. At A's transaction node, the already-diluted signal competes with merchant-level features. By the time it reaches A, the fraud signal is negligible.
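The dilution at Merchant M can be made concrete with scalar stand-ins for the message vectors. This is a toy sketch of mean aggregation, not the full model:

```python
# One distinctive fraud message among 10,000 transaction messages at Merchant M.
fraud_message = 1.0
benign_messages = [0.0] * 9_999

# Mean aggregation gives every message equal weight.
pooled = (fraud_message + sum(benign_messages)) / 10_000
print(pooled)  # 0.0001 — the fraud signal is 1/10,000 of the merchant's representation
```

Learned attention weights can shift this balance somewhat, but they cannot restore information once the fixed-size vector has discarded it.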

Measuring over-squashing

measure_over_squashing.py
import torch

def sensitivity_analysis(model, data, target_node, source_node):
    """Measure how much target_node's output depends on source_node's input."""
    data.x.requires_grad_(True)

    # Forward pass through the GNN
    out = model(data.x, data.edge_index)

    # Gradient of the target node's output w.r.t. every node's input features
    out[target_node].sum().backward()

    # Jacobian sensitivity: gradient norm at the source node's features
    sensitivity = data.x.grad[source_node].norm().item()
    data.x.grad = None  # reset so repeated calls do not accumulate gradients

    return sensitivity

# If sensitivity ~ 0 for a k-hop neighbor, over-squashing is occurring.
# Illustrative values at increasing hop distances:
# 1-hop: 0.45 (strong signal)
# 2-hop: 0.08 (reduced signal)
# 3-hop: 0.001 (squashed)

Jacobian sensitivity measures how much the target node's output changes when a source node's input is perturbed. Near-zero sensitivity at k hops indicates over-squashing.

Solutions

  • Graph rewiring: Add shortcut edges between distant nodes to reduce effective graph diameter and eliminate bottleneck edges. Most direct solution.
  • Multi-scale architectures: Use different receptive fields at different layers. Combine local (1-hop) and global (all-node) aggregation.
  • Graph transformers: Replace local message passing with global attention. Every node attends to every other node, bypassing the hop-by-hop bottleneck entirely. KumoRFM uses this approach.
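A minimal sketch of commute-time-based rewiring with plain numpy. The toy graph, the single-edge addition, and the exhaustive pair search are illustrative simplifications; production rewiring methods select many edges with more refined criteria:

```python
import numpy as np

# Two triangles joined by a bridge: the pair with the highest commute time
# straddles the bottleneck, so the rewirer adds a shortcut there.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
n = 6
A = np.zeros((n, n))
for a, b in edges:
    A[a, b] = A[b, a] = 1.0

L = np.diag(A.sum(axis=1)) - A      # graph Laplacian
Lp = np.linalg.pinv(L)              # Moore-Penrose pseudoinverse
vol = A.sum()                       # 2 * |E|

# Commute time: C(u, v) = vol * (Lp[u, u] + Lp[v, v] - 2 * Lp[u, v])
d = np.diag(Lp)
commute = vol * (d[:, None] + d[None, :] - 2 * Lp)

np.fill_diagonal(commute, -np.inf)  # ignore self-pairs
u, v = np.unravel_index(np.argmax(commute), commute.shape)
print(f"add shortcut edge ({u}, {v})")  # a pair from opposite triangles
```

The selected endpoints sit in different triangles, so the added shortcut lets messages bypass the bridge entirely.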

Limitations and what comes next

  1. Graph rewiring adds edges: Shortcut edges increase memory and computation. On large enterprise graphs, this can be significant. Careful selection of which edges to add (spectral-based methods) limits overhead.
  2. Global attention is O(n^2): Graph transformers solve over-squashing but at quadratic cost in the number of nodes. Subgraph sampling keeps this manageable for production systems.
  3. No free lunch: Removing bottlenecks can also remove useful inductive bias. Some tasks benefit from the graph's natural information flow constraints.

KumoRFM's Relational Graph Transformer combines local structure awareness with global attention, achieving 81.14 fine-tuned AUROC on RelBench tasks where local message passing GNNs plateau at 75.83.

Frequently asked questions

What is over-squashing in GNNs?

Over-squashing is the information bottleneck that occurs when messages from exponentially many distant nodes must pass through a fixed-size vector at intermediate nodes. A node 3 hops away contributes information that gets compressed at every hop. By the time it reaches the target, the signal has been squashed into a tiny fraction of the vector, often becoming numerically undetectable.

How is over-squashing different from over-smoothing?

Over-smoothing and over-squashing are distinct problems. Over-smoothing makes all node representations converge to similar values (loss of diversity). Over-squashing makes distant information undetectable at the receiving node (loss of signal). Over-smoothing is caused by repeated averaging. Over-squashing is caused by exponential growth in the number of source nodes relative to fixed vector capacity. Both limit GNN depth but for different reasons.
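The over-smoothing half of this contrast can be seen in a toy experiment: repeated mean aggregation on a 5-node path graph (with self-loops) drives every node toward the same value, a loss of diversity rather than a loss of signal. A minimal pure-Python sketch:

```python
# Only the last node on the path carries a signal.
features = [0.0, 0.0, 0.0, 0.0, 1.0]
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

# 200 rounds of mean aggregation over {self} ∪ neighbors.
for _ in range(200):
    features = [
        (features[i] + sum(features[j] for j in neighbors[i])) / (1 + len(neighbors[i]))
        for i in range(5)
    ]

print([round(f, 3) for f in features])  # all five nodes converge to the same value
```

After enough layers every node holds (nearly) the same number, so the representations can no longer distinguish one node from another, which is a different failure mode from a distant signal being squashed to zero.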

What causes over-squashing?

Over-squashing is caused by graph topology, specifically bottleneck edges. When a large portion of the graph is connected to a target node only through a narrow bridge (few edges), all information from that portion must flow through those bridge edges. The fixed-size vectors at bridge nodes cannot encode exponentially growing information from distant neighborhoods. Graphs with low connectivity or tree-like structures are most susceptible.

How does graph rewiring fix over-squashing?

Graph rewiring adds shortcut edges to the graph, creating alternative paths for information flow. By adding edges between distant nodes (based on spectral properties, commute time, or random walks), rewiring reduces the effective diameter of the graph and eliminates bottleneck edges. This allows distant information to reach target nodes without being squashed through narrow bridges.

Does over-squashing affect enterprise relational data?

Yes. Enterprise graphs often have bottleneck structures. A customer connected to a merchant through a single transaction creates a narrow bridge. Information about the merchant's fraud history must pass through that single transaction node to reach the customer. If the merchant has thousands of other transactions, the fraud signal gets squashed. Graph rewiring or graph transformers solve this by providing direct paths.

Learn more about graph ML

PyTorch Geometric is the open-source foundation for graph neural networks. Explore more layers, concepts, and production patterns.