Over-squashing is the information bottleneck in graph neural networks where messages from distant nodes are compressed into fixed-size vectors as they pass through intermediate nodes, causing exponential signal degradation with distance. A node 3 hops away from the target contributes its information through a chain of aggregations. At each hop, the signal is mixed with other messages and squeezed into a fixed-dimensional vector. By the time it arrives, the distant signal is a negligible fraction of the vector.
Why it matters for enterprise data
Enterprise relational databases contain long-range dependencies that span multiple tables and multiple hops. A customer's fraud risk depends on the behavior of other customers who share merchants with them (a 4-hop path: customer → transaction → merchant → transaction → other customer). A product's demand signal flows through supply chains with 4-5 intermediary hops.
Over-squashing makes these long-range signals undetectable. The fraud pattern in the 4-hop neighborhood must flow through the transaction and merchant nodes, where it competes with thousands of other messages for space in a 128-dimensional vector. The signal gets squashed to numerical insignificance.
How over-squashing works
Consider a target node v and a distant node u that is k hops away. The information from u must flow through a path of intermediate nodes. At each intermediate node, the message from u is aggregated with messages from all other neighbors.
The exponential growth problem
If each node has average degree d, then:
- At 1 hop: ~d neighbors contribute messages
- At 2 hops: ~d^2 nodes contribute (through 1-hop neighbors)
- At 3 hops: ~d^3 nodes contribute
- At k hops: ~d^k nodes contribute
All d^k contributions must be compressed into the same fixed-size vector (128 or 256 dimensions). Each individual contribution shrinks by a factor of roughly 1/d per hop. After 3 hops with average degree 20, each source node's contribution is approximately 1/8,000 of the vector.
Bottleneck edges amplify the problem
Over-squashing is worst at bottleneck edges: edges that serve as the only path connecting large subgraphs. If 5,000 nodes on one side of a bridge edge need to send information to the other side, all 5,000 messages must flow through a single intermediate node. The fixed-size vector at that node cannot faithfully represent 5,000 distinct messages.
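Bottleneck edges of this kind are exactly the bridges of the graph, and they are easy to find. A minimal sketch using NetworkX (the toy two-cluster graph here is an assumption for illustration):

```python
import networkx as nx

# Toy graph: two dense clusters joined by a single bridge edge.
G = nx.complete_graph(range(0, 6))           # cluster 1: nodes 0-5
G.update(nx.complete_graph(range(6, 12)))    # cluster 2: nodes 6-11
G.add_edge(5, 6)                             # the bottleneck edge

# nx.bridges finds edges whose removal disconnects the graph --
# the worst-case bottlenecks for over-squashing.
print(list(nx.bridges(G)))  # the single bridge between nodes 5 and 6
```

All information exchanged between the two clusters must pass through the vectors at nodes 5 and 6, no matter how wide those vectors are.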
Concrete example: fraud signal lost in a payment graph
Consider a payments graph:
- Account A (the target) is linked to Merchant M through 1 transaction
- Merchant M has 10,000 transactions from 8,000 other accounts
- Account B (a known fraudster) is 4 hops from Account A through Merchant M
The fraud signal from Account B must flow along the path B → B's transaction → Merchant M → A's transaction → A. At Merchant M, the signal from B is averaged with 9,999 other transaction messages. At A's transaction node, the already-diluted signal competes with merchant-level features. By the time it reaches A, the fraud signal is negligible.
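The dilution at Merchant M is easy to reproduce: one "fraud" message mean-aggregated with thousands of benign messages all but vanishes. A minimal sketch in plain PyTorch, using synthetic random vectors rather than real payment features:

```python
import torch

torch.manual_seed(0)
dim = 128
fraud_msg = torch.randn(dim)         # message from Account B's transaction
benign = torch.randn(9_999, dim)     # Merchant M's other transaction messages

# Mean aggregation at Merchant M, as in a GCN / mean-pooling layer
merchant_state = torch.cat([fraud_msg.unsqueeze(0), benign]).mean(dim=0)

# How much of the merchant's state still reflects the fraud message?
cos = torch.nn.functional.cosine_similarity(
    fraud_msg, merchant_state, dim=0
).item()
print(f"cosine similarity to fraud message: {cos:.4f}")  # close to 0
```

The merchant's aggregated state is nearly orthogonal to the fraud message: the one signal that matters contributes roughly 1/10,000 of the vector.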
Measuring over-squashing
```python
import torch

def sensitivity_analysis(model, data, target_node, source_node):
    """Measure how much target_node's output depends on source_node's input."""
    data.x.requires_grad_(True)
    data.x.grad = None  # clear any gradient left over from a previous call
    # Forward pass
    out = model(data.x, data.edge_index)
    # Gradient of the target node's output w.r.t. every input feature
    out[target_node].sum().backward()
    # Norm of the gradient rows belonging to the source node
    sensitivity = data.x.grad[source_node].norm().item()
    return sensitivity

# If sensitivity ~ 0 for a k-hop neighbor, over-squashing is occurring.
# Illustrative sensitivities at different hop distances:
# 1-hop: 0.45 (strong signal)
# 2-hop: 0.08 (reduced signal)
# 3-hop: 0.001 (squashed)
```

Jacobian sensitivity measures how much the target node's output changes when a source node's input is perturbed. Near-zero sensitivity at k hops indicates over-squashing.
Solutions
- Graph rewiring: Add shortcut edges between distant nodes to reduce effective graph diameter and eliminate bottleneck edges. Most direct solution.
- Multi-scale architectures: Use different receptive fields at different layers. Combine local (1-hop) and global (all-node) aggregation.
- Graph transformers: Replace local message passing with global attention. Every node attends to every other node, bypassing the hop-by-hop bottleneck entirely. KumoRFM uses this approach.
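Graph rewiring, the first option above, can be sketched in a few lines. This naive version (an illustration, not a production method; spectral approaches choose edges far more selectively) adds a shortcut between every node pair farther apart than a threshold:

```python
import networkx as nx

def rewire_long_paths(G: nx.Graph, max_dist: int = 3) -> nx.Graph:
    """Add a shortcut edge between every node pair whose shortest-path
    distance exceeds max_dist, shrinking the effective diameter."""
    H = G.copy()
    dist = dict(nx.all_pairs_shortest_path_length(G))
    for u in G:
        for v, d in dist[u].items():
            if d > max_dist:
                H.add_edge(u, v)
    return H

G = nx.path_graph(8)                    # a chain of 8 nodes, diameter 7
H = rewire_long_paths(G, max_dist=3)
print(nx.diameter(G), nx.diameter(H))   # diameter drops from 7 to <= 3
```

After rewiring, every pair of nodes is either within 3 hops or directly connected, so no message needs more than a few aggregation steps. The cost is the extra edges, which is exactly the memory/computation trade-off noted below.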
Limitations and what comes next
- Graph rewiring adds edges: Shortcut edges increase memory and computation, which can be significant on large enterprise graphs. Careful selection of which edges to add (for example, with spectral rewiring methods) limits the overhead.
- Global attention is O(n^2): Graph transformers solve over-squashing but at quadratic cost in the number of nodes. Subgraph sampling keeps this manageable for production systems.
- No free lunch: Removing bottlenecks can also remove useful inductive bias. Some tasks benefit from the graph's natural information flow constraints.
KumoRFM's Relational Graph Transformer combines local structure awareness with global attention, achieving 81.14 fine-tuned AUROC on RelBench tasks where local message passing GNNs plateau at 75.83.