Over-squashing is the information bottleneck in graph neural networks where messages from distant nodes are compressed into fixed-size vectors as they pass through intermediate nodes, causing exponential signal degradation with distance. A node 3 hops away from the target contributes its information through a chain of aggregations. At each hop, the signal is mixed with other messages and squeezed into a fixed-dimensional vector. By the time it arrives, the distant signal is a negligible fraction of the vector.
Why it matters for enterprise data
Enterprise relational databases contain long-range dependencies that span multiple tables and multiple hops. A customer's fraud risk depends on the behavior of other customers who share merchants with them (a 4-hop path: customer → transaction → merchant → transaction → other customer). A product's demand signal flows through supply chains with 4-5 intermediary hops.
Over-squashing makes these long-range signals undetectable. The fraud pattern in the 4-hop neighborhood must flow through the transaction and merchant nodes, where it competes with thousands of other messages for space in a 128-dimensional vector. The signal gets squashed to numerical insignificance.
How over-squashing works
Consider a target node v and a distant node u that is k hops away. The information from u must flow through a path of intermediate nodes. At each intermediate node, the message from u is aggregated with messages from all other neighbors.
The exponential growth problem
If each node has average degree d, then:
- At 1 hop: ~d neighbors contribute messages
- At 2 hops: ~d^2 nodes contribute (through 1-hop neighbors)
- At 3 hops: ~d^3 nodes contribute
- At k hops: ~d^k nodes contribute
All d^k contributions must be compressed into the same fixed-size vector (128 or 256 dimensions). Each individual contribution shrinks by a factor of roughly 1/d per hop. After 3 hops with average degree 20, each source node's contribution is approximately 1/8,000 of the vector.
Bottleneck edges amplify the problem
Over-squashing is worst at bottleneck edges: edges that serve as the only path connecting large subgraphs. If 5,000 nodes on one side of a bridge edge need to send information to the other side, all 5,000 messages must flow through a single intermediate node. The fixed-size vector at that node cannot faithfully represent 5,000 distinct messages.
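Bottleneck edges of this kind are exactly the bridges of the graph, and they are easy to find. A minimal sketch using NetworkX (the toy two-cluster graph here is an assumption for illustration):

```python
import networkx as nx

# Toy graph: two dense clusters joined by a single bridge edge.
G = nx.complete_graph(range(0, 6))           # cluster 1: nodes 0-5
G.update(nx.complete_graph(range(6, 12)))    # cluster 2: nodes 6-11
G.add_edge(5, 6)                             # the bottleneck edge

# nx.bridges finds edges whose removal disconnects the graph --
# the worst-case bottlenecks for over-squashing.
print(list(nx.bridges(G)))  # the single bridge between nodes 5 and 6
```

All information exchanged between the two clusters must pass through the vectors at nodes 5 and 6, no matter how wide those vectors are.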
Concrete example: fraud signal lost in a payment graph
Consider a payments graph:
- Account A (the target) is linked to Merchant M through 1 transaction
- Merchant M has 10,000 transactions from 8,000 other accounts
- Account B (a known fraudster) is 4 hops from Account A through Merchant M
The fraud signal from Account B must flow along the path B → B's transaction → Merchant M → A's transaction → A. At Merchant M, the signal from B is averaged with 9,999 other transaction messages. At A's transaction node, the already-diluted signal competes with merchant-level features. By the time it reaches A, the fraud signal is negligible.
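The dilution at Merchant M is easy to reproduce: one "fraud" message mean-aggregated with thousands of benign messages all but vanishes. A minimal sketch in plain PyTorch, using synthetic random vectors rather than real payment features:

```python
import torch

torch.manual_seed(0)
dim = 128
fraud_msg = torch.randn(dim)         # message from Account B's transaction
benign = torch.randn(9_999, dim)     # Merchant M's other transaction messages

# Mean aggregation at Merchant M, as in a GCN / mean-pooling layer
merchant_state = torch.cat([fraud_msg.unsqueeze(0), benign]).mean(dim=0)

# How much of the merchant's state still reflects the fraud message?
cos = torch.nn.functional.cosine_similarity(
    fraud_msg, merchant_state, dim=0
).item()
print(f"cosine similarity to fraud message: {cos:.4f}")  # close to 0
```

The merchant's aggregated state is nearly orthogonal to the fraud message: the one signal that matters contributes roughly 1/10,000 of the vector.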
Measuring over-squashing
```python
import torch

def sensitivity_analysis(model, data, target_node, source_node):
    """Measure how much target_node's output depends on source_node's input."""
    data.x.requires_grad_(True)
    data.x.grad = None  # clear any gradient left over from a previous call
    # Forward pass
    out = model(data.x, data.edge_index)
    # Gradient of the target node's output w.r.t. every input feature
    out[target_node].sum().backward()
    # Norm of the gradient rows belonging to the source node
    sensitivity = data.x.grad[source_node].norm().item()
    return sensitivity

# If sensitivity ~ 0 for a k-hop neighbor, over-squashing is occurring.
# Illustrative sensitivities at different hop distances:
# 1-hop: 0.45 (strong signal)
# 2-hop: 0.08 (reduced signal)
# 3-hop: 0.001 (squashed)
```

Jacobian sensitivity measures how much the target node's output changes when a source node's input is perturbed. Near-zero sensitivity at k hops indicates over-squashing.
Solutions
- Graph rewiring: Add shortcut edges between distant nodes to reduce effective graph diameter and eliminate bottleneck edges. Most direct solution.
- Multi-scale architectures: Use different receptive fields at different layers. Combine local (1-hop) and global (all-node) aggregation.
- Graph transformers: Replace local message passing with global attention. Every node attends to every other node, bypassing the hop-by-hop bottleneck entirely. KumoRFM uses this approach.
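Graph rewiring, the first option above, can be sketched in a few lines. This naive version (an illustration, not a production method; spectral approaches choose edges far more selectively) adds a shortcut between every node pair farther apart than a threshold:

```python
import networkx as nx

def rewire_long_paths(G: nx.Graph, max_dist: int = 3) -> nx.Graph:
    """Add a shortcut edge between every node pair whose shortest-path
    distance exceeds max_dist, shrinking the effective diameter."""
    H = G.copy()
    dist = dict(nx.all_pairs_shortest_path_length(G))
    for u in G:
        for v, d in dist[u].items():
            if d > max_dist:
                H.add_edge(u, v)
    return H

G = nx.path_graph(8)                    # a chain of 8 nodes, diameter 7
H = rewire_long_paths(G, max_dist=3)
print(nx.diameter(G), nx.diameter(H))   # diameter drops from 7 to <= 3
```

After rewiring, every pair of nodes is either within 3 hops or directly connected, so no message needs more than a few aggregation steps. The cost is the extra edges, which is exactly the memory/computation trade-off noted below.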
Limitations and what comes next
- Graph rewiring adds edges: Shortcut edges increase memory and computation, which can be significant on large enterprise graphs. Careful selection of which edges to add (for example, with spectral rewiring methods) limits the overhead.
- Global attention is O(n^2): Graph transformers solve over-squashing but at quadratic cost in the number of nodes. Subgraph sampling keeps this manageable for production systems.
- No free lunch: Removing bottlenecks can also remove useful inductive bias. Some tasks benefit from the graph's natural information flow constraints.
KumoRFM's Relational Graph Transformer combines local structure awareness with global attention, achieving 81.14 fine-tuned AUROC on RelBench tasks where local message passing GNNs plateau at 75.83.