
Attention vs Message Passing: Fixed Aggregation vs Learned Importance Weights

Standard message passing treats all neighbors equally. Attention learns that some neighbors matter more than others. The difference is one function: fixed degree-normalization (GCN) vs learned importance scoring (GAT).

PyTorch Geometric

TL;DR

  • Fixed aggregation (GCN): all neighbors contribute equally (after degree normalization). The weight of neighbor j's message depends only on graph structure (degrees), not on node features.
  • Attention aggregation (GAT): each neighbor's contribution is weighted by a learned attention score computed from both source and target features. The model learns which neighbors are more relevant.
  • Attention is not an alternative to message passing. It is an upgrade to the aggregation step within message passing. GAT still follows the message-aggregate-update framework.
  • Attention provides the most benefit when neighbor relevance varies: fraud networks (suspicious vs normal connections), social networks (influential vs casual friends), heterogeneous graphs (different edge types).
  • Multi-head attention (using multiple independent attention functions) improves robustness by letting different heads focus on different aspects of neighbor relevance.

The distinction between fixed aggregation and attention is about how a node weights its neighbors during message passing. In GCN-style aggregation, the weight is determined by graph structure (node degrees). In GAT-style attention, the weight is learned from node features. Both operate within the same message passing framework. The difference is a single function: how aggregation weights are computed.

Fixed aggregation: GCN approach

GCNConv computes the aggregation weight for neighbor j's message to node i as:

weight(j → i) = 1 / sqrt(degree(i) * degree(j))

This is purely structural. A neighbor with degree 100 has its message scaled down (it sends the same message to many nodes). A target with degree 100 scales all incoming messages down (it receives many messages). The normalization prevents nodes with many neighbors from dominating.

The key property: these weights are fixed before training. They depend only on graph structure, not on what the nodes represent or what task you are solving.
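To make this concrete, here is a minimal NumPy sketch that computes the GCN normalization weights for a toy 4-node path graph. The graph, node indices, and `weight` helper are illustrative, not part of any library API; self-loops are ignored for simplicity.

```python
# Degree-normalized GCN weights for a toy path graph 0-1-2-3.
# Weights depend only on degrees, never on node features.
import numpy as np

edges = [(0, 1), (1, 2), (2, 3)]
n = 4
deg = np.zeros(n)
for i, j in edges:
    deg[i] += 1
    deg[j] += 1

def weight(j, i):
    """Aggregation weight of neighbor j's message to node i."""
    return 1.0 / np.sqrt(deg[i] * deg[j])

print(weight(0, 1))  # endpoint -> middle: 1/sqrt(1*2) ~ 0.707
print(weight(1, 2))  # middle -> middle:  1/sqrt(2*2) = 0.5
```

Note that these weights can be computed before any training happens: they are a property of the graph, not of the model.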

Attention aggregation: GAT approach

GATConv computes attention weights from node features:

attention_mechanism.py
# GAT attention mechanism (simplified, single head) for one target node i
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_in, d_out = 8, 4
W = torch.randn(d_out, d_in)        # shared linear projection
a = torch.randn(2 * d_out)          # learned attention vector

h_i = torch.randn(d_in)             # target node features
h_js = torch.randn(3, d_in)         # features of the 3 neighbors of i

# 1. Project source and target features
z_i = W @ h_i
z_js = h_js @ W.T                   # one row per neighbor j

# 2. Compute attention score e_ij for each edge (j -> i)
e = F.leaky_relu(torch.cat([z_i.expand(3, -1), z_js], dim=1) @ a,
                 negative_slope=0.2)

# 3. Normalize across all neighbors of i
alpha = torch.softmax(e, dim=0)     # weights sum to 1

# 4. Weighted aggregation
h_i_new = (alpha.unsqueeze(1) * z_js).sum(dim=0)

The attention weight alpha_ij depends on BOTH source (j) and target (i) features. It is learned during training, not fixed by structure.

The attention score e_ij is computed from a concatenation of the target and source node features, passed through a learned vector a and a LeakyReLU activation. Softmax normalization ensures weights sum to 1 across all neighbors.

Multi-head attention

Single-head attention computes one set of importance weights. Multi-head attention runs multiple independent attention functions in parallel, each potentially focusing on a different aspect of neighbor relevance:

  • Head 1 might attend to neighbors with similar features (homophily signal)
  • Head 2 might attend to neighbors with high degree (hub signal)
  • Head 3 might attend to recently connected neighbors (recency signal)

The outputs are concatenated (in intermediate layers) or averaged (in the final layer). Multi-head attention is more robust and captures richer patterns than single-head.

When attention helps most

  • Heterogeneous neighbor relevance: when some neighbors are much more informative than others. In a fraud network, one suspicious connection carries more signal than 100 normal ones. Attention learns to upweight it.
  • Heterogeneous graphs: different edge types have different importance. A customer-order edge carries different signal than a customer-session edge. Attention learns type-specific relevance without separate weights per type.
  • Interpretability needs: attention weights are inspectable. You can see which neighbors the model considered most important for a prediction, providing explainability.

When fixed aggregation suffices

  • Homogeneous importance: in molecules, all chemical bonds are structurally important. No bond should be ignored. Degree normalization works well.
  • Computational efficiency: attention adds parameters and computation per edge. On very large graphs, fixed aggregation is faster.
  • Isomorphism expressiveness: GINConv (sum aggregation, no weighting) is provably maximally expressive for distinguishing graph structures. Because attention weights are softmax-normalized to sum to 1, aggregation becomes a weighted mean, which discards neighborhood size and can make structurally different neighborhoods indistinguishable.
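The expressiveness point above can be demonstrated in a few lines. This is a toy NumPy sketch (the `normalized_agg` helper is illustrative, not a library function): any weighting scheme whose weights sum to 1 computes a weighted mean, so it cannot tell one neighbor with feature 2 apart from two such neighbors, while sum aggregation can.

```python
# Normalized (attention-style) weights vs sum aggregation on
# two neighborhoods that differ only in size.
import numpy as np

small = np.array([2.0])        # one neighbor with feature 2
large = np.array([2.0, 2.0])   # two neighbors with feature 2

def normalized_agg(msgs):
    w = np.ones_like(msgs) / len(msgs)   # uniform weights summing to 1
    return float((w * msgs).sum())

print(normalized_agg(small), normalized_agg(large))  # 2.0 2.0 (indistinguishable)
print(small.sum(), large.sum())                      # 2.0 4.0 (distinguishable)
```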

Frequently asked questions

What is fixed aggregation in message passing?

Fixed aggregation treats all neighbors equally or weights them only by fixed properties like node degree. GCNConv uses a degree-normalized sum: each neighbor's contribution is weighted by 1/sqrt(degree(i) * degree(j)). This weighting is fixed by graph structure and does not depend on node features. Every neighbor's message has equal importance after degree normalization.

How does attention-based aggregation differ?

Attention-based aggregation (GAT) computes a learned importance weight for each neighbor based on both the source and target node features. The model learns that some neighbors are more relevant than others for the current prediction. For fraud detection, a customer's high-risk transaction might get attention weight 0.8 while a routine transaction gets 0.1. These weights are computed by a learned attention function, not fixed by structure.

Is attention always better than fixed aggregation?

No. On homogeneous graphs where all neighbors are genuinely equally important (like molecular graphs where all bonds matter), GCN-style fixed aggregation performs comparably to attention. Attention adds parameters and computation. It provides the most benefit when neighbors vary significantly in relevance: social networks (some friends influence more), fraud networks (some connections are more suspicious), and heterogeneous graphs (different relationship types have different importance).

Can you combine attention with message passing?

Attention IS a form of message passing. GAT performs the standard message-aggregate-update loop but uses learned attention weights in the aggregation step instead of fixed weights. The two are not alternatives; attention is an upgrade to the aggregation step within the message passing framework. In PyG, GATConv inherits from MessagePassing just like GCNConv.

Learn more about graph ML

PyTorch Geometric is the open-source foundation for graph neural networks. Explore more layers, concepts, and production patterns.