
Attention vs Message Passing: Fixed Aggregation vs Learned Importance Weights

Standard message passing treats all neighbors equally. Attention learns that some neighbors matter more than others. The difference is one function: fixed degree-normalization (GCN) vs learned importance scoring (GAT).

PyTorch Geometric

TL;DR

  • Fixed aggregation (GCN): all neighbors contribute equally (after degree normalization). The weight of neighbor j's message depends only on graph structure (degrees), not on node features.
  • Attention aggregation (GAT): each neighbor's contribution is weighted by a learned attention score computed from both source and target features. The model learns which neighbors are more relevant.
  • Attention is not an alternative to message passing. It is an upgrade to the aggregation step within message passing. GAT still follows the message-aggregate-update framework.
  • Attention provides the most benefit when neighbor relevance varies: fraud networks (suspicious vs normal connections), social networks (influential vs casual friends), heterogeneous graphs (different edge types).
  • Multi-head attention (using multiple independent attention functions) improves robustness by letting different heads focus on different aspects of neighbor relevance.

The distinction between fixed aggregation and attention is about how a node weights its neighbors during message passing. In GCN-style aggregation, the weight is determined by graph structure (node degrees). In GAT-style attention, the weight is learned from node features. Both operate within the same message passing framework. The difference is a single function: how aggregation weights are computed.

Fixed aggregation: GCN approach

GCNConv computes the aggregation weight for neighbor j's message to node i as:

weight(j → i) = 1 / sqrt(degree(i) * degree(j))

This is purely structural. A neighbor with degree 100 has its message scaled down (it sends the same message to many nodes). A target with degree 100 scales all incoming messages down (it receives many messages). The normalization prevents nodes with many neighbors from dominating.

The key property: these weights are fixed before training. They depend only on graph structure, not on what the nodes represent or what task you are solving.
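To make this concrete, here is a minimal NumPy sketch that computes the GCN normalization weights for a toy 4-node path graph. The graph, node indices, and `weight` helper are illustrative, not part of any library API; self-loops are ignored for simplicity.

```python
# Degree-normalized GCN weights for a toy path graph 0-1-2-3.
# Weights depend only on degrees, never on node features.
import numpy as np

edges = [(0, 1), (1, 2), (2, 3)]
n = 4
deg = np.zeros(n)
for i, j in edges:
    deg[i] += 1
    deg[j] += 1

def weight(j, i):
    """Aggregation weight of neighbor j's message to node i."""
    return 1.0 / np.sqrt(deg[i] * deg[j])

print(weight(0, 1))  # endpoint -> middle: 1/sqrt(1*2) ~ 0.707
print(weight(1, 2))  # middle -> middle:  1/sqrt(2*2) = 0.5
```

Note that these weights can be computed before any training happens: they are a property of the graph, not of the model.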

Attention aggregation: GAT approach

GATConv computes attention weights from node features:

attention_mechanism.py
# GAT attention mechanism (simplified, single head) for one target node i
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_in, d_out = 8, 4
W = torch.randn(d_out, d_in)        # shared linear projection
a = torch.randn(2 * d_out)          # learned attention vector

h_i = torch.randn(d_in)             # target node features
h_js = torch.randn(3, d_in)         # features of the 3 neighbors of i

# 1. Project source and target features
z_i = W @ h_i
z_js = h_js @ W.T                   # one row per neighbor j

# 2. Compute attention score e_ij for each edge (j -> i)
e = F.leaky_relu(torch.cat([z_i.expand(3, -1), z_js], dim=1) @ a,
                 negative_slope=0.2)

# 3. Normalize across all neighbors of i
alpha = torch.softmax(e, dim=0)     # weights sum to 1

# 4. Weighted aggregation
h_i_new = (alpha.unsqueeze(1) * z_js).sum(dim=0)

The attention weight alpha_ij depends on BOTH source (j) and target (i) features. It is learned during training, not fixed by structure.

The attention score e_ij is computed from a concatenation of the target and source node features, passed through a learned vector a and a LeakyReLU activation. Softmax normalization ensures weights sum to 1 across all neighbors.

Multi-head attention

Single-head attention computes one set of importance weights. Multi-head attention runs multiple independent attention functions in parallel, each potentially focusing on a different aspect of neighbor relevance:

  • Head 1 might attend to neighbors with similar features (homophily signal)
  • Head 2 might attend to neighbors with high degree (hub signal)
  • Head 3 might attend to recently connected neighbors (recency signal)

The outputs are concatenated (in intermediate layers) or averaged (in the final layer). Multi-head attention is more robust and captures richer patterns than single-head.

When attention helps most

  • Heterogeneous neighbor relevance: when some neighbors are much more informative than others. In a fraud network, one suspicious connection carries more signal than 100 normal ones. Attention learns to upweight it.
  • Heterogeneous graphs: different edge types have different importance. A customer-order edge carries different signal than a customer-session edge. Attention learns type-specific relevance without separate weights per type.
  • Interpretability needs: attention weights are inspectable. You can see which neighbors the model considered most important for a prediction, providing explainability.

When fixed aggregation suffices

  • Homogeneous importance: in molecules, all chemical bonds are structurally important. No bond should be ignored. Degree normalization works well.
  • Computational efficiency: attention adds parameters and computation per edge. On very large graphs, fixed aggregation is faster.
  • Isomorphism expressiveness: GINConv (sum aggregation, no weighting) is provably maximally expressive for distinguishing graph structures. Because attention weights are softmax-normalized to sum to 1, aggregation becomes a weighted mean, which discards neighborhood size and can make structurally different neighborhoods indistinguishable.
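The expressiveness point above can be demonstrated in a few lines. This is a toy NumPy sketch (the `normalized_agg` helper is illustrative, not a library function): any weighting scheme whose weights sum to 1 computes a weighted mean, so it cannot tell one neighbor with feature 2 apart from two such neighbors, while sum aggregation can.

```python
# Normalized (attention-style) weights vs sum aggregation on
# two neighborhoods that differ only in size.
import numpy as np

small = np.array([2.0])        # one neighbor with feature 2
large = np.array([2.0, 2.0])   # two neighbors with feature 2

def normalized_agg(msgs):
    w = np.ones_like(msgs) / len(msgs)   # uniform weights summing to 1
    return float((w * msgs).sum())

print(normalized_agg(small), normalized_agg(large))  # 2.0 2.0 (indistinguishable)
print(small.sum(), large.sum())                      # 2.0 4.0 (distinguishable)
```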

Frequently asked questions

What is fixed aggregation in message passing?

Fixed aggregation treats all neighbors equally or weights them only by fixed properties like node degree. GCNConv uses a degree-normalized sum: each neighbor's contribution is weighted by 1/sqrt(degree(i) * degree(j)). This weighting is fixed by graph structure and does not depend on node features. Every neighbor's message has equal importance after degree normalization.

How does attention-based aggregation differ?

Attention-based aggregation (GAT) computes a learned importance weight for each neighbor based on both the source and target node features. The model learns that some neighbors are more relevant than others for the current prediction. For fraud detection, a customer's high-risk transaction might get attention weight 0.8 while a routine transaction gets 0.1. These weights are computed by a learned attention function, not fixed by structure.

Is attention always better than fixed aggregation?

No. On homogeneous graphs where all neighbors are genuinely equally important (like molecular graphs where all bonds matter), GCN-style fixed aggregation performs comparably to attention. Attention adds parameters and computation. It provides the most benefit when neighbors vary significantly in relevance: social networks (some friends influence more), fraud networks (some connections are more suspicious), and heterogeneous graphs (different relationship types have different importance).

Can you combine attention with message passing?

Attention IS a form of message passing. GAT performs the standard message-aggregate-update loop but uses learned attention weights in the aggregation step instead of fixed weights. The two are not alternatives; attention is an upgrade to the aggregation step within the message passing framework. In PyG, GATConv inherits from MessagePassing just like GCNConv.

Learn more about graph ML

PyTorch Geometric is the open-source foundation for graph neural networks. Explore more layers, concepts, and production patterns.