The distinction between fixed aggregation and attention is about how a node weights its neighbors during message passing. In GCN-style aggregation, the weight is determined by graph structure (node degrees). In GAT-style attention, the weight is learned from node features. Both operate within the same message passing framework. The difference is a single function: how aggregation weights are computed.
Fixed aggregation: GCN approach
GCNConv computes the aggregation weight for neighbor j's message to node i as:
weight(j → i) = 1 / sqrt(degree(i) * degree(j))
This is purely structural. A neighbor with degree 100 has its message scaled down (it sends the same message to many nodes). A target with degree 100 scales all incoming messages down (it receives many messages). The normalization prevents nodes with many neighbors from dominating.
The key property: these weights are fixed before training. They depend only on graph structure, not on what the nodes represent or what task you are solving.
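To make this concrete, here is a minimal sketch that computes these structural weights for a hypothetical toy graph. Note that the edge list, node count, and the omission of self-loops are simplifications for illustration (GCNConv adds self-loops by default):

```python
import numpy as np

# Hypothetical toy graph: undirected edges, self-loops omitted for brevity
edges = [(0, 1), (1, 2), (2, 0), (3, 0)]
num_nodes = 4

# Degree of each node
deg = np.zeros(num_nodes)
for u, v in edges:
    deg[u] += 1
    deg[v] += 1
# deg = [3, 2, 2, 1]

# weight(j -> i) = 1 / sqrt(degree(i) * degree(j)) -- fixed before training
weight = {(j, i): 1.0 / np.sqrt(deg[i] * deg[j]) for j, i in edges}
# e.g. low-degree node 3 sends to node 0 with weight 1/sqrt(3*1) ~ 0.577
```

The weights depend only on the edge list; no learning is involved, so they stay constant no matter what features the nodes carry.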
Attention aggregation: GAT approach
GATConv computes attention weights from node features:
# GAT attention mechanism (simplified, single head)
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_update(h, W, a, i, neighbors):
    # 1. Project source and target features with the shared weight matrix W
    z = h @ W.T                                  # z[k] = W @ h[k]
    # 2. Compute attention score for each edge (j -> i)
    e = np.array([leaky_relu(a @ np.concatenate([z[i], z[j]])) for j in neighbors])
    # 3. Normalize across all neighbors of i
    alpha = np.exp(e - e.max())
    alpha = alpha / alpha.sum()
    # 4. Weighted aggregation
    return sum(a_ij * z[j] for a_ij, j in zip(alpha, neighbors))

The attention weight alpha_ij depends on BOTH the source (j) and target (i) features. It is learned during training, not fixed by structure.
The attention score e_ij is computed from a concatenation of the target and source node features, passed through a learned vector a and a LeakyReLU activation. Softmax normalization ensures weights sum to 1 across all neighbors.
Multi-head attention
Single-head attention computes one set of importance weights. Multi-head attention runs multiple independent attention functions in parallel, each potentially focusing on a different aspect of neighbor relevance:
- Head 1 might attend to neighbors with similar features (homophily signal)
- Head 2 might attend to neighbors with high degree (hub signal)
- Head 3 might attend to recently connected neighbors (recency signal)
The head outputs are concatenated in intermediate layers (giving heads × out_dim features) or averaged in the final layer so the output dimension stays fixed. Multi-head attention tends to be more robust and to capture richer patterns than a single head.
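The concatenate-vs-average behavior can be sketched in a few lines of numpy. All parameters here are hypothetical random values standing in for learned weights; each head gets its own projection matrix and attention vector:

```python
import numpy as np

rng = np.random.default_rng(0)
num_nodes, in_dim, out_dim, num_heads = 5, 8, 4, 3

h = rng.standard_normal((num_nodes, in_dim))           # node features
Ws = rng.standard_normal((num_heads, out_dim, in_dim)) # one projection per head
attn = rng.standard_normal((num_heads, 2 * out_dim))   # one attention vector per head

def head_output(h, W, a, i, neighbors):
    z = h @ W.T
    scores = np.array([a @ np.concatenate([z[i], z[j]]) for j in neighbors])
    scores = np.maximum(0.2 * scores, scores)          # LeakyReLU
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()                        # softmax over neighbors
    return sum(w * z[j] for w, j in zip(alpha, neighbors))

i, neighbors = 0, [1, 2, 4]
outs = [head_output(h, Ws[k], attn[k], i, neighbors) for k in range(num_heads)]

h_concat = np.concatenate(outs)  # intermediate layers: shape (num_heads * out_dim,)
h_mean = np.mean(outs, axis=0)   # final layer: shape (out_dim,)
```

Concatenation preserves each head's distinct view at the cost of a wider representation; averaging keeps the final output at the task's expected dimension.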
When attention helps most
- Heterogeneous neighbor relevance: when some neighbors are much more informative than others. In a fraud network, one suspicious connection carries more signal than 100 normal ones. Attention learns to upweight it.
- Heterogeneous graphs: different edge types have different importance. A customer-order edge carries different signal than a customer-session edge. Attention learns type-specific relevance without separate weights per type.
- Interpretability needs: attention weights are inspectable. You can see which neighbors the model considered most important for a prediction, providing explainability.
When fixed aggregation suffices
- Homogeneous importance: in molecules, all chemical bonds are structurally important. No bond should be ignored. Degree normalization works well.
- Computational efficiency: attention adds parameters and computation per edge. On very large graphs, fixed aggregation is faster.
- Isomorphism expressiveness: GINConv (sum aggregation, no weighting) is provably maximally expressive for distinguishing graph structures. Because softmax forces attention weights to sum to 1 over each neighborhood, attention behaves like a weighted mean and cannot count neighbors, which can reduce expressiveness relative to sum aggregation.
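The expressiveness point can be seen in a small sketch: when neighbors share identical features, softmax-normalized weights collapse neighborhoods of different sizes to the same output, while sum aggregation keeps them distinct. The feature vector and neighborhood sizes below are hypothetical:

```python
import numpy as np

x = np.array([1.0, 2.0])
small = [x]          # one neighbor
large = [x, x, x]    # three identical neighbors

# Sum aggregation (GIN-style) distinguishes the two neighborhoods
sum_small = np.sum(small, axis=0)   # [1, 2]
sum_large = np.sum(large, axis=0)   # [3, 6]

def softmax(e):
    z = np.exp(e - np.max(e))
    return z / z.sum()

# Identical features give identical scores, so softmax weights are uniform
# and the weighted aggregation yields the same vector for both neighborhoods
att_small = sum(w * v for w, v in zip(softmax(np.zeros(1)), small))  # [1, 2]
att_large = sum(w * v for w, v in zip(softmax(np.zeros(3)), large))  # [1, 2]
```

Sum aggregation preserves neighborhood size information (1x vs 3x), which is exactly what the Weisfeiler-Leman-style distinguishing power of GIN relies on.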