Original Paper
Attention-based Graph Neural Network for Semi-Supervised Learning
Thekumparampil et al. (2018). arXiv 2018
What AGNNConv does
AGNNConv performs attention-weighted propagation using cosine similarity:
- Compute cosine similarity between each node and its neighbors
- Normalize via softmax to get attention weights
- Propagate neighbor features weighted by attention; a learnable scalar beta acts as a softmax temperature
The math (simplified)
# Cosine attention with learnable temperature beta
alpha_ij = softmax_j( beta · cos(h_i, h_j) )
         = softmax_j( beta · h_i^T · h_j / (||h_i|| · ||h_j||) )
# Propagation
h_i' = Σ_j alpha_ij · h_j
Where:
cos(h_i, h_j) = cosine similarity (no learnable params)
beta = learnable scalar temperature (the ONLY parameter per layer)
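The equations above can be sketched in plain PyTorch for a small dense graph. This is a toy illustration, not PyG's sparse implementation; the function name `agnn_step` and the dense adjacency mask are assumptions made for the sketch:

```python
import torch

def agnn_step(h, adj, beta):
    # h: (N, d) node features, adj: (N, N) 0/1 adjacency, beta: scalar
    h_norm = h / h.norm(dim=1, keepdim=True).clamp(min=1e-12)
    cos = h_norm @ h_norm.t()                    # pairwise cosine similarities
    logits = beta * cos                          # beta sharpens/flattens the softmax
    logits = logits.masked_fill(adj == 0, float('-inf'))  # attend to neighbors only
    alpha = torch.softmax(logits, dim=1)         # alpha_ij, rows sum to 1
    return alpha @ h                             # h_i' = Σ_j alpha_ij · h_j

h = torch.randn(4, 8)
adj = torch.ones(4, 4)                           # toy fully connected graph
out = agnn_step(h, adj, beta=torch.tensor(1.0))
print(out.shape)  # torch.Size([4, 8])
```

Note that beta is the only quantity here that would carry a gradient in a real layer; everything else is parameter-free.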
Compare to GATConv:
GATConv params per layer: W (d*d) + a (2d) = d^2 + 2d
AGNNConv params per layer: beta (1 scalar)

AGNNConv has d^2 + 2d fewer parameters per layer than GATConv. On small labeled datasets, this difference matters significantly for generalization.
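As a quick arithmetic check of the counts above (bias terms ignored, matching the formulas):

```python
# Per-layer parameter counts for hidden size d
d = 64
gat_params = d * d + 2 * d       # W is d×d, attention vector a is 2d
agnn_params = 1                  # a single scalar beta
print(gat_params)                # 4224
print(gat_params - agnn_params)  # 4223
```

At d = 64, each GATConv layer carries over four thousand parameters where AGNNConv carries one.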
PyG implementation
import torch
import torch.nn.functional as F
from torch_geometric.nn import AGNNConv

class AGNN(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels,
                 num_layers=4):
        super().__init__()
        self.lin1 = torch.nn.Linear(in_channels, hidden_channels)
        self.convs = torch.nn.ModuleList(
            [AGNNConv() for _ in range(num_layers)]
        )
        self.lin2 = torch.nn.Linear(hidden_channels, out_channels)

    def forward(self, x, edge_index):
        x = F.relu(self.lin1(x))
        for conv in self.convs:
            x = conv(x, edge_index)
        return self.lin2(x)

# Note: AGNNConv takes no dimension arguments - it operates on
# whatever dimension the input has. The linear layers handle
# dimension changes.

model = AGNN(dataset.num_features, 64, dataset.num_classes,
             num_layers=4)
When to use AGNNConv
- Few labeled examples. With only 20-100 labeled nodes, AGNNConv's single-parameter-per-layer design avoids the overfitting that GATConv's larger parameter count can cause.
- Deeper attention models. Stack 4-8 AGNNConv layers without parameter explosion. Each layer adds just one scalar parameter.
- When cosine similarity is a good attention proxy. If similar nodes should attend to each other (homophilic graphs), cosine similarity naturally produces good attention weights.
When not to use AGNNConv
- When attention patterns are complex. Cosine similarity captures feature alignment only. For tasks where attention depends on complex feature interactions, GATConv or TransformerConv is more expressive.
- Heterogeneous graphs. AGNNConv has no type-specific parameters. Use HGTConv for multi-type graphs.