
AGNNConv: Minimalist Attention for Semi-Supervised Learning

AGNNConv uses cosine similarity as a parameter-free attention mechanism. With just one learnable scalar per layer, it achieves attention-like behavior with minimal overfitting risk. It is the lightweight alternative when GATConv is too parameter-heavy for your labeled data budget.

PyTorch Geometric

TL;DR

  • AGNNConv computes attention from the cosine similarity between node features: alpha_ij = softmax_j(beta · cos(h_i, h_j)). No learned attention vectors or weight matrices in the propagation.
  • Only one learnable parameter per layer: the scalar beta, which scales the cosine similarities before the softmax. This makes it extremely parameter-efficient.
  • Designed for semi-supervised settings with few labels. Minimal parameters reduce overfitting compared to GATConv's heavier parameterization.
  • Can be stacked deeper than GATConv without overfitting, thanks to the minimal per-layer parameter count.

Original Paper

Attention-based Graph Neural Network for Semi-Supervised Learning

Thekumparampil et al. (2018). arXiv preprint

Read paper →

What AGNNConv does

AGNNConv performs attention-weighted propagation using cosine similarity:

  1. Compute cosine similarity between each node and its neighbors
  2. Scale the similarities by the learnable scalar beta, then normalize via softmax to get attention weights
  3. Propagate neighbor features weighted by attention

The math (simplified)

AGNNConv formula
# Cosine attention with learnable temperature beta
alpha_ij = softmax_j( beta · cos(h_i, h_j) )
         = softmax_j( beta · h_i^T · h_j / (||h_i|| · ||h_j||) )

# Attention-weighted propagation
h_i' = Σ_j alpha_ij · h_j

Where:
  cos(h_i, h_j) = cosine similarity (no learnable params)
  beta          = learnable scalar temperature (the ONLY parameter per layer);
                  larger beta sharpens the attention distribution
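The formula above can be sketched in plain PyTorch. This is an illustrative re-implementation for clarity, not PyG's internal message-passing code; the function name and the dense per-node softmax loop are for exposition only.

```python
import torch

def agnn_propagate(h, edge_index, beta):
    """One AGNNConv-style propagation step (illustrative sketch).

    h: [num_nodes, dim] node features
    edge_index: [2, num_edges], messages flow from row 0 (j) to row 1 (i)
    beta: scalar temperature
    """
    src, dst = edge_index
    h_norm = torch.nn.functional.normalize(h, p=2, dim=-1)
    # beta-scaled cos(h_i, h_j) for every edge
    logits = beta * (h_norm[dst] * h_norm[src]).sum(dim=-1)
    # softmax over each node's incoming edges
    alpha = torch.zeros_like(logits)
    for i in dst.unique():
        mask = dst == i
        alpha[mask] = torch.softmax(logits[mask], dim=0)
    # attention-weighted sum of neighbor features
    out = torch.zeros_like(h)
    out.index_add_(0, dst, alpha.unsqueeze(-1) * h[src])
    return out

# Toy usage: 3 nodes in a directed ring, plus one self-loop on node 0
h = torch.randn(3, 4)
edge_index = torch.tensor([[0, 1, 2, 0],
                           [1, 2, 0, 0]])
out = agnn_propagate(h, edge_index, beta=1.0)
```

Note that with beta = 0 the softmax is uniform and the step reduces to plain neighborhood averaging; beta lets the layer learn how much to trust feature similarity.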

Compare to GATConv:
  GATConv params per layer: W (d*d) + a (2d) = d^2 + 2d
  AGNNConv params per layer: beta (1 scalar)

AGNNConv has d^2 + 2d − 1 fewer parameters per layer than GATConv. On small labeled datasets, this difference matters significantly for generalization.
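A quick back-of-the-envelope count makes the gap concrete. The hidden dimension d = 64 and 4-layer depth below are illustrative choices (single-head GATConv, biases ignored):

```python
d = 64          # hidden dimension (illustrative)
num_layers = 4  # typical stack depth (illustrative)

gat_per_layer = d * d + 2 * d  # weight matrix W plus attention vector a
agnn_per_layer = 1             # the single scalar beta

gat_total = gat_per_layer * num_layers    # 16896 parameters
agnn_total = agnn_per_layer * num_layers  # 4 parameters
print(gat_total, agnn_total)
```

Four orders of magnitude fewer propagation parameters is what makes the difference on 20-100 labeled nodes.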

PyG implementation

agnn_model.py
import torch
import torch.nn.functional as F
from torch_geometric.nn import AGNNConv

class AGNN(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels,
                 num_layers=4):
        super().__init__()
        self.lin1 = torch.nn.Linear(in_channels, hidden_channels)
        self.convs = torch.nn.ModuleList(
            [AGNNConv() for _ in range(num_layers)]
        )
        self.lin2 = torch.nn.Linear(hidden_channels, out_channels)

    def forward(self, x, edge_index):
        x = F.relu(self.lin1(x))
        for conv in self.convs:
            x = conv(x, edge_index)
        return self.lin2(x)

# Note: AGNNConv takes no dimension arguments - it operates on
# whatever dimension the input has. The linear layers handle
# dimension changes.
model = AGNN(dataset.num_features, 64, dataset.num_classes,
             num_layers=4)

AGNNConv() takes no constructor arguments for dimensions. It operates on any input dimension. Use linear layers before/after for dimension changes.

When to use AGNNConv

  • Few labeled examples. With only 20-100 labeled nodes, AGNNConv's single-parameter-per-layer design avoids the overfitting risk that comes with GATConv's much larger parameter count.
  • Deeper attention models. Stack 4-8 AGNNConv layers without parameter explosion. Each layer adds just one scalar parameter.
  • When cosine similarity is a good attention proxy. If similar nodes should attend to each other (homophilic graphs), cosine similarity naturally produces good attention weights.

When not to use AGNNConv

  • When attention patterns are complex. Cosine similarity captures feature alignment only. For tasks where attention depends on complex feature interactions, GATConv or TransformerConv is more expressive.
  • Heterogeneous graphs. AGNNConv has no type-specific parameters. Use HGTConv for multi-type graphs.

Frequently asked questions

What is AGNNConv in PyTorch Geometric?

AGNNConv implements the Attention-based Graph Neural Network from Thekumparampil et al. (2018). It uses cosine similarity between node features to compute attention weights, with no learnable weight matrix in the propagation step. The only learnable parameter per layer is a scalar beta that scales the cosine similarities before the softmax, acting as a learned attention temperature.

How does AGNNConv compute attention?

AGNNConv computes attention weights by applying a softmax to beta-scaled cosine similarities between node feature vectors: alpha_ij = softmax_j(beta · cos(h_i, h_j)). The similarity itself is parameter-free: no learned attention vectors or weight matrices. The single learnable scalar beta controls how sharply the softmax concentrates attention on the most similar neighbors.
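The temperature role of beta is easy to see on made-up similarity values. A small sketch (the three cosine similarities are illustrative, not from any dataset):

```python
import torch

# Cosine similarities from one node to its three neighbors (illustrative)
sims = torch.tensor([0.9, 0.5, 0.1])

for beta in (0.0, 1.0, 10.0):
    print(beta, torch.softmax(beta * sims, dim=0))
# beta = 0 gives uniform attention; larger beta concentrates
# nearly all the weight on the most similar neighbor.
```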

How does AGNNConv differ from GATConv?

GATConv learns attention weights via a parameterized scoring function (concatenation + linear). AGNNConv uses parameter-free cosine similarity. AGNNConv has far fewer parameters per layer (just one scalar beta vs GATConv's attention vectors and weight matrices), making it less prone to overfitting on small datasets.

When should I use AGNNConv?

Use AGNNConv for semi-supervised node classification on homogeneous graphs where you have few labeled examples. Its minimal parameterization reduces overfitting risk. Also useful when you want attention-like behavior without the parameter overhead of GATConv.

Can AGNNConv be stacked deeply?

Yes. Because AGNNConv has very few parameters per layer (just beta), it can be stacked more deeply than GATConv without overfitting. The cosine attention adapts to feature similarity at each layer without adding parameters.

Learn more about graph ML

PyTorch Geometric is the open-source foundation for graph neural networks. Explore more layers, concepts, and production patterns.