Original Paper
Recipe for a General, Powerful, Scalable Graph Transformer
Rampasek et al. (2022). NeurIPS 2022
What GPSConv does
GPSConv is a modular block with four components:
- Positional/structural encodings: Add information about each node's position in the graph (random walk, Laplacian eigenvectors) before the first layer.
- Local MPNN: A message-passing layer (GINConv, PNAConv, etc.) that aggregates neighbor features within the graph structure.
- Global attention: Multi-head self-attention across all nodes, like a standard transformer. Captures dependencies beyond the local neighborhood.
- Feedforward network: An MLP applied per node after combining local and global representations.
Stack multiple GPS blocks (typically 5-10) with residual connections and layer normalization. The result is a deep graph transformer.
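To make the data flow concrete, here is a minimal NumPy sketch of a single GPS block update. This is an illustrative toy, not the PyG implementation: it uses mean-neighbor aggregation as the "local MPNN", single-head dot-product attention as the "global attention", and a two-layer ReLU MLP as the feedforward network, with a simple residual sum in place of the paper's normalization layers. All weight names (`w_local`, `w_q`, etc.) are hypothetical.

```python
import numpy as np

def gps_block(x, adj, w_local, w_q, w_k, w_v, w_ffn1, w_ffn2):
    # Local MPNN (toy): mean-aggregate neighbor features, then a linear map
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)
    x_local = (adj @ x / deg) @ w_local

    # Global attention (toy, single head): every node attends to every node
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[1])
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)          # row-wise softmax
    x_global = attn @ v

    # Combine local and global views with a residual, then a per-node FFN
    h = x + x_local + x_global
    return h + np.maximum(h @ w_ffn1, 0) @ w_ffn2

rng = np.random.default_rng(0)
n, d = 6, 8
x = rng.normal(size=(n, d))                          # node features
adj = (rng.random((n, n)) < 0.3).astype(float)       # random adjacency
w = [rng.normal(size=(d, d)) * 0.1 for _ in range(6)]
out = gps_block(x, adj, *w)
print(out.shape)  # (6, 8)
```

Stacking this block `num_layers` times, with proper normalization and learned weights, is what a deep GPS model does.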
PyG implementation
```python
import torch
from torch_geometric.nn import GPSConv, GINConv, global_add_pool


class GPS(torch.nn.Module):
    def __init__(self, in_channels, hidden, out_channels, num_layers=5):
        super().__init__()
        self.node_emb = torch.nn.Linear(in_channels, hidden)
        self.convs = torch.nn.ModuleList()
        for _ in range(num_layers):
            # Local layer: GINConv with a 2-layer MLP
            local_nn = torch.nn.Sequential(
                torch.nn.Linear(hidden, hidden),
                torch.nn.ReLU(),
                torch.nn.Linear(hidden, hidden),
            )
            local_layer = GINConv(local_nn)
            # GPSConv wraps the local layer and adds global attention
            self.convs.append(GPSConv(
                channels=hidden,
                conv=local_layer,
                heads=4,
                attn_dropout=0.5,
            ))
        self.classifier = torch.nn.Linear(hidden, out_channels)

    def forward(self, x, edge_index, batch):
        x = self.node_emb(x)
        for conv in self.convs:
            x = conv(x, edge_index, batch=batch)
        x = global_add_pool(x, batch)
        return self.classifier(x)


model = GPS(in_channels=9, hidden=64, out_channels=1, num_layers=5)
```

GPSConv wraps a local layer (GINConv here) and adds global multi-head attention. The `batch` vector is needed so that global attention operates within each graph in the batch rather than attending across graph boundaries.
When to use GPSConv
- Long-range dependency tasks. When distant nodes influence each other (molecular properties, protein function prediction), global attention captures dependencies that limited-hop message passing misses.
- Long-range graph benchmarks. Peptides-func and Peptides-struct specifically test long-range reasoning. GPS outperforms local-only models here.
- Small to medium graphs. The O(N^2) global attention is practical for graphs with up to ~10K nodes per graph (molecular, protein datasets).
- Research on graph transformers. GPS provides a modular framework to test different local layers, attention types, and positional encodings.
When not to use GPSConv
- Large graphs. Global attention is O(N^2) in memory and compute. For graphs with 100K+ nodes, use local-only layers (TransformerConv, GATConv) or switch to linear attention.
- When local structure is sufficient. On most node classification tasks (Cora, CiteSeer), 2-3 hops of local context capture all needed information. Global attention adds cost without benefit.
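The O(N^2) limit is easy to quantify with back-of-envelope arithmetic. The sketch below (illustrative assumptions: fp32 scores, 4 heads, one layer, no activation overhead) estimates the memory needed just to materialize the dense attention matrix:

```python
def attn_matrix_bytes(num_nodes, num_heads=4, bytes_per_elem=4):
    # One dense N x N score matrix per head, fp32
    return num_nodes ** 2 * num_heads * bytes_per_elem

for n in (10_000, 100_000):
    gib = attn_matrix_bytes(n) / 2**30
    print(f"N={n:>7,}: {gib:,.1f} GiB")
# N= 10,000:   1.5 GiB  -- feasible on one GPU
# N=100,000: 149.0 GiB  -- infeasible without linear attention
```

This is why ~10K nodes per graph is a practical ceiling for full attention, and why 100K+ node graphs call for local-only layers or linear-attention variants.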