The decision framework
Answer four questions to narrow from 66 layers to 2-3 candidates:
Question 1: Is your graph homogeneous or heterogeneous?
- Homogeneous (one node type, one edge type): Citation networks, social graphs, co-purchase graphs. Use GCNConv, SAGEConv, GATConv, or GINConv.
- Heterogeneous (multiple types): Relational databases, knowledge graphs, enterprise data. Use HeteroConv, RGCNConv, HGTConv, or HANConv. This is the most common case in production.
Question 2: What is the task?
- Node classification (predict node labels): GCNConv/SAGEConv baseline, GATConv if neighbor importance varies.
- Link prediction (predict missing edges): SAGEConv with a link-level scoring head. Avoid spectral layers (they struggle with inductive link prediction).
- Graph classification (predict graph-level properties): GINConv (provably as expressive as the 1-WL isomorphism test), PNAConv (multiple aggregators), GPSConv (long-range interactions).
Question 3: How large is your graph?
- Small (< 100K nodes): Any layer works. Full-batch training is feasible. Try GCNConv, GATConv, and TransformerConv; pick the one with the best validation accuracy.
- Medium (100K - 10M nodes): Use sampling-friendly layers (SAGEConv, GATConv). NeighborLoader with 2-3 layers. ClusterGCNConv for faster training at some accuracy cost.
- Large (> 10M nodes): SAGEConv is the default. Minimize layers (2 max) and fanout. Consider LGConv (LightGCN) for recommendations where simplicity helps.
Question 4: What are your production constraints?
- Latency (< 50ms inference): GCNConv or SAGEConv. Avoid attention layers (2-3x slower). Precompute embeddings for sub-5ms serving.
- Explainability required: GATConv (attention weights provide partial explanations) or any layer with GNNExplainer. Avoid deep (4+ layer) models.
- Minimal compute budget: SGConv (simplified GCN, single matrix multiply) or APPNP (propagation without parameters per layer). 5-10x cheaper than standard GNNs.
Common architecture patterns
Fraud detection
Heterogeneous graph (accounts, transactions, merchants). Use HeteroConv wrapping GATConv per edge type, 2-3 layers. GATConv attention weights identify suspicious connections. Add temporal features for recency-aware detection.
Recommendation systems
Bipartite graph (users, items). Use LGConv (LightGCN) for collaborative filtering at scale. 2 layers. For cold-start, add content features with SAGEConv on a heterogeneous graph.
Drug discovery
Molecular graphs. Use GINConv (as expressive as the 1-WL isomorphism test, the maximum for message-passing GNNs) or PNAConv (multiple aggregators). 4-5 layers (molecules are small, so over-smoothing is less of an issue). Graph-level pooling for property prediction.
Knowledge graph completion
Multi-relational graph. Use RGCNConv or RGATConv with per-relation transformations. 2 layers. DistMult or RotatE scoring head for link prediction.
What breaks in production
- Benchmark overfitting: A layer that wins on Cora (2,708 nodes) may not win on your 50M-node production graph. Always validate on a representative subset of your production data.
- Ignoring training cost: GPSConv achieves SOTA on benchmarks but takes 10x longer to train than SAGEConv. If you retrain daily, training cost dominates accuracy differences.
- Architecture lock-in: Once deployed, changing architectures requires retraining, revalidation, and re-deployment. Start with a simple architecture and only upgrade when you have clear evidence it is the bottleneck.