
Choosing the Right GNN Architecture for Your Task

PyG has 66 GNN layers. You need one. This guide provides a decision framework based on your graph type, task, scale, and production constraints. No architecture is universally best.


TL;DR

  • Start simple: GCNConv for homogeneous graphs, SAGEConv for large graphs, HeteroConv for heterogeneous graphs. Upgrade only when the baseline falls short.
  • Use 2-3 layers. Over-smoothing degrades performance beyond 3-4 layers on most graphs. Deeper is not better for GNNs.
  • Match the layer to the graph property that matters most: attention (GATConv) for variable-importance neighbors, heterogeneous layers (HGTConv) for multi-type graphs, scalable layers (SAGEConv) for large graphs.
  • Production constraints (latency, memory, explainability) often matter more than benchmark accuracy. A faster model refreshed hourly beats a slower model refreshed daily.

The decision framework

Answer four questions to narrow from 66 layers to 2-3 candidates:

Question 1: Is your graph homogeneous or heterogeneous?

  • Homogeneous (one node type, one edge type): Citation networks, social graphs, co-purchase graphs. Use GCNConv, SAGEConv, GATConv, or GINConv.
  • Heterogeneous (multiple types): Relational databases, knowledge graphs, enterprise data. Use HeteroConv, RGCNConv, HGTConv, or HANConv. This is the most common case in production.

Question 2: What is the task?

  • Node classification (predict node labels): GCNConv/SAGEConv baseline, GATConv if neighbor importance varies.
  • Link prediction (predict missing edges): SAGEConv with a link-level scoring head. Avoid spectral layers (they struggle with inductive link prediction).
  • Graph classification (predict graph-level properties): GINConv (provably most expressive for graph isomorphism), PNAConv (multiple aggregators), GPSConv (long-range).

Question 3: How large is your graph?

  • Small (< 100K nodes): Any layer works. Full-batch training is feasible. Try GCNConv, GATConv, and TransformerConv; pick the one with the best validation accuracy.
  • Medium (100K - 10M nodes): Use sampling-friendly layers (SAGEConv, GATConv). NeighborLoader with 2-3 layers. ClusterGCNConv for faster training at some accuracy cost.
  • Large (> 10M nodes): SAGEConv is the default. Minimize layers (2 max) and fanout. Consider LGConv (LightGCN) for recommendations where simplicity helps.

Question 4: What are your production constraints?

  • Latency (< 50ms inference): GCNConv or SAGEConv. Avoid attention layers (2-3x slower). Precompute embeddings for sub-5ms serving.
  • Explainability required: GATConv (attention weights provide partial explanations) or any layer with GNNExplainer. Avoid deep (4+ layer) models.
  • Minimal compute budget: SGConv (simplified GCN, single matrix multiply) or APPNP (propagation without parameters per layer). 5-10x cheaper than standard GNNs.

Common architecture patterns

Fraud detection

Heterogeneous graph (accounts, transactions, merchants). Use HeteroConv wrapping GATConv per edge type, 2-3 layers. GATConv attention weights identify suspicious connections. Add temporal features for recency-aware detection.

Recommendation systems

Bipartite graph (users, items). Use LGConv (LightGCN) for collaborative filtering at scale. 2 layers. For cold-start, add content features with SAGEConv on a heterogeneous graph.

Drug discovery

Molecular graphs. Use GINConv (maximally expressive for graph isomorphism) or PNAConv (multiple aggregators). 4-5 layers (molecules are small, over-smoothing is less of an issue). Graph-level pooling for property prediction.

Knowledge graph completion

Multi-relational graph. Use RGCNConv or RGATConv with per-relation transformations. 2 layers. DistMult or RotatE scoring head for link prediction.

What breaks in production

  • Benchmark overfitting: A layer that wins on Cora (2,708 nodes) may not win on your 50M-node production graph. Always validate on a representative subset of your production data.
  • Ignoring training cost: GPSConv achieves SOTA on benchmarks but takes 10x longer to train than SAGEConv. If you retrain daily, training cost dominates accuracy differences.
  • Architecture lock-in: Once deployed, changing architectures requires retraining, revalidation, and re-deployment. Start with a simple architecture and only upgrade when you have clear evidence it is the bottleneck.

Frequently asked questions

Which GNN layer should I start with?

Start with GCNConv for homogeneous undirected graphs, SAGEConv for large graphs requiring sampling, and HeteroConv wrapping SAGEConv for heterogeneous graphs. These are the strongest baselines. Only switch to more complex layers (GAT, HGT, GPS) if the baseline underperforms on your validation set.

How many GNN layers should I use?

2-3 layers for most tasks. Each layer extends the receptive field by one hop. 2 layers = 2-hop neighborhood. Beyond 3 layers, over-smoothing degrades performance on most graphs. Use 3 layers only if your task requires longer-range dependencies (e.g., supply chain prediction).

When should I use a Graph Transformer instead of message-passing GNNs?

Use Graph Transformers (GPSConv, TransformerConv) when: (1) your graph has heterogeneous node/edge types, (2) you need to capture long-range dependencies beyond 3 hops, or (3) you have a large compute budget. For simple homogeneous graphs with local tasks, message-passing GNNs are faster and equally accurate.

Learn more about graph ML

PyTorch Geometric is the open-source foundation for graph neural networks. Explore more layers, concepts, and production patterns.