The decision framework
Answer four questions to narrow from 66 layers to 2-3 candidates:
Question 1: Is your graph homogeneous or heterogeneous?
- Homogeneous (one node type, one edge type): Citation networks, social graphs, co-purchase graphs. Use GCNConv, SAGEConv, GATConv, or GINConv.
- Heterogeneous (multiple types): Relational databases, knowledge graphs, enterprise data. Use HeteroConv, RGCNConv, HGTConv, or HANConv. This is the most common case in production.
Question 2: What is the task?
- Node classification (predict node labels): GCNConv/SAGEConv baseline, GATConv if neighbor importance varies.
- Link prediction (predict missing edges): SAGEConv with a link-level scoring head. Avoid spectral layers (they struggle with inductive link prediction).
- Graph classification (predict graph-level properties): GINConv (provably as expressive as the 1-WL isomorphism test), PNAConv (multiple aggregators), GPSConv (long-range interactions).
Question 3: How large is your graph?
- Small (< 100K nodes): Any layer works. Full-batch training is feasible. Try GCNConv, GATConv, and TransformerConv; pick the one with the best validation accuracy.
- Medium (100K - 10M nodes): Use sampling-friendly layers (SAGEConv, GATConv). NeighborLoader with 2-3 layers. ClusterGCNConv for faster training at some accuracy cost.
- Large (> 10M nodes): SAGEConv is the default. Minimize layers (2 max) and fanout. Consider LGConv (LightGCN) for recommendations where simplicity helps.
Question 4: What are your production constraints?
- Latency (< 50ms inference): GCNConv or SAGEConv. Avoid attention layers (2-3x slower). Precompute embeddings for sub-5ms serving.
- Explainability required: GATConv (attention weights provide partial explanations) or any layer with GNNExplainer. Avoid deep (4+ layer) models.
- Minimal compute budget: SGConv (simplified GCN, single matrix multiply) or APPNP (propagation without parameters per layer). 5-10x cheaper than standard GNNs.
Common architecture patterns
Fraud detection
Heterogeneous graph (accounts, transactions, merchants). Use HeteroConv wrapping GATConv per edge type, 2-3 layers. GATConv attention weights identify suspicious connections. Add temporal features for recency-aware detection.
Recommendation systems
Bipartite graph (users, items). Use LGConv (LightGCN) for collaborative filtering at scale. 2 layers. For cold-start, add content features with SAGEConv on a heterogeneous graph.
Drug discovery
Molecular graphs. Use GINConv (as expressive as the 1-WL isomorphism test, the maximum for message-passing GNNs) or PNAConv (multiple aggregators). 4-5 layers (molecules are small, so over-smoothing is less of an issue). Graph-level pooling for property prediction.
Knowledge graph completion
Multi-relational graph. Use RGCNConv or RGATConv with per-relation transformations. 2 layers. DistMult or RotatE scoring head for link prediction.
What breaks in production
- Benchmark overfitting: A layer that wins on Cora (2,708 nodes) may not win on your 50M-node production graph. Always validate on a representative subset of your production data.
- Ignoring training cost: GPSConv achieves SOTA on benchmarks but takes 10x longer to train than SAGEConv. If you retrain daily, training cost dominates accuracy differences.
- Architecture lock-in: Once deployed, changing architectures requires retraining, revalidation, and re-deployment. Start with a simple architecture and only upgrade when you have clear evidence it is the bottleneck.