Graph neural networks have moved from academic papers to production systems at Pinterest, DoorDash, Visa, and Google. But most explanations are either too theoretical (spectral graph convolutions) or too superficial ("like a neural network but for graphs"). These 15 questions cover what practitioners actually need to know.
1. What is a graph neural network?
A GNN is a neural network that operates on graph-structured data: nodes (entities) connected by edges (relationships). Unlike traditional neural networks that process flat vectors or regular grids, GNNs process arbitrary connection patterns.
Consider a customer connected to 47 orders; each order connects to products, and each product connects to categories and to other customers. That is a graph. A GNN learns from that entire connected structure, not just the customer's own attributes.
2. How does a GNN differ from a CNN or RNN?
CNNs assume a regular grid structure. Every pixel has the same number of neighbors in the same positions. This works for images but fails for data where entities have varying numbers of connections in arbitrary configurations.
RNNs assume a sequential structure. Each element has one predecessor and one successor. This works for text and time series but fails for data where entities relate to many other entities simultaneously.
GNNs handle the general case: any entity can connect to any number of other entities in any pattern. This makes GNNs the natural architecture for relational databases, social networks, transaction graphs, molecular structures, and supply chains.
3. What is message passing?
Message passing is the core operation. In each GNN layer:
- Every node collects representations from its neighbors
- These representations are aggregated (summed, averaged, or attention-weighted)
- The node updates its own representation based on the aggregation and its previous state
After one layer, each node knows about its direct neighbors. After two layers, it knows about neighbors of neighbors. After three layers, it encodes information from the entire 3-hop neighborhood. For a customer node in an e-commerce graph, 3 hops covers: orders, products, other customers who bought those products, and their behavior patterns.
Message passing example: customer C-201
| Hop | Nodes Reached | Information Gained | Example Signal |
|---|---|---|---|
| 0 (self) | C-201 | Own attributes | credit_limit=$15K, account_age=4yr |
| 1 (neighbors) | 3 orders, 1 support ticket | Direct interactions | avg_order=$142, open_ticket=yes |
| 2 (2-hop) | 7 products, 2 agents | What they bought, who helped | high-return products, low-CSAT agent |
| 3 (3-hop) | 45 other customers | Similar customers' behavior | 34 of 45 similar customers churned |
By hop 3, the GNN knows that 34 of 45 similar customers (roughly 75%) who bought the same products and had similar support experiences have churned. No flat feature table captures this 3-hop signal.
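The hop-by-hop propagation above can be sketched in a few lines. This is a minimal illustration, not a real GNN layer: node features are single floats, the graph is a tiny made-up fragment of the C-201 example, and the update rule is a fixed 50/50 mix rather than a learned transformation.

```python
# Minimal sketch of message passing on a toy graph.
# Aggregation = mean over neighbors; update = mix of self and aggregate.
# Graph structure and feature values are illustrative only.

graph = {                     # adjacency list: node -> neighbors
    "C-201": ["O-1", "O-2"],
    "O-1": ["C-201", "P-9"],
    "O-2": ["C-201", "P-9"],
    "P-9": ["O-1", "O-2"],
}
features = {"C-201": 1.0, "O-1": 0.5, "O-2": 0.3, "P-9": -0.2}

def message_passing_step(graph, features):
    """One layer: each node averages its neighbors' features,
    then mixes the aggregate with its own previous state."""
    updated = {}
    for node, neighbors in graph.items():
        agg = sum(features[n] for n in neighbors) / len(neighbors)
        updated[node] = 0.5 * features[node] + 0.5 * agg
    return updated

h1 = message_passing_step(graph, features)   # each node now sees 1 hop
h2 = message_passing_step(graph, h1)         # each node now sees 2 hops
```

After one step, C-201's representation already mixes in its orders' features; after two, it indirectly reflects the product node P-9 it never touches directly.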
4. What types of GNNs exist?
The major architectures, in order of publication:
- GCN (2017): Applies spectral convolutions on graphs. Simple and effective but assumes a fixed graph structure.
- GraphSAGE (2017): Samples and aggregates neighbor features. Scales to large graphs because it does not require the full adjacency matrix.
- GAT (2018): Uses attention mechanisms to learn which neighbors are most informative. Different neighbors get different weights.
- GIN (2019): Provably as powerful as the Weisfeiler-Leman graph isomorphism test. Maximally expressive among message-passing GNNs.
- Graph Transformers (2020+): Combine local message passing with global self-attention. Best results on most current benchmarks. KumoRFM uses a graph transformer architecture.
GNN architectures compared
| Architecture | Year | Aggregation Method | Scalability | Best For |
|---|---|---|---|---|
| GCN | 2017 | Spectral convolution | Moderate (full graph) | Small homogeneous graphs |
| GraphSAGE | 2017 | Sampled neighbor mean/pool | High (mini-batch) | Large-scale production |
| GAT | 2018 | Attention-weighted | Moderate | Heterogeneous neighbor importance |
| GIN | 2019 | Sum (injective) | Moderate | Maximum expressiveness |
| Graph Transformer | 2020+ | Local + global attention | High (with sampling) | Multi-table relational data |
Graph transformers combine the best of GNNs (local structure) and transformers (global attention). KumoRFM uses this architecture for relational databases.
5. Where are GNNs used in production?
Published production deployments at scale:
- Pinterest: PinSage serves recommendations to 450 million monthly active users over a graph of roughly 3 billion nodes (pins and boards) and 18 billion edges
- DoorDash: Heterogeneous graph of customers, restaurants, and items. 1.8% engagement lift across 30 million users.
- Visa: Fraud detection across billions of transactions, identifying fraud rings that transaction-level models miss
- Google Maps: Traffic prediction using road network graphs with real-time sensor data as node features
- Snap: Friend suggestions based on the social graph structure and interaction patterns
6. GNNs vs. transformers
Standard transformers compute self-attention between all pairs of tokens. For a sequence of length n, this is O(n^2). For a graph of 1 million nodes, full attention is computationally infeasible.
GNNs compute attention only between connected nodes, making it O(E) where E is the number of edges. This is far more efficient for sparse graphs (most real-world graphs are sparse).
Graph transformers combine both: local message passing for efficiency, plus global attention mechanisms for long-range dependencies. On RelBench, graph transformers outperform both pure GNNs and pure transformers on relational prediction tasks.
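The O(n^2) vs. O(E) gap is worth making concrete. A back-of-envelope comparison, with an assumed average degree of 20 (typical for sparse real-world graphs; the exact figure varies by domain):

```python
# Back-of-envelope: full self-attention vs. edge-restricted attention.
# avg_degree = 20 is an illustrative assumption for a sparse graph.

n_nodes = 1_000_000
avg_degree = 20
n_edges = n_nodes * avg_degree

full_attention_pairs = n_nodes ** 2   # O(n^2): score every pair of nodes
sparse_attention_pairs = n_edges      # O(E): score only connected pairs

reduction = full_attention_pairs / sparse_attention_pairs
# 10^12 pairs vs. 2 * 10^7 pairs: a 50,000x reduction for this graph
```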
7. Can GNNs handle temporal data?
Yes. Temporal GNNs add timestamps to edges and nodes, allowing the model to distinguish between recent and historical interactions. This enables learning patterns like: recency effects (recent orders predict churn better than old orders), frequency changes (accelerating vs. decelerating activity), and seasonal patterns (holiday purchasing behavior).
On RelBench, temporal encoding improves AUROC by 2 to 5 points on tasks where recency matters (churn, next-purchase prediction).
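One common way to encode recency, sketched below with an exponential decay on edge age. The half-life and order data are illustrative assumptions, not a specific production scheme; real temporal GNNs typically learn time encodings rather than fixing them.

```python
import math

# Sketch of recency weighting on temporal edges: older interactions
# contribute exponentially less to the aggregation.
# HALF_LIFE_DAYS and the order list are made-up illustrative values.

HALF_LIFE_DAYS = 30.0

def recency_weight(age_days):
    """Weight halves every HALF_LIFE_DAYS."""
    return 0.5 ** (age_days / HALF_LIFE_DAYS)

orders = [(120.0, 3), (90.0, 45), (200.0, 400)]  # (order value, days ago)

weighted_sum = sum(v * recency_weight(age) for v, age in orders)
total_weight = sum(recency_weight(age) for _, age in orders)
recency_weighted_avg = weighted_sum / total_weight
plain_avg = sum(v for v, _ in orders) / len(orders)
```

The year-old $200 order barely moves the recency-weighted average, while it dominates the plain average; that difference is exactly the recency signal described above.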
8. What are the scaling challenges?
Three challenges, all with proven solutions:
- Neighborhood explosion: With 3 layers and an average of 50 neighbors per node, each prediction touches 50^3 = 125,000 nodes. Solution: neighbor sampling, where each layer samples a fixed number of neighbors (typically 10-25).
- Memory: A graph with 100 million nodes and 1 billion edges does not fit in GPU memory. Solution: mini-batch training with graph partitioning (Cluster-GCN, GraphSAINT) or distributed training across multiple GPUs.
- Inference latency: Real-time predictions require fast neighbor lookups. Solution: pre-computed neighbor indices, embedding caching, and hardware-optimized graph databases.
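The neighbor-sampling fix for neighborhood explosion can be sketched directly. This is a GraphSAGE-style fixed-fanout sampler over a toy adjacency list; the graph and fanouts are illustrative.

```python
import random

# Sketch of fixed-fanout neighbor sampling to contain neighborhood
# explosion. With 3 layers at 50 neighbors each, a prediction touches
# 50**3 = 125,000 nodes; with fanouts (15, 10, 10) the ceiling is
# 15 * 10 * 10 = 1,500. Graph below is a toy hub-and-spoke example.

def sample_neighbors(adj, node, fanout, rng):
    """Keep at most `fanout` randomly chosen neighbors of `node`."""
    neighbors = adj.get(node, [])
    if len(neighbors) <= fanout:
        return list(neighbors)
    return rng.sample(neighbors, fanout)

adj = {"a": ["b", "c", "d", "e", "f"], "b": ["a"], "c": ["a"],
       "d": ["a"], "e": ["a"], "f": ["a"]}

rng = random.Random(0)
sampled = sample_neighbors(adj, "a", 2, rng)  # 2 of a's 5 neighbors kept
```

Production systems apply this per layer, so the touched-node count is bounded by the product of the fanouts rather than the product of the true degrees.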
9. How much data does a GNN need?
A GNN needs a graph structure (which you derive from your relational database) and labeled examples for the prediction task. The graph structure itself acts as a regularizer, reducing the number of labels needed compared to flat models.
In practice, 10,000 to 100,000 labeled nodes are sufficient for supervised GNN training. For foundation model approaches, zero labels are needed: the model uses pre-trained knowledge to make zero-shot predictions.
10. What is a heterogeneous graph?
A graph with multiple types of nodes and edges. An e-commerce database produces a heterogeneous graph with customer nodes, product nodes, order nodes, and category nodes, connected by "purchased," "contains," "belongs_to," and "viewed" edge types.
Heterogeneous GNNs learn separate transformation functions for each node and edge type, then combine them during message passing. This is critical for relational databases where each table has different attributes and semantics.
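The per-edge-type transformation can be illustrated in miniature. Here each relation gets its own scalar "projection" before aggregation; the relation names and weights are made up, and a real heterogeneous GNN would learn a weight matrix per type rather than a scalar.

```python
# Sketch of type-specific message transformation in a heterogeneous GNN.
# Relation weights are illustrative stand-ins for learned per-type
# weight matrices; features are scalars for readability.

relation_weights = {
    "purchased": 0.8,
    "viewed": 0.2,
    "belongs_to": 0.5,
}

# (neighbor feature, edge type connecting it to the target node)
neighbors = [(1.0, "purchased"), (3.0, "viewed"), (2.0, "belongs_to")]

messages = [relation_weights[rel] * feat for feat, rel in neighbors]
agg = sum(messages) / len(messages)
# a "purchased" edge contributes far more per unit of feature
# than a "viewed" edge, because its transformation differs
```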
11. How do GNNs solve cold-start problems?
A new user with zero purchase history has no features for a traditional model. But they have connections: the product they first browsed, the marketing channel they came from, the referrer who sent them. Through message passing, a GNN propagates signal from these connected entities to the new user, generating a meaningful embedding even with no direct history.
This is one of the highest-value applications of GNNs. Cold-start users are often the highest-value segment for recommendations and the hardest to serve with traditional models.
12. What is over-smoothing?
When a GNN has too many layers, all node representations converge toward a global average, losing the distinguishing information that makes individual predictions useful. With 10+ layers on most graphs, node embeddings become nearly identical.
Practical solutions: limit depth to 2-4 message passing layers (which covers 2-4 hops of relational signal), add skip connections (residual GNNs), or use jumping knowledge networks that combine representations from all layers. Graph transformers partially avoid this by adding global attention that does not depend on local message passing depth.
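Over-smoothing is easy to demonstrate numerically: pure neighbor averaging on a connected graph contracts all embeddings toward a common value. A toy 4-cycle with scalar features, using averaging with a self-loop and no learned transform:

```python
# Demonstration of over-smoothing: repeated neighbor averaging on a
# connected graph (a 4-cycle here) shrinks the spread of node
# embeddings every layer, making nodes indistinguishable.

adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
feats = {0: 1.0, 1: 0.0, 2: -1.0, 3: 0.0}

def smooth(adj, feats):
    """Average each node with its neighbors (self-loop included)."""
    return {
        n: (feats[n] + sum(feats[m] for m in adj[n])) / (len(adj[n]) + 1)
        for n in adj
    }

def spread(feats):
    return max(feats.values()) - min(feats.values())

h = feats
spreads = []
for _ in range(10):
    h = smooth(adj, h)
    spreads.append(spread(h))
# spreads decreases monotonically; after 10 "layers" the nodes
# are nearly identical
```

Skip connections and jumping knowledge counteract exactly this contraction by re-injecting each node's earlier, still-distinct representations.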
13. Benchmark performance
Current results on major benchmarks:
- RelBench (relational data, 30 tasks): supervised GNN 75.83 AUROC, KumoRFM zero-shot 76.71, KumoRFM fine-tuned 81.14
- OGB-MolHIV (molecular property prediction): 0.80+ AUROC with graph transformer variants
- OGB-Citation2 (academic citation link prediction): 0.87+ MRR with GraphSAGE and SEAL
- QM9 (quantum chemistry): GNNs achieve chemical accuracy on 11 of 12 molecular properties
14. Do I need to build my own GNN?
For most enterprise applications, no. Building a custom GNN requires specialized expertise (message passing schemes, neighborhood sampling, temporal encoding, graph construction) and 3 to 6 months of engineering. Relational foundation models encapsulate GNN architectures behind a simple query interface.
Build your own GNN only if: you have a unique graph structure that foundation models have not seen, you need full architectural control for competitive advantage, or your team has existing GNN expertise and a single high-stakes use case.
15. What is the future of GNNs?
Three trends are converging. First, foundation models pre-train GNN architectures on massive graph corpora, enabling zero-shot and few-shot predictions on new graphs. KumoRFM already demonstrates this for relational data.
Second, GNN + LLM integration combines textual reasoning (what does this product description mean?) with graph reasoning (what do purchase patterns around this product look like?).
Third, hardware-optimized GNN inference is pushing toward sub-millisecond latency for real-time applications. The technology is transitioning from research tool to commodity infrastructure, much like CNNs did between 2015 and 2020.