Original Paper
Inductive Representation Learning on Large Graphs
Hamilton, Ying & Leskovec. NeurIPS 2017.
What SAGEConv does
SAGEConv (Sample and Aggregate) performs neighborhood aggregation with two key differences from GCNConv:
- Sampling: Instead of using all neighbors, it samples a fixed number (e.g., 25 neighbors at layer 1, 10 at layer 2). This bounds computation per node.
- Separate transforms: It applies separate weight matrices to the node's own features and to the aggregated neighbor features, then combines them. This gives it more expressiveness than GCNConv's single shared weight.
The result is a layer that scales to arbitrarily large graphs and generalizes to unseen nodes, the two capabilities production deployment demands and full-batch GCNConv lacks.
The math (simplified)
h_N(i) = AGGREGATE({ h_j for j in SAMPLE(N(i), k) })
h_i' = W_self · h_i + W_neigh · h_N(i)
Where:
SAMPLE(N(i), k) = sample k neighbors of node i
AGGREGATE = mean, max, or LSTM pooling
W_self = weight matrix for the node itself
W_neigh = weight matrix for aggregated neighbors
h_i' = updated node representation

The separate weight matrices (W_self and W_neigh) let the model learn how much to weight a node's own features vs. its neighborhood context.
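To make the update rule concrete, here is a minimal hand-computed sketch in plain Python; the 2-dimensional features and weight matrices are invented for illustration:

```python
# Toy 2-D features for node i and two sampled neighbors (invented values).
h_i = [1.0, 0.0]
neighbors = [[0.0, 2.0], [2.0, 0.0]]

# AGGREGATE = mean: element-wise average of the sampled neighbor features.
h_N = [sum(col) / len(neighbors) for col in zip(*neighbors)]  # [1.0, 1.0]

# Separate 2x2 weight matrices for self and neighborhood (invented values).
W_self = [[1.0, 0.0], [0.0, 1.0]]   # identity: keep own features as-is
W_neigh = [[0.5, 0.0], [0.0, 0.5]]  # scale neighborhood context by 0.5

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

# h_i' = W_self · h_i + W_neigh · h_N(i)
h_i_new = [a + b for a, b in zip(matvec(W_self, h_i), matvec(W_neigh, h_N))]
print(h_i_new)  # [1.5, 0.5]
```

With identity self-weights and halved neighbor-weights, the updated representation keeps the node's own signal intact and blends in a damped neighborhood average.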
PyG implementation
SAGEConv with mini-batch training using PyG's NeighborLoader:
import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv
from torch_geometric.loader import NeighborLoader
class GraphSAGE(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels):
        super().__init__()
        self.conv1 = SAGEConv(in_channels, hidden_channels)
        self.conv2 = SAGEConv(hidden_channels, out_channels)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.conv2(x, edge_index)
        return x
# Mini-batch training with neighbor sampling
loader = NeighborLoader(
    data,
    num_neighbors=[25, 10],  # 25 at hop 1, 10 at hop 2
    batch_size=1024,
    input_nodes=data.train_mask,
)
model = GraphSAGE(dataset.num_features, 256, dataset.num_classes)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
for batch in loader:
    optimizer.zero_grad()
    out = model(batch.x, batch.edge_index)
    loss = F.cross_entropy(out[:batch.batch_size], batch.y[:batch.batch_size])
    loss.backward()
    optimizer.step()

NeighborLoader handles the sampling. Each mini-batch contains only the sampled subgraph, keeping memory bounded regardless of full graph size.
When to use SAGEConv
- Large graphs (100K+ nodes). SAGEConv's sampling makes mini-batch training practical. GCNConv on a million-node graph requires the full adjacency matrix in memory. SAGEConv processes it in constant-size batches.
- New nodes at inference time. E-commerce catalogs, social networks, and content platforms add new entities constantly. SAGEConv computes embeddings for new nodes from their neighbors without retraining the model.
- Production systems with latency requirements. The fixed sampling budget makes inference time predictable. You can tune the number of sampled neighbors to trade accuracy for speed.
- Homogeneous graphs where you need more power than GCN. The separate weight matrices for self and neighbor features give SAGEConv more expressiveness while staying fast.
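The inductive property in the second bullet follows from the math above: a trained SAGE layer is just its two weight matrices, so a node added after training gets an embedding from its own features plus a sampled neighborhood, with no retraining. A pure-Python sketch (the frozen weights, features, and sample size are invented for illustration, not PyG API):

```python
import random

# Frozen weights from a hypothetical trained 2-D SAGE layer (invented values).
W_self = [[0.8, 0.1], [0.2, 0.9]]
W_neigh = [[0.3, 0.0], [0.0, 0.3]]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def sage_embed(x, neighbor_feats, k=2):
    """Embed a node from its own features and at most k sampled neighbors."""
    sample = random.sample(neighbor_feats, min(k, len(neighbor_feats)))
    h_N = [sum(col) / len(sample) for col in zip(*sample)]  # mean aggregation
    return [a + b for a, b in zip(matvec(W_self, x), matvec(W_neigh, h_N))]

# A node that did not exist at training time: just a forward pass.
new_node = [1.0, 1.0]
neighbor_feats = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]]
embedding = sage_embed(new_node, neighbor_feats)
print(len(embedding))  # lands in the same 2-D embedding space as trained nodes
```

The same mechanism also explains the latency bullet: `k` caps the work per node, so shrinking it trades aggregation fidelity for faster, more predictable inference.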
When not to use SAGEConv
1. When neighbor importance varies
SAGEConv (with mean aggregation) treats all sampled neighbors equally. In fraud detection, a transaction to a flagged merchant carries more signal than a routine purchase. For tasks where neighbor importance varies, GATConv or TransformerConv learn attention weights per edge.
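The contrast can be sketched in plain Python: mean aggregation gives every sampled neighbor weight 1/k, while attention-style aggregation (the idea behind GATConv, shown here with an invented stand-in for the learned scoring function) lets a high-signal neighbor dominate:

```python
import math

# Invented 1-D "features": how suspicious each neighboring transaction looks.
neighbor_scores = [0.1, 0.2, 5.0]  # two routine purchases, one flagged merchant

# Mean aggregation: every sampled neighbor gets the same weight, 1/k.
mean_agg = sum(neighbor_scores) / len(neighbor_scores)

# Attention-style aggregation: softmax over attention logits. Here the logits
# are just the scores themselves, standing in for a learned scoring function.
exps = [math.exp(s) for s in neighbor_scores]
weights = [e / sum(exps) for e in exps]
attn_agg = sum(w * s for w, s in zip(weights, neighbor_scores))

# Attention pulls the aggregate toward the flagged neighbor; the mean dilutes it.
print(round(mean_agg, 3), round(attn_agg, 3))
```

In a real GATConv the logits come from a learned function of both endpoint features per edge, but the effect is the same: informative neighbors receive larger weights instead of a uniform 1/k.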
2. Heterogeneous graphs
SAGEConv applies the same weight matrices to all node types and edge types. Enterprise relational databases have customers, orders, products, and merchants as different node types. Use RGCNConv, HGTConv, or HeteroConv for multi-type graphs.
3. Small graphs where full-batch training is feasible
If your graph fits in GPU memory (under ~50K nodes), the sampling introduces variance without a speed benefit. GCNConv or GATConv with full-batch training will give more stable gradients.
Comparison to alternatives
| Layer | Scalability | Inductive | Attention |
|---|---|---|---|
| GCNConv | Full-batch typical | Yes (parametric) | No |
| SAGEConv | Mini-batch via sampling | Yes | No |
| GATConv | Full-batch (can combine with sampling) | Yes | Yes |
| ClusterGCNConv | Mini-batch via clustering | Requires re-partitioning | No |
How KumoRFM builds on this
KumoRFM inherits SAGEConv's core design principles: sampling for scalability, inductive learning for new entities, and mini-batch training for bounded memory. But it extends them in three ways:
- Learned attention replaces mean aggregation, so the model automatically weights important neighbors higher
- Type-aware transformations handle heterogeneous relational databases with multiple node and edge types
- Temporal encodings capture when events happened, not just the graph structure, which is critical for time-sensitive predictions like churn and fraud
The result: you write one line of PQL and get a production-grade graph model that combines SAGEConv's scalability with transformer-level expressiveness, no PyG code required.