
SAGEConv: The Layer That Made GNNs Production-Ready

GraphSAGE solved the two biggest problems blocking GNNs in production: scalability and inductive learning. SAGEConv is how Pinterest recommends pins, DoorDash predicts delivery times, and Uber catches fraud. Here is how it works and when to use it.

PyTorch Geometric

TL;DR

  • SAGEConv samples a fixed number of neighbors instead of using all of them. This bounds computation per node regardless of graph size, making it practical for billion-node graphs.
  • It is inductive: it learns a generalizable aggregation function rather than memorizing node embeddings. New nodes get embeddings at inference time without retraining.
  • Use SAGEConv when your graph is large (100K+ nodes), when new nodes appear at inference time, or when you need mini-batch training. It is the default choice for production GNN systems.
  • Pinterest (PinSage), DoorDash, and Uber all use GraphSAGE variants in production. The sampling-based design enables real-time inference at scale.
  • KumoRFM extends the SAGEConv paradigm with attention, heterogeneous type handling, and temporal awareness, delivering production-grade graph learning without any PyG code.

Original Paper

Inductive Representation Learning on Large Graphs

Hamilton, Ying & Leskovec (2017). NeurIPS 2017


What SAGEConv does

SAGEConv (Sample and Aggregate) performs neighborhood aggregation with two key differences from GCNConv:

  1. Sampling: Instead of using all neighbors, it samples a fixed number (e.g., 25 neighbors at layer 1, 10 at layer 2). This bounds computation per node.
  2. Separate transforms: It applies separate weight matrices to the node's own features and to the aggregated neighbor features, then combines them. This gives it more expressiveness than GCNConv's single shared weight.

The result is a layer that scales to any graph size and generalizes to unseen nodes: the two properties production deployment demands and GCNConv lacks.

The math (simplified)

SAGEConv formula
h_N(i) = AGGREGATE({ h_j for j in SAMPLE(N(i), k) })
h_i'   = W_self · h_i + W_neigh · h_N(i)

Where:
  SAMPLE(N(i), k) = sample k neighbors of node i
  AGGREGATE       = mean, max, or LSTM pooling
  W_self          = weight matrix for the node itself
  W_neigh         = weight matrix for aggregated neighbors
  h_i'            = updated node representation

The separate weight matrices (W_self and W_neigh) let the model learn how much to weight a node's own features vs its neighborhood context.
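To make the update concrete, here is a toy sketch in plain Python with scalar features, and scalar weights standing in for W_self and W_neigh (all values hypothetical):

```python
import random

def sage_update(node, h, adj, w_self, w_neigh, k, rng):
    """One SAGEConv-style update with scalar features: sample up to k
    neighbors, mean-aggregate them, then combine self and neighborhood
    through two separate weights (stand-ins for W_self and W_neigh)."""
    nbrs = adj[node]
    sampled = nbrs if len(nbrs) <= k else rng.sample(nbrs, k)
    h_neigh = sum(h[j] for j in sampled) / len(sampled)  # mean AGGREGATE
    return w_self * h[node] + w_neigh * h_neigh

adj = {0: [1, 2, 3]}                    # node 0 with three neighbors
h = {0: 1.0, 1: 2.0, 2: 4.0, 3: 6.0}    # scalar node features
rng = random.Random(0)

# k=3 keeps all neighbors: mean = 4.0, so h' = 0.5*1.0 + 0.25*4.0 = 1.5
print(sage_update(0, h, adj, 0.5, 0.25, k=3, rng=rng))  # 1.5
```

With k smaller than the neighbor count, the same function computes the update from a random subset instead, which is exactly what bounds per-node cost on large graphs.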

PyG implementation

SAGEConv with mini-batch training using PyG's NeighborLoader:

sage_model.py
import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv
from torch_geometric.loader import NeighborLoader

class GraphSAGE(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels):
        super().__init__()
        self.conv1 = SAGEConv(in_channels, hidden_channels)
        self.conv2 = SAGEConv(hidden_channels, out_channels)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.conv2(x, edge_index)
        return x

# Mini-batch training with neighbor sampling
loader = NeighborLoader(
    data,
    num_neighbors=[25, 10],  # 25 at hop 1, 10 at hop 2
    batch_size=1024,
    input_nodes=data.train_mask,
)

model = GraphSAGE(dataset.num_features, 256, dataset.num_classes)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for batch in loader:
    optimizer.zero_grad()
    out = model(batch.x, batch.edge_index)
    loss = F.cross_entropy(out[:batch.batch_size], batch.y[:batch.batch_size])
    loss.backward()
    optimizer.step()

NeighborLoader handles the sampling. Each mini-batch contains only the sampled subgraph, keeping memory bounded regardless of full graph size.
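That memory bound is easy to quantify. With fanouts [25, 10], one seed node pulls in at most 25 neighbors at hop 1 and 10 neighbors for each of those at hop 2, so the sampled subgraph per seed never exceeds 275 nodes no matter how large the full graph is. A back-of-envelope sketch:

```python
def receptive_field_bound(fanouts):
    """Worst-case number of sampled nodes per seed node, given the
    per-hop fanouts as passed to NeighborLoader's num_neighbors."""
    total, frontier = 0, 1
    for k in fanouts:
        frontier *= k     # nodes added at this hop, worst case
        total += frontier
    return total

print(receptive_field_bound([25, 10]))      # 25 + 250 = 275
print(receptive_field_bound([10, 10, 10]))  # 10 + 100 + 1000 = 1110
```

Note how the bound grows multiplicatively with depth, which is why production configurations keep the number of layers small and shrink the fanout at deeper hops.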

When to use SAGEConv

  • Large graphs (100K+ nodes). SAGEConv's sampling makes mini-batch training practical. GCNConv on a million-node graph requires the full adjacency matrix in memory. SAGEConv processes it in constant-size batches.
  • New nodes at inference time. E-commerce catalogs, social networks, and content platforms add new entities constantly. SAGEConv computes embeddings for new nodes from their neighbors without retraining the model.
  • Production systems with latency requirements. The fixed sampling budget makes inference time predictable. You can tune the number of sampled neighbors to trade accuracy for speed.
  • Homogeneous graphs where you need more power than GCN. The separate weight matrices for self and neighbor features give SAGEConv more expressiveness while staying fast.
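Why inductive inference works can be sketched in a few lines: the trained function depends only on features, never on node identity, so it applies unchanged to a node created after training (pure Python, hypothetical scalar weights):

```python
def sage_embed(feat_self, neigh_feats, w_self, w_neigh):
    """SAGEConv-style embedding from features alone: mean-aggregate the
    neighbors, then combine with the node's own feature via two weights."""
    h_neigh = sum(neigh_feats) / len(neigh_feats)
    return w_self * feat_self + w_neigh * h_neigh

w_self, w_neigh = 0.5, 0.25  # "trained" weights, fixed once

# A node seen during training:
print(sage_embed(1.0, [2.0, 4.0, 6.0], w_self, w_neigh))  # 1.5

# A brand-new node at inference time: same function, no retraining,
# embedding computed purely from its neighbors' features.
print(sage_embed(3.0, [1.0, 5.0], w_self, w_neigh))       # 2.25
```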

When not to use SAGEConv

1. When neighbor importance varies

SAGEConv (with mean aggregation) treats all sampled neighbors equally. In fraud detection, a transaction to a flagged merchant carries more signal than a routine purchase. For tasks where neighbor importance varies, GATConv or TransformerConv learn attention weights per edge.

2. Heterogeneous graphs

SAGEConv applies the same weight matrices to all node types and edge types. Enterprise relational databases have customers, orders, products, and merchants as different node types. Use RGCNConv, HGTConv, or HeteroConv for multi-type graphs.

3. Small graphs where full-batch training is feasible

If your graph fits in GPU memory (under ~50K nodes), the sampling introduces variance without a speed benefit. GCNConv or GATConv with full-batch training will give more stable gradients.

Comparison to alternatives

Layer            Scalability                              Inductive                  Attention
GCNConv          Full-batch typical                       Yes (parametric)           No
SAGEConv         Mini-batch via sampling                  Yes                        No
GATConv          Full-batch (can combine with sampling)   Yes                        Yes
ClusterGCNConv   Mini-batch via clustering                Requires re-partitioning   No

How KumoRFM builds on this

KumoRFM inherits SAGEConv's core design principles: sampling for scalability, inductive learning for new entities, and mini-batch training for bounded memory. But it extends them in three ways:

  • Learned attention replaces mean aggregation, so the model automatically weights important neighbors higher
  • Type-aware transformations handle heterogeneous relational databases with multiple node and edge types
  • Temporal encodings capture when events happened, not just the graph structure, which is critical for time-sensitive predictions like churn and fraud

The result: you write one line of PQL and get a production-grade graph model that combines SAGEConv's scalability with transformer-level expressiveness, no PyG code required.

Frequently asked questions

What is SAGEConv in PyTorch Geometric?

SAGEConv implements the GraphSAGE layer from Hamilton, Ying & Leskovec (2017). Instead of using all neighbors like GCNConv, it samples a fixed number of neighbors and aggregates their features using mean, max, or LSTM pooling. This makes it scalable to graphs with millions or billions of nodes and supports inductive learning on unseen nodes.

What is the difference between SAGEConv and GCNConv?

GCNConv uses all neighbors with symmetric degree normalization and a single weight matrix. SAGEConv samples a fixed number of neighbors, uses separate weight matrices for the node itself and its neighbors, and supports multiple aggregation functions (mean, max, LSTM). SAGEConv also works inductively: it can generalize to nodes not seen during training.

How does neighbor sampling work in GraphSAGE?

At each layer, SAGEConv samples a fixed number of neighbors (e.g., 25 at layer 1, 10 at layer 2) rather than using all of them. This bounds the computation per node regardless of graph size. During training, PyG's NeighborLoader handles sampling automatically. At inference, you can use all neighbors or continue sampling.
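A quick toy sketch of the trade-off (pure Python, hypothetical feature values): the sampled mean is a noisy estimate of the full-neighborhood mean, which is why using all neighbors at inference is the more accurate but slower option.

```python
import random

neigh_feats = [float(i) for i in range(100)]     # 100 neighbors, features 0..99
full_mean = sum(neigh_feats) / len(neigh_feats)  # exact aggregation
print(full_mean)  # 49.5

rng = random.Random(7)
sample = rng.sample(neigh_feats, 10)             # sampled aggregation (k=10)
sampled_mean = sum(sample) / len(sample)         # noisy estimate of 49.5
print(sampled_mean)
```

In PyG, passing -1 as a fanout in NeighborLoader's num_neighbors keeps all neighbors at that hop, which is the common choice for offline embedding generation.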

Can SAGEConv handle new nodes at inference time?

Yes. SAGEConv is inductive, meaning it learns a function that aggregates neighbor features rather than memorizing fixed node embeddings. When a new node appears (e.g., a new user or product), SAGEConv computes its embedding from its neighbors' features. This is critical for production systems where new entities appear continuously.

Which companies use GraphSAGE in production?

Pinterest uses GraphSAGE (PinSage) for visual recommendations across 3 billion pins. DoorDash uses it for estimated delivery time prediction. Uber uses it for fraud detection. The sampling-based architecture makes it practical for real-time inference on massive graphs where using all neighbors is computationally infeasible.

Learn more about graph ML

PyTorch Geometric is the open-source foundation for graph neural networks. Explore more layers, concepts, and production patterns.