Original Paper
Inductive Representation Learning on Large Graphs
Hamilton, Ying & Leskovec. NeurIPS 2017.
What SAGEConv does
SAGEConv (Sample and Aggregate) performs neighborhood aggregation with two key differences from GCNConv:
- Sampling: Instead of using all neighbors, it samples a fixed number (e.g., 25 neighbors at layer 1, 10 at layer 2). This bounds computation per node.
- Separate transforms: It applies separate weight matrices to the node's own features and to the aggregated neighbor features, then combines them. This gives it more expressiveness than GCNConv's single shared weight.
The result is a layer that scales to arbitrarily large graphs and generalizes to unseen nodes, the two capabilities production deployment demands and full-batch GCNConv lacks.
The math (simplified)
h_N(i) = AGGREGATE({ h_j for j in SAMPLE(N(i), k) })
h_i' = W_self · h_i + W_neigh · h_N(i)
Where:
SAMPLE(N(i), k) = sample k neighbors of node i
AGGREGATE = mean, max, or LSTM pooling
W_self = weight matrix for the node itself
W_neigh = weight matrix for aggregated neighbors
h_i' = updated node representation

The separate weight matrices (W_self and W_neigh) let the model learn how much to weight a node's own features vs. its neighborhood context.
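To make the update rule concrete, here is a minimal hand-computed sketch in plain Python; the 2-dimensional features and weight matrices are invented for illustration:

```python
# Toy 2-D features for node i and two sampled neighbors (invented values).
h_i = [1.0, 0.0]
neighbors = [[0.0, 2.0], [2.0, 0.0]]

# AGGREGATE = mean: element-wise average of the sampled neighbor features.
h_N = [sum(col) / len(neighbors) for col in zip(*neighbors)]  # [1.0, 1.0]

# Separate 2x2 weight matrices for self and neighborhood (invented values).
W_self = [[1.0, 0.0], [0.0, 1.0]]   # identity: keep own features as-is
W_neigh = [[0.5, 0.0], [0.0, 0.5]]  # scale neighborhood context by 0.5

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

# h_i' = W_self · h_i + W_neigh · h_N(i)
h_i_new = [a + b for a, b in zip(matvec(W_self, h_i), matvec(W_neigh, h_N))]
print(h_i_new)  # [1.5, 0.5]
```

With identity self-weights and halved neighbor-weights, the updated representation keeps the node's own signal intact and blends in a damped neighborhood average.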
PyG implementation
SAGEConv with mini-batch training using PyG's NeighborLoader:
import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv
from torch_geometric.loader import NeighborLoader
class GraphSAGE(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels):
        super().__init__()
        self.conv1 = SAGEConv(in_channels, hidden_channels)
        self.conv2 = SAGEConv(hidden_channels, out_channels)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.conv2(x, edge_index)
        return x
# Mini-batch training with neighbor sampling
loader = NeighborLoader(
    data,
    num_neighbors=[25, 10],  # 25 at hop 1, 10 at hop 2
    batch_size=1024,
    input_nodes=data.train_mask,
)
model = GraphSAGE(dataset.num_features, 256, dataset.num_classes)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
for batch in loader:
    optimizer.zero_grad()
    out = model(batch.x, batch.edge_index)
    loss = F.cross_entropy(out[:batch.batch_size], batch.y[:batch.batch_size])
    loss.backward()
    optimizer.step()

NeighborLoader handles the sampling. Each mini-batch contains only the sampled subgraph, keeping memory bounded regardless of full graph size.
When to use SAGEConv
- Large graphs (100K+ nodes). SAGEConv's sampling makes mini-batch training practical. GCNConv on a million-node graph requires the full adjacency matrix in memory. SAGEConv processes it in constant-size batches.
- New nodes at inference time. E-commerce catalogs, social networks, and content platforms add new entities constantly. SAGEConv computes embeddings for new nodes from their neighbors without retraining the model.
- Production systems with latency requirements. The fixed sampling budget makes inference time predictable. You can tune the number of sampled neighbors to trade accuracy for speed.
- Homogeneous graphs where you need more power than GCN. The separate weight matrices for self and neighbor features give SAGEConv more expressiveness while staying fast.
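The inductive property in the second bullet follows from the math above: a trained SAGE layer is just its two weight matrices, so a node added after training gets an embedding from its own features plus a sampled neighborhood, with no retraining. A pure-Python sketch (the frozen weights, features, and sample size are invented for illustration, not PyG API):

```python
import random

# Frozen weights from a hypothetical trained 2-D SAGE layer (invented values).
W_self = [[0.8, 0.1], [0.2, 0.9]]
W_neigh = [[0.3, 0.0], [0.0, 0.3]]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def sage_embed(x, neighbor_feats, k=2):
    """Embed a node from its own features and at most k sampled neighbors."""
    sample = random.sample(neighbor_feats, min(k, len(neighbor_feats)))
    h_N = [sum(col) / len(sample) for col in zip(*sample)]  # mean aggregation
    return [a + b for a, b in zip(matvec(W_self, x), matvec(W_neigh, h_N))]

# A node that did not exist at training time: just a forward pass.
new_node = [1.0, 1.0]
neighbor_feats = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]]
embedding = sage_embed(new_node, neighbor_feats)
print(len(embedding))  # lands in the same 2-D embedding space as trained nodes
```

The same mechanism also explains the latency bullet: `k` caps the work per node, so shrinking it trades aggregation fidelity for faster, more predictable inference.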
When not to use SAGEConv
1. When neighbor importance varies
SAGEConv (with mean aggregation) treats all sampled neighbors equally. In fraud detection, a transaction to a flagged merchant carries more signal than a routine purchase. For tasks where neighbor importance varies, GATConv or TransformerConv learn attention weights per edge.
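The contrast can be sketched in plain Python: mean aggregation gives every sampled neighbor weight 1/k, while attention-style aggregation (the idea behind GATConv, shown here with an invented stand-in for the learned scoring function) lets a high-signal neighbor dominate:

```python
import math

# Invented 1-D "features": how suspicious each neighboring transaction looks.
neighbor_scores = [0.1, 0.2, 5.0]  # two routine purchases, one flagged merchant

# Mean aggregation: every sampled neighbor gets the same weight, 1/k.
mean_agg = sum(neighbor_scores) / len(neighbor_scores)

# Attention-style aggregation: softmax over attention logits. Here the logits
# are just the scores themselves, standing in for a learned scoring function.
exps = [math.exp(s) for s in neighbor_scores]
weights = [e / sum(exps) for e in exps]
attn_agg = sum(w * s for w, s in zip(weights, neighbor_scores))

# Attention pulls the aggregate toward the flagged neighbor; the mean dilutes it.
print(round(mean_agg, 3), round(attn_agg, 3))
```

In a real GATConv the logits come from a learned function of both endpoint features per edge, but the effect is the same: informative neighbors receive larger weights instead of a uniform 1/k.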
2. Heterogeneous graphs
SAGEConv applies the same weight matrices to all node types and edge types. Enterprise relational databases have customers, orders, products, and merchants as different node types. Use RGCNConv, HGTConv, or HeteroConv for multi-type graphs.
3. Small graphs where full-batch training is feasible
If your graph fits in GPU memory (under ~50K nodes), the sampling introduces variance without a speed benefit. GCNConv or GATConv with full-batch training will give more stable gradients.
Comparison to alternatives
| Layer | Scalability | Inductive | Attention |
|---|---|---|---|
| GCNConv | Full-batch typical | Yes (parametric) | No |
| SAGEConv | Mini-batch via sampling | Yes | No |
| GATConv | Full-batch (can combine with sampling) | Yes | Yes |
| ClusterGCNConv | Mini-batch via clustering | Requires re-partitioning | No |
How KumoRFM builds on this
KumoRFM inherits SAGEConv's core design principles: sampling for scalability, inductive learning for new entities, and mini-batch training for bounded memory. But it extends them in three ways:
- Learned attention replaces mean aggregation, so the model automatically weights important neighbors higher
- Type-aware transformations handle heterogeneous relational databases with multiple node and edge types
- Temporal encodings capture when events happened, not just the graph structure, which is critical for time-sensitive predictions like churn and fraud
The result: you write one line of PQL and get a production-grade graph model that combines SAGEConv's scalability with transformer-level expressiveness, no PyG code required.