Graph-based recommendation systems outperform collaborative filtering by learning from the full network of user-item interactions, not just a sparse matrix. When you represent users, items, and their relationships as a graph, GNNs can propagate information across multi-hop paths: a user connects to purchased items, those items connect to their categories and other buyers, those buyers connect to their other purchases. This path structure is the recommendation signal.
The limitation of matrix factorization
Collaborative filtering represents interactions as a user-item matrix and decomposes it into low-rank factors. User u's preference for item i is the dot product of their latent vectors. This works well when the matrix is dense enough, but suffers from three structural limitations:
- Sparsity: Most users interact with fewer than 0.1% of items, leaving the matrix more than 99.9% empty.
- No side information: Item categories, brands, user demographics, and social connections must be engineered as additional features rather than being part of the model structure.
- No path reasoning: The model cannot learn that “users who buy X also browse Y, and Y is made by the same brand as Z” because it only sees direct user-item interactions.
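The dot-product scoring described above can be sketched in a few lines. The factor matrices here are randomly initialized stand-ins for factors that would in practice be learned by ALS or SGD on observed interactions:

```python
import numpy as np

rng = np.random.default_rng(0)
num_users, num_items, rank = 100, 500, 16

# Low-rank factors (in practice learned, not random).
U = rng.normal(size=(num_users, rank))   # user latent vectors
V = rng.normal(size=(num_items, rank))   # item latent vectors

# User u's predicted preference for item i is a dot product.
u, i = 3, 42
score = U[u] @ V[i]

# Ranking all items for one user is a single matrix-vector product.
all_scores = V @ U[u]            # shape: (num_items,)
top10 = np.argsort(-all_scores)[:10]
```

Note that nothing here can see beyond the direct user-item entries: side information and multi-hop paths have no place in the factorization.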
The graph formulation
A recommendation graph is bipartite at its core: user nodes connect to item nodes through interaction edges (purchased, viewed, rated). Adding metadata makes it heterogeneous:
- User nodes: features include demographics, account age, activity level
- Item nodes: features include price, description embeddings, popularity
- Category/Brand nodes: item metadata becomes first-class entities
- Social edges: user-follows-user captures social influence
- Interaction edges: purchase, view, rating, add-to-cart with timestamps
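The structure above can be sketched with plain tensors, keyed by (source type, relation, destination type) in the style PyG's HeteroData uses. All node counts, feature dimensions, and edge lists below are illustrative:

```python
import torch

# Node features per type (sizes are illustrative).
node_features = {
    "user":  torch.randn(1000, 8),    # demographics, account age, activity
    "item":  torch.randn(5000, 32),   # price, description embedding, popularity
    "brand": torch.randn(200, 4),     # metadata promoted to first-class nodes
}

# Edges keyed by (source type, relation, destination type),
# each stored as a 2 x num_edges index tensor.
edge_index = {
    ("user", "purchased", "item"): torch.tensor([[0, 0, 1], [10, 42, 10]]),
    ("user", "follows", "user"):   torch.tensor([[0], [1]]),
    ("item", "made_by", "brand"):  torch.tensor([[10, 42], [5, 5]]),
}

# Per-edge attributes, e.g. interaction timestamps.
edge_time = {
    ("user", "purchased", "item"): torch.tensor([1000, 1001, 1002]),
}
```

A heterogeneous GNN would then learn separate message functions per relation type, so a "purchased" edge and a "made_by" edge contribute different signals.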
LightGCN: the baseline that beat everything
LightGCN (He et al., 2020) showed that for recommendation, simpler is better. It removes feature transformations and nonlinear activations from GCN, keeping only the core operation: neighborhood aggregation.
```python
import torch
from torch_geometric.nn import LGConv


class LightGCN(torch.nn.Module):
    def __init__(self, num_users, num_items, embedding_dim, num_layers):
        super().__init__()
        self.num_users = num_users
        # One embedding table for users and items; items are offset by num_users.
        self.embedding = torch.nn.Embedding(
            num_users + num_items, embedding_dim
        )
        self.convs = torch.nn.ModuleList(
            [LGConv() for _ in range(num_layers)]
        )

    def forward(self, edge_index):
        x = self.embedding.weight
        xs = [x]
        for conv in self.convs:
            x = conv(x, edge_index)
            xs.append(x)
        # Average across all layers (not just the final one).
        return torch.stack(xs, dim=0).mean(dim=0)
```

LightGCN averages embeddings across all layers: layer 0 is the raw embedding, layer 1 includes 1-hop neighbors, layer 2 includes 2-hop neighbors. The average captures multi-scale signals.
Why layer averaging works
Each layer captures a different scale of the graph. Layer 0 is the node's own embedding. Layer 1 incorporates direct neighbors (items a user interacted with). Layer 2 incorporates the neighbors of those items (other users who also interacted with them, plus their other items). Averaging across layers combines all these scales into a single representation.
Multi-hop paths are the recommendation signal
Consider recommending a product to user Alice. In the graph:
- 1-hop: Alice's directly purchased items (already known)
- 2-hop: Other users who bought the same items as Alice (similar users)
- 3-hop: Items those similar users bought that Alice has not seen yet (recommendations)
With 2-3 layers of message passing, Alice's embedding already encodes this 3-hop neighborhood. Scoring Alice against a candidate item is a single dot product. The entire multi-hop reasoning is baked into the embedding through graph structure.
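This can be made concrete in plain torch. The sketch below reimplements LightGCN-style propagation (symmetric normalization, then layer averaging) on a tiny invented bipartite graph, and scores user 0 ("Alice") against every item with one matrix-vector product; all sizes and edges are illustrative:

```python
import torch

torch.manual_seed(0)
num_users, num_items, dim, num_layers = 4, 5, 16, 2
n = num_users + num_items

# Bipartite interaction edges (user index, item index); Alice is user 0.
interactions = torch.tensor([[0, 0], [1, 0], [1, 1], [2, 2], [3, 2]])

# Symmetric adjacency over the joint user+item node set.
A = torch.zeros(n, n)
for u, i in interactions:
    A[u, num_users + i] = A[num_users + i, u] = 1.0

# Symmetric normalization D^-1/2 A D^-1/2, as in LightGCN's propagation rule.
deg = A.sum(dim=1).clamp(min=1)
A_hat = A / (deg.sqrt().unsqueeze(1) * deg.sqrt().unsqueeze(0))

x = torch.randn(n, dim)          # layer-0 embeddings
layers = [x]
for _ in range(num_layers):
    x = A_hat @ x                # each hop mixes in one more ring of neighbors
    layers.append(x)
final = torch.stack(layers).mean(dim=0)    # layer averaging

# Scoring Alice against all items is one matrix-vector product.
alice = final[0]
item_scores = final[num_users:] @ alice    # shape: (num_items,)
```

After two propagation steps, Alice's row already mixes in her items, their other buyers, and those buyers' items, so the cheap dot product at the end is doing 3-hop reasoning.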
Scaling to production: PinSage
Pinterest's PinSage (Ying et al., 2018) proved that graph recommendations work at internet scale: 3 billion nodes, 18 billion edges. The key innovations:
- Random-walk neighbor sampling: instead of using all neighbors, sample a fixed set based on random-walk importance. This bounds computation per node.
- Mini-batch training: train on subgraphs, not the full graph. PyG's NeighborLoader implements this directly.
- Curriculum training: start with easy negatives (random items), progress to hard negatives (popular items in the same category).
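The first idea can be sketched in a few lines of plain Python: run short random walks from a node, count visits, and keep only the most-visited nodes as its sampled neighborhood. This is a simplified illustration of PinSage-style importance sampling, with a toy adjacency list and walk parameters chosen arbitrarily:

```python
import random

def importance_neighbors(adj, node, num_walks=200, walk_len=3, top_k=2, seed=0):
    """Rank nodes near `node` by short-random-walk visit counts and
    keep the top_k most visited (PinSage-style importance sampling)."""
    rng = random.Random(seed)
    visits = {}
    for _ in range(num_walks):
        cur = node
        for _ in range(walk_len):
            nbrs = adj.get(cur)
            if not nbrs:
                break
            cur = rng.choice(nbrs)
            if cur != node:
                visits[cur] = visits.get(cur, 0) + 1
    ranked = sorted(visits, key=visits.get, reverse=True)
    return ranked[:top_k]

# Toy graph: "a" touches "b", "c", "d"; "b" is the best-connected hub.
adj = {"a": ["b", "c", "d"], "b": ["a", "c", "e"], "c": ["a", "b"],
       "d": ["a"], "e": ["b"]}
sampled = importance_neighbors(adj, "a")
```

Because visit counts weight neighbors by how reachable they are, well-connected hubs like "b" are kept while peripheral nodes are dropped, and each node's aggregation cost is capped at top_k regardless of its true degree.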
Pinterest reported a 40% improvement in user engagement after deploying PinSage, evidence that graph structure carries signal that is very hard to replicate with feature engineering on flat data.
Cold-start advantage
Matrix factorization cannot score new users or items with no interactions because they have no learned latent vector. Graph models solve this through connectivity:
- A new user's demographic features connect them to similar user nodes
- A new item's category and brand connect it to existing items in the graph
- Message passing propagates information from these connections to generate embeddings
This is the cold-start problem solved structurally rather than through heuristics.
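A minimal sketch of the item side of this, assuming the metadata embeddings are already learned: one round of mean aggregation over a new item's category and brand neighbors yields a usable starting embedding before any interaction exists. All tensors here are illustrative placeholders:

```python
import torch

torch.manual_seed(0)
dim = 8

# Embeddings already learned for existing metadata nodes.
category_emb = torch.randn(dim)   # e.g. the new item's category node
brand_emb = torch.randn(dim)      # e.g. the new item's brand node

# A brand-new item has no interactions, but it does have metadata edges.
# One mean-aggregation step over those neighbors produces its embedding:
# connectivity substitutes for interaction history.
new_item_emb = torch.stack([category_emb, brand_emb]).mean(dim=0)

# The new item can immediately be scored against any user embedding.
user_emb = torch.randn(dim)
score = new_item_emb @ user_emb
```

As interactions accumulate, message passing blends them in automatically; no separate cold-start heuristic ever needs to be switched off.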