Label propagation is a semi-supervised learning algorithm that classifies unlabeled nodes by spreading known labels through graph edges. If 5% of nodes are labeled and 95% are not, label propagation diffuses those labels through the graph structure. A node surrounded by “fraud” labels becomes likely to be fraud. A node surrounded by “legitimate” labels becomes likely to be legitimate. No neural network, no training loop, no node features needed.
This is the simplest and oldest graph-based semi-supervised learning method. Despite its simplicity, it remains a surprisingly strong baseline. Any GNN that cannot beat label propagation is not learning useful patterns from node features.
The algorithm
import torch
from torch_geometric.nn import LabelPropagation
# Cora citation network: 2,708 nodes, 7 classes, only 140 labeled (5.2%)
model = LabelPropagation(num_layers=50, alpha=0.9)
# y: [num_nodes] class indices; entries outside the mask are ignored
#    (PyG one-hot encodes y before masking, so they must still be valid)
# edge_index: [2, num_edges] graph connectivity
# train_mask: [num_nodes] bool, True for labeled nodes
# One call - no training loop, no gradients, no optimizer
out = model(y, edge_index, mask=train_mask)
# out: [num_nodes, num_classes] class probabilities
pred = out.argmax(dim=-1)
# pred[i] = predicted class for node i
# On Cora: ~70% accuracy with just 140 labels
Label propagation in PyG. One function call, no training. The algorithm iteratively spreads labels through edges.
How diffusion works
- Initialize: labeled nodes get one-hot vectors [1,0,0,...] for their class. Unlabeled nodes get uniform [1/C, 1/C, ...] or zeros.
- Propagate: each node averages its neighbors' label distributions.
- Anchor: labeled nodes are reset to their known labels (weight alpha controls how much to trust propagated vs original labels).
- Repeat: iterate until convergence. The result gives class probabilities for every node.
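The four steps above can be sketched in a few lines of plain PyTorch. This is a toy 4-node path graph, not the PyG implementation; the propagation step is kept "lazy" (half self, half neighbors), a common stabilizer that avoids oscillation on bipartite graphs like a path:

```python
import torch

# Toy walkthrough of the four steps on a path graph 0-1-2-3.
# Node 0 is labeled class 0, node 3 is labeled class 1; 1 and 2 are unlabeled.
num_nodes, num_classes = 4, 2
edges = [(0, 1), (1, 2), (2, 3)]

# Row-normalized adjacency: multiplying by A averages neighbor distributions.
A = torch.zeros(num_nodes, num_nodes)
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
A = A / A.sum(dim=1, keepdim=True)

# Step 1 - initialize: one-hot for labeled nodes, uniform for the rest.
Y = torch.full((num_nodes, num_classes), 1.0 / num_classes)
Y[0] = torch.tensor([1.0, 0.0])
Y[3] = torch.tensor([0.0, 1.0])
Y0 = Y.clone()
mask = torch.tensor([True, False, False, True])

alpha = 0.9  # how much anchored nodes trust propagated vs. original labels
for _ in range(50):
    # Step 2 - propagate ("lazy": keep half of each node's own distribution
    # to damp oscillation on bipartite graphs such as this path).
    Y = 0.5 * Y + 0.5 * (A @ Y)
    # Step 3 - anchor: pull labeled nodes back toward their known labels.
    Y[mask] = alpha * Y[mask] + (1 - alpha) * Y0[mask]
    # Step 4 - repeat until convergence.

pred = Y.argmax(dim=-1)
# Node 1 ends up class 0 (it sits closer to node 0), node 2 class 1.
```

Note how the result encodes distance to the labels: node 1 is only weakly class 0 (about 0.54 probability), because the opposing label is just two hops away.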
Enterprise example: fraud investigation triage
A bank has 10 million accounts. Investigators have confirmed 1,000 as fraudulent and 5,000 as legitimate. The remaining 9,994,000 are unlabeled. The question: which accounts should investigators look at next?
Label propagation on the transaction graph:
- 1,000 fraud labels spread through the transaction network
- Accounts with many connections to known fraudsters get high fraud probability
- Accounts with connections only to known legitimate accounts get low fraud probability
- Accounts in mixed neighborhoods get intermediate probabilities
The output is a risk ranking of all 10 million accounts. Investigators start with the highest-probability unlabeled accounts. This runs in minutes, requires no model training, and uses no features beyond the transaction graph and confirmed labels.
Combining with GNNs
Label propagation and GNNs are complementary:
- LP as features: run label propagation, add its class probabilities as extra node features for the GNN. This gives the GNN access to structural label information.
- LP for pseudo-labels: use high-confidence LP predictions as additional training labels for the GNN. This increases the effective labeled set.
- LP as baseline: if GNN accuracy is below LP accuracy, the GNN is not learning from features. Debug the model.
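The first two patterns amount to a few tensor operations. A sketch with hypothetical stand-in tensors (`x`, `lp_probs`) and an assumed confidence threshold of 0.9:

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-ins: x is a node-feature matrix, lp_probs is the
# [num_nodes, num_classes] output of a prior label-propagation run.
num_nodes, num_feats, num_classes = 100, 16, 7
x = torch.randn(num_nodes, num_feats)
lp_probs = torch.softmax(torch.randn(num_nodes, num_classes), dim=-1)
train_mask = torch.zeros(num_nodes, dtype=torch.bool)
train_mask[:10] = True

# 1) LP as features: concatenate LP probabilities onto the node features,
#    giving the GNN direct access to structural label information.
x_aug = torch.cat([x, lp_probs], dim=-1)  # [num_nodes, num_feats + num_classes]

# 2) LP for pseudo-labels: adopt confident LP predictions as extra labels.
#    The 0.9 threshold is an assumption; tune it on validation data.
conf, pseudo_y = lp_probs.max(dim=-1)
pseudo_mask = (conf > 0.9) & ~train_mask
# Train the GNN on train_mask | pseudo_mask, using pseudo_y on pseudo nodes.
```

Keeping `pseudo_mask` disjoint from `train_mask` ensures pseudo-labels never overwrite ground-truth labels.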