Label Propagation: Semi-Supervised Learning by Spreading Labels

Label propagation classifies unlabeled nodes by spreading known labels through graph edges. No neural network, no training, no features needed. Just graph structure and a few labeled nodes.


TL;DR

  • Label propagation spreads known labels to unlabeled nodes through graph edges. Connected nodes are likely to share labels (homophily assumption). No neural network or training required.
  • Algorithm: initialize labeled nodes with one-hot labels, iterate neighbor averaging with anchoring of known labels. Convergence gives class probabilities for all nodes.
  • Surprisingly competitive: on citation networks with 20 labels per class, label propagation achieves ~70% accuracy. It is the baseline every GNN must beat.
  • Works best with high homophily (connected nodes share labels). Fails on heterophilic graphs where connected nodes tend to have different labels.
  • Combine with GNNs: use label propagation pseudo-labels as additional features or training signal. This hybrid approach often outperforms either method alone.

Label propagation is a semi-supervised learning algorithm that classifies unlabeled nodes by spreading known labels through graph edges. If 5% of nodes are labeled and 95% are not, label propagation diffuses those labels through the graph structure. A node surrounded by “fraud” labels becomes likely to be fraud. A node surrounded by “legitimate” labels becomes likely to be legitimate. No neural network, no training loop, no node features needed.

This is the simplest and oldest graph-based semi-supervised learning method. Despite its simplicity, it remains a surprisingly strong baseline. Any GNN that cannot beat label propagation is not learning useful patterns from node features.

The algorithm

label_propagation.py
import torch
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import LabelPropagation

# Cora citation network: 2708 nodes, 7 classes, only 140 labeled (5.2%)
dataset = Planetoid(root='data/Planetoid', name='Cora')
data = dataset[0]

model = LabelPropagation(num_layers=50, alpha=0.9)

# data.y: [num_nodes] class indices (entries outside the mask are
# ignored; indices must be non-negative for one-hot encoding)
# data.train_mask: [num_nodes] bool, True for labeled nodes

# One call - no training loop, no gradients, no optimizer
out = model(data.y, data.edge_index, mask=data.train_mask)
# out: [num_nodes, num_classes] class probabilities

pred = out.argmax(dim=-1)
# pred[i] = predicted class for node i

# On Cora: ~70% accuracy with just 140 labels

Label propagation in PyG. One function call, no training. The algorithm iteratively spreads labels through edges.

How diffusion works

  1. Initialize: labeled nodes get one-hot vectors [1,0,0,...] for their class. Unlabeled nodes get uniform [1/C, 1/C, ...] or zeros.
  2. Propagate: each node averages its neighbors' label distributions.
  3. Anchor: labeled nodes are reset to their known labels (weight alpha controls how much to trust propagated vs original labels).
  4. Repeat: iterate until convergence. The result gives class probabilities for every node.
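The four steps above can be sketched with a dense adjacency matrix. This is a minimal illustration, not PyG's sparse implementation; self-loops are added here so that propagation on bipartite-like graphs does not oscillate.

```python
import torch

def propagate_labels(adj, y, mask, alpha=0.9, num_iters=50):
    """Dense label-propagation sketch.

    adj:  [N, N] symmetric adjacency matrix (float)
    y:    [N] class indices; entries outside `mask` are ignored
    mask: [N] bool, True for labeled nodes
    """
    num_classes = int(y[mask].max()) + 1
    # Step 1: one-hot vectors for labeled nodes, zeros for unlabeled
    out = torch.zeros(y.size(0), num_classes)
    out[mask] = torch.nn.functional.one_hot(y[mask], num_classes).float()
    anchor = out.clone()

    # Self-loops plus row normalization, so one step averages each
    # node with its neighbors (the self-loop damps oscillation)
    adj = adj + torch.eye(adj.size(0))
    norm_adj = adj / adj.sum(dim=1, keepdim=True)

    for _ in range(num_iters):
        # Step 2: average the neighbors' label distributions
        out = norm_adj @ out
        # Step 3: pull labeled nodes back toward their known labels
        out[mask] = alpha * out[mask] + (1 - alpha) * anchor[mask]
    return out  # Step 4: converged class scores for every node

# Path graph 0-1-2-3; node 0 labeled class 0, node 3 labeled class 1
adj = torch.tensor([[0., 1, 0, 0],
                    [1, 0, 1, 0],
                    [0, 1, 0, 1],
                    [0, 0, 1, 0]])
y = torch.tensor([0, 0, 0, 1])
mask = torch.tensor([True, False, False, True])
pred = propagate_labels(adj, y, mask).argmax(dim=-1)
# Node 1 sits next to the class-0 anchor, node 2 next to the class-1 anchor
```

On this path graph, node 1 ends up classified like node 0 and node 2 like node 3: labels win out at the nodes closest to their anchors.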

Enterprise example: fraud investigation triage

A bank has 10 million accounts. Investigators have confirmed 1,000 as fraudulent and 5,000 as legitimate. The remaining 9,994,000 are unlabeled. The question: which accounts should investigators look at next?

Label propagation on the transaction graph:

  • 1,000 fraud labels spread through the transaction network
  • Accounts with many connections to known fraudsters get high fraud probability
  • Accounts with connections only to known legitimate accounts get low fraud probability
  • Accounts in mixed neighborhoods get intermediate probabilities

The output is a risk ranking of all 10 million accounts. Investigators start with the highest-probability unlabeled accounts. This runs in minutes, requires no model training, and uses no features beyond the transaction graph and confirmed labels.
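This triage can be sketched with plain tensor operations on a toy stand-in for the transaction graph (the accounts, edges, and labels below are invented for illustration; a real pipeline would use the PyG `LabelPropagation` call shown earlier):

```python
import torch

# Toy transaction graph: 8 accounts, undirected edges as a [2, E] tensor.
# Accounts 0-3 form one transaction cluster, accounts 4-7 another.
edge_index = torch.tensor([
    [0, 1, 1, 2, 2, 3, 4, 5, 5, 6, 6, 7],
    [1, 0, 2, 1, 3, 2, 5, 4, 6, 5, 7, 6],
])
num_nodes = 8

# Confirmed labels: account 0 is fraud (class 1), account 7 is
# legitimate (class 0); everything else is unlabeled.
y = torch.zeros(num_nodes, dtype=torch.long)
y[0] = 1
labeled = torch.zeros(num_nodes, dtype=torch.bool)
labeled[[0, 7]] = True

# One-hot seed; unlabeled accounts start at zero
probs = torch.zeros(num_nodes, 2)
probs[labeled] = torch.nn.functional.one_hot(y[labeled], 2).float()
seed = probs.clone()

src, dst = edge_index
deg = torch.zeros(num_nodes).index_add_(0, dst, torch.ones(src.size(0)))
for _ in range(10):  # a few rounds suffice on this toy graph
    # Average neighbor distributions: sparse neighbor sum / degree
    agg = torch.zeros_like(probs).index_add_(0, dst, probs[src])
    probs = agg / deg.clamp(min=1).unsqueeze(-1)
    probs[labeled] = seed[labeled]  # hard-anchor confirmed accounts

# Rank unlabeled accounts by fraud probability (column 1)
fraud_score = probs[:, 1]
order = fraud_score.argsort(descending=True)
queue = [i.item() for i in order if not labeled[i]]
```

`queue` is the investigator worklist: accounts nearest the confirmed fraudster rank first, while accounts in the all-legitimate cluster keep a fraud score of zero.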

Combining with GNNs

Label propagation and GNNs are complementary:

  • LP as features: run label propagation, add its class probabilities as extra node features for the GNN. This gives the GNN access to structural label information.
  • LP for pseudo-labels: use high-confidence LP predictions as additional training labels for the GNN. This increases the effective labeled set.
  • LP as baseline: if GNN accuracy is below LP accuracy, the GNN is not learning from features. Debug the model.

Frequently asked questions

What is label propagation?

Label propagation is a semi-supervised learning algorithm that spreads known labels to unlabeled nodes through graph edges. If node A is labeled 'fraud' and is connected to unlabeled node B, B gets some probability of being fraud. After multiple iterations, label information diffuses through the graph, classifying all nodes based on their proximity to labeled ones.

How does label propagation differ from GNNs?

Label propagation is a non-parametric algorithm: no neural network, no training, no gradients. It simply diffuses labels through edges. GNNs learn parametric transformations of node features. Label propagation uses only graph structure and known labels. GNNs also use node features. In practice, label propagation is a strong baseline that GNNs must beat.

When is label propagation enough?

Label propagation works well when graph structure strongly correlates with labels (homophily: similar nodes are connected) and you have limited labeled data. On citation networks with 20 labels per class, label propagation achieves ~70% accuracy vs GCN at ~81%. If you have good node features, GNNs are better. If you have only graph structure, label propagation is surprisingly competitive.

What is homophily and why does it matter for label propagation?

Homophily means connected nodes tend to have the same label: fraudsters connect to fraudsters, researchers in the same field cite each other. Label propagation assumes homophily because it spreads labels through edges. On heterophilic graphs (connected nodes have different labels), label propagation performs poorly.
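Edge homophily can be measured directly as the fraction of edges whose endpoints share a label, which makes it easy to check whether label propagation is a reasonable fit before running it (a plain-torch sketch; PyG also provides `torch_geometric.utils.homophily`):

```python
import torch

def edge_homophily(edge_index, y):
    # Fraction of edges that connect same-label endpoints
    src, dst = edge_index
    return (y[src] == y[dst]).float().mean().item()

# Toy graph: nodes 0,1 have label 0; nodes 2,3 have label 1
y = torch.tensor([0, 0, 1, 1])
edges = torch.tensor([[0, 2, 0],
                      [1, 3, 2]])  # edges 0-1, 2-3, 0-2
ratio = edge_homophily(edges, y)
# Edges 0-1 and 2-3 join same-label nodes, 0-2 does not: ratio = 2/3
```

Values near 1 indicate a homophilic graph where label propagation should do well; values near or below 1/num_classes signal heterophily, where it will struggle.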

Can I combine label propagation with GNNs?

Yes. Two approaches: (1) Use label propagation as a preprocessing step to generate pseudo-labels, then train a GNN on both real and pseudo-labels. (2) Use label propagation outputs as additional node features for the GNN. Both can improve GNN performance, especially with very few labeled nodes.

Learn more about graph ML

PyTorch Geometric is the open-source foundation for graph neural networks. Explore more layers, concepts, and production patterns.