Label Propagation: Semi-Supervised Learning by Spreading Labels

Label propagation classifies unlabeled nodes by spreading known labels through graph edges. No neural network, no training, no features needed. Just graph structure and a few labeled nodes.


TL;DR

  • Label propagation spreads known labels to unlabeled nodes through graph edges. Connected nodes are likely to share labels (homophily assumption). No neural network or training required.
  • Algorithm: initialize labeled nodes with one-hot labels, iterate neighbor averaging with anchoring of known labels. Convergence gives class probabilities for all nodes.
  • Surprisingly competitive: on citation networks with 20 labels per class, label propagation achieves ~70% accuracy. It is the baseline every GNN must beat.
  • Works best with high homophily (connected nodes share labels). Fails on heterophilic graphs where connected nodes tend to have different labels.
  • Combine with GNNs: use label propagation pseudo-labels as additional features or training signal. This hybrid approach often outperforms either method alone.

Label propagation is a semi-supervised learning algorithm that classifies unlabeled nodes by spreading known labels through graph edges. If 5% of nodes are labeled and 95% are not, label propagation diffuses those labels through the graph structure. A node surrounded by “fraud” labels becomes likely to be fraud. A node surrounded by “legitimate” labels becomes likely to be legitimate. No neural network, no training loop, no node features needed.

This is the simplest and oldest graph-based semi-supervised learning method. Despite its simplicity, it remains a surprisingly strong baseline. Any GNN that cannot beat label propagation is not learning useful patterns from node features.

The algorithm

label_propagation.py
import torch
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import LabelPropagation

# Cora citation network: 2708 nodes, 7 classes, only 140 labeled (5.2%)
dataset = Planetoid(root='data/Planetoid', name='Cora')
data = dataset[0]

model = LabelPropagation(num_layers=50, alpha=0.9)

# data.y: [num_nodes] class indices (entries outside the mask are
# ignored; indices must be non-negative for one-hot encoding)
# data.train_mask: [num_nodes] bool, True for labeled nodes

# One call - no training loop, no gradients, no optimizer
out = model(data.y, data.edge_index, mask=data.train_mask)
# out: [num_nodes, num_classes] class probabilities

pred = out.argmax(dim=-1)
# pred[i] = predicted class for node i

# On Cora: ~70% accuracy with just 140 labels

Label propagation in PyG. One function call, no training. The algorithm iteratively spreads labels through edges.

How diffusion works

  1. Initialize: labeled nodes get one-hot vectors [1,0,0,...] for their class. Unlabeled nodes get uniform [1/C, 1/C, ...] or zeros.
  2. Propagate: each node averages its neighbors' label distributions.
  3. Anchor: labeled nodes are reset to their known labels (weight alpha controls how much to trust propagated vs original labels).
  4. Repeat: iterate until convergence. The result gives class probabilities for every node.
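The four steps above can be sketched with a dense adjacency matrix. This is a minimal illustration, not PyG's sparse implementation; self-loops are added here so that propagation on bipartite-like graphs does not oscillate.

```python
import torch

def propagate_labels(adj, y, mask, alpha=0.9, num_iters=50):
    """Dense label-propagation sketch.

    adj:  [N, N] symmetric adjacency matrix (float)
    y:    [N] class indices; entries outside `mask` are ignored
    mask: [N] bool, True for labeled nodes
    """
    num_classes = int(y[mask].max()) + 1
    # Step 1: one-hot vectors for labeled nodes, zeros for unlabeled
    out = torch.zeros(y.size(0), num_classes)
    out[mask] = torch.nn.functional.one_hot(y[mask], num_classes).float()
    anchor = out.clone()

    # Self-loops plus row normalization, so one step averages each
    # node with its neighbors (the self-loop damps oscillation)
    adj = adj + torch.eye(adj.size(0))
    norm_adj = adj / adj.sum(dim=1, keepdim=True)

    for _ in range(num_iters):
        # Step 2: average the neighbors' label distributions
        out = norm_adj @ out
        # Step 3: pull labeled nodes back toward their known labels
        out[mask] = alpha * out[mask] + (1 - alpha) * anchor[mask]
    return out  # Step 4: converged class scores for every node

# Path graph 0-1-2-3; node 0 labeled class 0, node 3 labeled class 1
adj = torch.tensor([[0., 1, 0, 0],
                    [1, 0, 1, 0],
                    [0, 1, 0, 1],
                    [0, 0, 1, 0]])
y = torch.tensor([0, 0, 0, 1])
mask = torch.tensor([True, False, False, True])
pred = propagate_labels(adj, y, mask).argmax(dim=-1)
# Node 1 sits next to the class-0 anchor, node 2 next to the class-1 anchor
```

On this path graph, node 1 ends up classified like node 0 and node 2 like node 3: labels win out at the nodes closest to their anchors.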

Enterprise example: fraud investigation triage

A bank has 10 million accounts. Investigators have confirmed 1,000 as fraudulent and 5,000 as legitimate. The remaining 9,994,000 are unlabeled. The question: which accounts should investigators look at next?

Label propagation on the transaction graph:

  • 1,000 fraud labels spread through the transaction network
  • Accounts with many connections to known fraudsters get high fraud probability
  • Accounts with connections only to known legitimate accounts get low fraud probability
  • Accounts in mixed neighborhoods get intermediate probabilities

The output is a risk ranking of all 10 million accounts. Investigators start with the highest-probability unlabeled accounts. This runs in minutes, requires no model training, and uses no features beyond the transaction graph and confirmed labels.
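This triage can be sketched with plain tensor operations on a toy stand-in for the transaction graph (the accounts, edges, and labels below are invented for illustration; a real pipeline would use the PyG `LabelPropagation` call shown earlier):

```python
import torch

# Toy transaction graph: 8 accounts, undirected edges as a [2, E] tensor.
# Accounts 0-3 form one transaction cluster, accounts 4-7 another.
edge_index = torch.tensor([
    [0, 1, 1, 2, 2, 3, 4, 5, 5, 6, 6, 7],
    [1, 0, 2, 1, 3, 2, 5, 4, 6, 5, 7, 6],
])
num_nodes = 8

# Confirmed labels: account 0 is fraud (class 1), account 7 is
# legitimate (class 0); everything else is unlabeled.
y = torch.zeros(num_nodes, dtype=torch.long)
y[0] = 1
labeled = torch.zeros(num_nodes, dtype=torch.bool)
labeled[[0, 7]] = True

# One-hot seed; unlabeled accounts start at zero
probs = torch.zeros(num_nodes, 2)
probs[labeled] = torch.nn.functional.one_hot(y[labeled], 2).float()
seed = probs.clone()

src, dst = edge_index
deg = torch.zeros(num_nodes).index_add_(0, dst, torch.ones(src.size(0)))
for _ in range(10):  # a few rounds suffice on this toy graph
    # Average neighbor distributions: sparse neighbor sum / degree
    agg = torch.zeros_like(probs).index_add_(0, dst, probs[src])
    probs = agg / deg.clamp(min=1).unsqueeze(-1)
    probs[labeled] = seed[labeled]  # hard-anchor confirmed accounts

# Rank unlabeled accounts by fraud probability (column 1)
fraud_score = probs[:, 1]
order = fraud_score.argsort(descending=True)
queue = [i.item() for i in order if not labeled[i]]
```

`queue` is the investigator worklist: accounts nearest the confirmed fraudster rank first, while accounts in the all-legitimate cluster keep a fraud score of zero.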

Combining with GNNs

Label propagation and GNNs are complementary:

  • LP as features: run label propagation, add its class probabilities as extra node features for the GNN. This gives the GNN access to structural label information.
  • LP for pseudo-labels: use high-confidence LP predictions as additional training labels for the GNN. This increases the effective labeled set.
  • LP as baseline: if GNN accuracy is below LP accuracy, the GNN is not learning from features. Debug the model.

Frequently asked questions

What is label propagation?

Label propagation is a semi-supervised learning algorithm that spreads known labels to unlabeled nodes through graph edges. If node A is labeled 'fraud' and is connected to unlabeled node B, B gets some probability of being fraud. After multiple iterations, label information diffuses through the graph, classifying all nodes based on their proximity to labeled ones.

How does label propagation differ from GNNs?

Label propagation is a non-parametric algorithm: no neural network, no training, no gradients. It simply diffuses labels through edges. GNNs learn parametric transformations of node features. Label propagation uses only graph structure and known labels. GNNs also use node features. In practice, label propagation is a strong baseline that GNNs must beat.

When is label propagation enough?

Label propagation works well when graph structure strongly correlates with labels (homophily: similar nodes are connected) and you have limited labeled data. On citation networks with 20 labels per class, label propagation achieves ~70% accuracy vs GCN at ~81%. If you have good node features, GNNs are better. If you have only graph structure, label propagation is surprisingly competitive.

What is homophily and why does it matter for label propagation?

Homophily means connected nodes tend to have the same label: fraudsters connect to fraudsters, researchers in the same field cite each other. Label propagation assumes homophily because it spreads labels through edges. On heterophilic graphs (connected nodes have different labels), label propagation performs poorly.
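Edge homophily can be measured directly as the fraction of edges whose endpoints share a label, which makes it easy to check whether label propagation is a reasonable fit before running it (a plain-torch sketch; PyG also provides `torch_geometric.utils.homophily`):

```python
import torch

def edge_homophily(edge_index, y):
    # Fraction of edges that connect same-label endpoints
    src, dst = edge_index
    return (y[src] == y[dst]).float().mean().item()

# Toy graph: nodes 0,1 have label 0; nodes 2,3 have label 1
y = torch.tensor([0, 0, 1, 1])
edges = torch.tensor([[0, 2, 0],
                      [1, 3, 2]])  # edges 0-1, 2-3, 0-2
ratio = edge_homophily(edges, y)
# Edges 0-1 and 2-3 join same-label nodes, 0-2 does not: ratio = 2/3
```

Values near 1 indicate a homophilic graph where label propagation should do well; values near or below 1/num_classes signal heterophily, where it will struggle.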

Can I combine label propagation with GNNs?

Yes. Two approaches: (1) Use label propagation as a preprocessing step to generate pseudo-labels, then train a GNN on both real and pseudo-labels. (2) Use label propagation outputs as additional node features for the GNN. Both can improve GNN performance, especially with very few labeled nodes.

Learn more about graph ML

PyTorch Geometric is the open-source foundation for graph neural networks. Explore more layers, concepts, and production patterns.