Karate Club: 34 Nodes That Launched a Thousand Papers

The Zachary Karate Club is one of the most famous social networks in graph theory. With just 34 members and 78 friendships (stored in PyG as 156 directed edges), it demonstrates community detection, graph neural networks, and node classification in a dataset small enough to draw on a napkin.

PyTorch Geometric

TL;DR

  1. Karate Club has 34 nodes (club members), 78 friendships stored in PyG as 156 directed edges, 34-dimensional identity features, and 4 community labels.
  2. It is the 'Hello World' of graph ML: small enough to visualize completely, debug easily, and understand intuitively.
  3. Historically, it demonstrated that social network structure predicts group formation. Zachary predicted the club's split almost perfectly from friendship patterns alone.
  4. Not for benchmarking. Use only for tutorials, visualization, and code debugging. Graduate to Cora for any real evaluation.

Nodes: 34 | Edges: 156 | Features: 34 | Classes: 4

What Karate Club contains

The Zachary Karate Club is a real social network that Wayne Zachary observed at a university karate club from 1970 to 1972 and published in 1977. The 34 nodes represent club members. The 78 undirected edges represent friendships observed outside of club activities; PyG stores each one in both directions, which gives the 156 edges it reports. During Zachary's observation, a dispute between the club instructor (node 0) and the administrator (node 33) caused the club to split into two factions. The PyG version does not use those two factions as labels; it uses 4 community labels obtained via modularity-based clustering for the node classification task.

Node features are 34-dimensional identity vectors (one-hot encoding of node ID). This means the model cannot rely on meaningful features and must learn entirely from graph structure -- making it a pure test of structural learning, albeit on a trivially small graph.
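To make the identity-feature point concrete, here is a minimal sketch in plain Python (no PyG needed). The 4-node size and the weight values are invented for illustration. Multiplying a one-hot row by a weight matrix simply selects one row of that matrix, so the first layer acts like a free, learnable embedding per node rather than a transformation of informative features.

```python
def one_hot(i, n):
    """Row i of the n x n identity matrix."""
    return [1.0 if j == i else 0.0 for j in range(n)]


def matvec_row(x, W):
    """Compute x @ W for a row vector x and a matrix W given as a list of rows."""
    cols = len(W[0])
    return [sum(x[k] * W[k][c] for k in range(len(x))) for c in range(cols)]


n = 4  # toy size; Karate Club uses n = 34
# Hypothetical first-layer weights (one row per node, two hidden units).
W = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]

# Each node's transformed feature vector is exactly "its" row of W.
embeddings = [matvec_row(one_hot(i, n), W) for i in range(n)]
print(embeddings == W)
```

Any signal that separates the communities therefore has to come from the neighborhood aggregation that follows, not from the input features.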

Why Karate Club matters

Karate Club matters for three reasons. First, it demonstrated in 1977 that social network structure alone can predict group behavior. Zachary used maximum flow/minimum cut to predict which faction each member would join, and the prediction was correct for all but one member. This was one of the earliest demonstrations that network topology encodes social dynamics.
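The max-flow/min-cut idea can be sketched on a toy graph in plain Python. This is not Zachary's data or his exact procedure: the 6-node graph, the capacities, and the helper name `min_cut_side` are all assumptions made for illustration. Two tightly knit triangles are joined by one weak tie, node 0 stands in for the instructor and node 5 for the administrator, and the cut falls on the weak tie.

```python
from collections import deque


def min_cut_side(n, edges, source, sink):
    """Edmonds-Karp max flow. Returns (flow value, nodes on the source side of a min cut)."""
    # Residual capacities; an undirected tie gets capacity in both directions.
    cap = [[0] * n for _ in range(n)]
    for u, v, c in edges:
        cap[u][v] += c
        cap[v][u] += c

    def bfs():
        """Breadth-first search over edges that still have residual capacity."""
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if v not in parent and cap[u][v] > 0:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = bfs()
        if sink not in parent:
            # No augmenting path left: the reachable nodes form the source side.
            return flow, set(parent)
        # Find the bottleneck capacity along the path found by BFS.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        # Push that much flow along the path and update the residual graph.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u
        flow += bottleneck


# Two triangles (strong ties, capacity 3) joined by one weak tie (capacity 1).
toy_edges = [(0, 1, 3), (0, 2, 3), (1, 2, 3),
             (3, 4, 3), (3, 5, 3), (4, 5, 3),
             (2, 3, 1)]
flow, instructor_side = min_cut_side(6, toy_edges, source=0, sink=5)
print(flow, sorted(instructor_side))
```

Members on the source side of the cut are predicted to follow the instructor, and the rest to follow the administrator.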

Second, it became the standard illustration for graph algorithms. Countless community detection papers, graph clustering tutorials, and GNN introductions use Karate Club as their visual example. At 34 nodes, you can draw the entire graph, label every node, and trace message passing by hand.
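As an illustration of what tracing message passing by hand looks like, the following sketch runs one aggregation round on a 3-node path graph in plain Python. It is a simplification: each node averages over itself and its neighbors, whereas GCN uses a symmetric degree normalization and a learned weight matrix. The graph and feature values are made up.

```python
# Path graph 0 - 1 - 2 with hypothetical 2-dimensional features.
edges = [(0, 1), (1, 2)]
features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}

# Neighborhoods include the node itself (a self-loop).
neighbors = {node: {node} for node in features}
for u, v in edges:
    neighbors[u].add(v)
    neighbors[v].add(u)


def propagate(features, neighbors):
    """One round: each node's new vector is the mean over itself and its neighbors."""
    out = {}
    for node, nbrs in neighbors.items():
        dim = len(features[node])
        out[node] = [sum(features[m][d] for m in nbrs) / len(nbrs) for d in range(dim)]
    return out


h = propagate(features, neighbors)
for node in sorted(h):
    print(node, h[node])
```

Node 0 ends up halfway between its own vector and node 1's, which is the kind of step you can verify with pencil and paper on the full 34-node graph.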

Third, it is the fastest possible sanity check. Loading Karate Club and running one epoch of GCN takes milliseconds. If your code crashes on Karate Club, the bug is in your code, not your data or hardware.

Loading Karate Club in PyG

load_karate.py
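from torch_geometric.datasets import KarateClub

# The graph ships with PyG, so no root directory or download is needed.
dataset = KarateClub()
data = dataset[0]  # the dataset holds a single graph

print(f"Nodes: {data.num_nodes}")          # 34
print(f"Edges: {data.num_edges}")          # 156 (78 friendships, stored in both directions)
print(f"Features: {data.num_features}")    # 34
print(f"Classes: {dataset.num_classes}")   # 4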

The simplest PyG dataset to load. No download needed -- the data is hardcoded.

Common tasks and visualization

The two common tasks are node classification (predict community membership) and community detection (discover the 2 or 4 groups). The real value is visualization: plotting node embeddings after GNN training shows clear cluster separation, which illustrates that GNNs can learn meaningful representations from graph structure. Many PyG tutorials start with this exact exercise.

Example: organizational network analysis

Karate Club's scenario -- a group splitting along social lines -- maps directly to organizational dynamics. Companies experience team splits, department reorganizations, and cultural fractures that follow communication network patterns. Network analysis on internal communication graphs (email, Slack, meeting co-attendance) can identify emerging factions before they become visible to management. The principles Zachary demonstrated in 1977 are applied in modern organizational analytics.

Historical benchmark results

Karate Club is too small for rigorous benchmarking, but these results from community detection literature are commonly cited.

| Method | NMI | Year | Paper |
|---|---|---|---|
| Min-cut (Zachary) | ~1.0 | 1977 | Zachary |
| Spectral clustering | ~1.0 | 2007 | von Luxburg |
| Louvain | ~0.69 | 2008 | Blondel et al. |
| DeepWalk | ~1.0 | 2014 | Perozzi et al. |
| GCN (2-layer) | ~1.0 | 2016 | Kipf & Welling |

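NMI (Normalized Mutual Information) scores how well a predicted clustering agrees with the reference labels, from 0 (no agreement) to 1 (identical up to renaming of clusters). Most methods achieve near-perfect results because the 2-community split is well-separated. The 4-community version is harder.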
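For readers who want to see how an NMI score is computed, here is a small plain-Python sketch. It uses one common normalization, NMI = 2 I(U;V) / (H(U) + H(V)); other papers normalize differently, so scores from different sources are not always directly comparable. The label lists are toy values, not Karate Club results.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy of a labeling, in nats."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two labelings of the same items."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum(
        (c / n) * log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def nmi(a, b):
    """NMI with arithmetic-mean normalization: 2 * I(a; b) / (H(a) + H(b))."""
    denom = entropy(a) + entropy(b)
    if denom == 0:
        return 1.0  # both labelings place every item in a single cluster
    return 2 * mutual_information(a, b) / denom


truth = [0, 0, 1, 1]
renamed = [1, 1, 0, 0]    # same partition with the cluster names swapped
unrelated = [0, 1, 0, 1]  # statistically independent of truth

score_renamed = nmi(truth, renamed)
score_unrelated = nmi(truth, unrelated)
print(score_renamed, score_unrelated)
```

The first score is 1 (up to floating-point rounding) because NMI ignores cluster names, and the second is 0 because the two labelings share no information.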

Original Paper

An Information Flow Model for Conflict and Fission in Small Groups

Wayne W. Zachary (1977). Journal of Anthropological Research, 33(4), 452-473

Original data source

The Karate Club network is available from many sources. The most common machine-readable version is bundled in NetworkX as networkx.karate_club_graph(). The original paper is available from JSTOR.
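A minimal sketch of loading the NetworkX copy, assuming `networkx` is installed. Unlike the PyG version, this graph is undirected and carries the original two-faction outcome as a `club` node attribute.

```python
import networkx as nx

G = nx.karate_club_graph()

num_nodes = G.number_of_nodes()
num_edges = G.number_of_edges()  # undirected, so each friendship is counted once

# Each node records which faction the member joined after the split.
factions = {node: G.nodes[node]["club"] for node in G.nodes}

print(num_nodes, num_edges)
print(factions[0], "/", factions[33])
```

The edge count here is half of what PyG reports, because PyG stores every undirected friendship as two directed entries.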

cite_karate.bib
@article{zachary1977information,
  title={An Information Flow Model for Conflict and Fission in Small Groups},
  author={Zachary, Wayne W},
  journal={Journal of Anthropological Research},
  volume={33},
  number={4},
  pages={452--473},
  year={1977},
  publisher={University of New Mexico}
}

BibTeX citation for the Zachary Karate Club dataset.

Which dataset should I use?

Karate Club vs Cora: Karate Club (34 nodes) is for tutorials and visualization only. Cora (2,708 nodes) is the minimum for reproducible benchmarking. Graduate to Cora immediately once your code works.

Karate Club vs CLUSTER: If you want to benchmark community detection, use CLUSTER (12K synthetic graphs with 6 communities each). Karate Club provides one datapoint; CLUSTER provides 12,000.

Karate Club vs Reddit: Reddit (232K nodes) is a production-scale social graph. Karate Club teaches the concept; Reddit tests the scalability.

From tutorial to production

Karate Club is purely pedagogical. Production social network analysis operates on graphs with millions to billions of nodes, uses rich behavioral features (not identity vectors), and handles temporal dynamics (relationships form and dissolve over time). The gap from 34 nodes to production is enormous, but the conceptual foundation is the same: network structure predicts behavior.

Frequently asked questions

What is the Karate Club dataset?

The Zachary Karate Club is a social network of 34 members of a university karate club, described by Wayne Zachary in 1977. Nodes are members, and edges represent friendships (78 of them, stored in PyG as 156 directed edges). After a dispute, the club split into 2 groups; the PyG version relabels the members into 4 communities. It is one of the most cited social networks in graph theory.

How do I load Karate Club in PyTorch Geometric?

Use `from torch_geometric.datasets import KarateClub; dataset = KarateClub()`. The dataset contains a single 34-node graph with node features, edges, and community labels.

Why is such a tiny dataset important?

Karate Club is important historically (it showed that friendship structure alone could predict how a group would split) and pedagogically (it is small enough to visualize every node and edge). It is the 'Hello World' of graph ML: if you cannot run your code on 34 nodes, something is fundamentally wrong.

Can I use Karate Club for serious benchmarking?

No. With 34 nodes, there is not enough data for meaningful train/test evaluation. Use Karate Club for visualization, debugging, and tutorials only. For benchmarking, use Cora (2.7K nodes) at minimum.

What are the 4 classes in Karate Club?

The 4 classes represent community assignments obtained via modularity-based clustering. The original Zachary study identified 2 factions (the instructor's group and the administrator's group). The 4-class version provides a finer-grained community structure for node classification experiments.
