| Nodes | Edges | Features | Classes |
|---|---|---|---|
| 34 | 156 | 34 | 4 |
What Karate Club contains
The Zachary Karate Club is a real social network recorded by Wayne Zachary, who observed a university karate club from 1970 to 1972 and published the study in 1977. The 34 nodes represent club members. The underlying network has 78 undirected friendship ties observed outside of club activities; PyG stores each tie in both directions, which gives 156 edges. During Zachary's observation, a dispute between the club instructor (node 0) and the administrator (node 33) caused the club to split into two factions. For the node classification task, the PyG version uses 4 community labels obtained via modularity-based clustering, following Kipf & Welling's GCN paper.
Node features are 34-dimensional identity vectors (one-hot encoding of node ID). This means the model cannot rely on meaningful features and must learn entirely from graph structure -- making it a pure test of structural learning, albeit on a trivially small graph.
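One consequence of identity features is worth seeing in code: multiplying a one-hot feature matrix by a weight matrix just returns the weight matrix, so the first layer effectively learns one free embedding per node. The sketch below uses a hypothetical 4-node graph and arbitrary weights in plain Python rather than the real 34-node data.

```python
# Toy illustration: with identity (one-hot) features, X @ W equals W,
# so row i of the first-layer weights acts as a learned embedding for node i.

def identity(n):
    """Return an n x n identity matrix as nested lists."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

num_nodes = 4  # Karate Club would use 34; 4 keeps the example readable
X = identity(num_nodes)
# Arbitrary illustrative weights mapping 4 input features to 2 hidden units.
W = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]

XW = matmul(X, W)
print(XW == W)  # the feature matrix contributes no information of its own
```

Any signal that separates the communities therefore has to come from the neighborhood aggregation step, not from the input features.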
Why Karate Club matters
Karate Club matters for three reasons. First, it demonstrated in 1977 that social network structure alone can predict group behavior. Zachary used maximum flow/minimum cut to predict which faction each member would join, correctly assigning all but one of the 34 members. This was one of the earliest demonstrations that network topology encodes social dynamics.
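The max-flow/min-cut idea can be sketched on a toy graph. The example below is an illustration under stated assumptions, not Zachary's exact procedure or data: it uses a hypothetical 6-node graph with unit capacities and the Edmonds-Karp variant of augmenting-path search, then reads the predicted faction off the source side of the cut.

```python
from collections import deque

def min_cut_partition(num_nodes, edges, source, sink):
    """Return (max_flow, source_side) for an undirected capacitated graph."""
    # Residual capacities; an undirected edge has capacity in both directions.
    cap = [[0] * num_nodes for _ in range(num_nodes)]
    for u, v, c in edges:
        cap[u][v] += c
        cap[v][u] += c

    def bfs_parents():
        # Breadth-first search over edges that still have residual capacity.
        parent = [-1] * num_nodes
        parent[source] = source
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in range(num_nodes):
                if parent[v] == -1 and cap[u][v] > 0:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = bfs_parents()
        if parent[sink] == -1:
            break  # no augmenting path left: the flow is maximal
        # Find the bottleneck capacity along the path, then push flow through it.
        bottleneck = float("inf")
        v = sink
        while v != source:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = sink
        while v != source:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u
        flow += bottleneck

    # Nodes still reachable from the source form its side of the minimum cut.
    source_side = {v for v in range(num_nodes) if parent[v] != -1}
    return flow, source_side

# Hypothetical graph: two triangles (0,1,2) and (3,4,5) joined by the edge 2-3.
toy_edges = [(0, 1, 1), (0, 2, 1), (1, 2, 1), (2, 3, 1), (3, 4, 1), (3, 5, 1), (4, 5, 1)]
flow, source_side = min_cut_partition(6, toy_edges, source=0, sink=5)
print(flow, sorted(source_side))
```

In the Karate Club setting the source and sink would be the two leaders, and each member is predicted to join the leader on their side of the cut.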
Second, it became the standard illustration for graph algorithms. Countless community detection papers, graph clustering tutorials, and GNN introductions use Karate Club as their visual example. At 34 nodes, you can draw the entire graph, label every node, and trace message passing by hand.
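Tracing message passing by hand looks roughly like the sketch below. It applies the symmetric normalization from Kipf & Welling's GCN, D^{-1/2}(A + I)D^{-1/2}, without weights or a nonlinearity, to a hypothetical 3-node path graph (0 - 1 - 2) where only node 0 starts with a nonzero feature.

```python
import math

def gcn_propagate(num_nodes, edges, features):
    """One GCN-style propagation step: D^{-1/2} (A + I) D^{-1/2} @ features."""
    # Neighbor sets including a self-loop for every node.
    neighbors = {i: {i} for i in range(num_nodes)}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    degree = {i: len(neighbors[i]) for i in range(num_nodes)}

    dim = len(features[0])
    out = []
    for i in range(num_nodes):
        row = [0.0] * dim
        for j in neighbors[i]:
            norm = 1.0 / math.sqrt(degree[i] * degree[j])
            for k in range(dim):
                row[k] += norm * features[j][k]
        out.append(row)
    return out

path_edges = [(0, 1), (1, 2)]   # hypothetical path graph 0 - 1 - 2
x = [[1.0], [0.0], [0.0]]       # only node 0 carries a signal

h1 = gcn_propagate(3, path_edges, x)   # after one step
h2 = gcn_propagate(3, path_edges, h1)  # after two steps
print(h1)
print(h2)
```

After one step the signal has reached node 1 but not node 2, which sits two hops away; it takes a second step for node 2 to receive anything. The same reasoning scales to the 34-node graph, just with more arithmetic.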
Third, it is the fastest possible sanity check. Loading Karate Club and running one epoch of GCN takes milliseconds. If your code crashes on Karate Club, the bug is in your code, not your data or hardware.
Loading Karate Club in PyG
from torch_geometric.datasets import KarateClub

# The dataset contains a single graph.
dataset = KarateClub()
data = dataset[0]

print(f"Nodes: {data.num_nodes}")  # 34
print(f"Edges: {data.num_edges}")  # 156 (78 undirected ties, stored in both directions)
print(f"Features: {data.num_features}")  # 34
print(f"Classes: {data.y.max().item() + 1}")  # 4

The simplest PyG dataset to load. No download needed -- the data is hardcoded.
Common tasks and visualization
Two tasks are common: node classification (predicting each member's community label) and community detection (discovering the 2 factions or the 4 labeled communities). The real value is visualization: plotting node embeddings after GNN training typically shows clear cluster separation, illustrating that GNNs can learn meaningful representations from graph structure alone. Many PyG tutorials start with this exact exercise.
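Community detection needs a score for how good a candidate grouping is. One common choice, and the objective that Louvain optimizes, is modularity; it is brought in here as an illustration and is not something this section defines. The sketch below computes it for a hypothetical graph of two triangles joined by a single edge.

```python
def modularity(edges, community):
    """Newman modularity Q for an undirected, unweighted edge list.

    Q = sum over communities c of (L_c / m) - (d_c / (2m))^2, where m is the
    number of edges, L_c the edges inside c, and d_c the total degree of c.
    """
    m = len(edges)
    intra = {}       # edges with both endpoints in the same community
    degree_sum = {}  # summed node degrees per community
    for u, v in edges:
        for node in (u, v):
            c = community[node]
            degree_sum[c] = degree_sum.get(c, 0) + 1
        if community[u] == community[v]:
            c = community[u]
            intra[c] = intra.get(c, 0) + 1
    return sum(intra.get(c, 0) / m - (d / (2 * m)) ** 2 for c, d in degree_sum.items())

# Hypothetical graph: triangles (0,1,2) and (3,4,5) joined by the edge 2-3.
toy_edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]

two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "a" for node in range(6)}

q_split = modularity(toy_edges, two_groups)
q_single = modularity(toy_edges, one_group)
print(q_split, q_single)
```

The split along the bridge scores higher than lumping everyone together, which is the behavior a community detection method is trying to exploit.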
Example: organizational network analysis
Karate Club's scenario -- a group splitting along social lines -- maps directly to organizational dynamics. Companies experience team splits, department reorganizations, and cultural fractures that follow communication network patterns. Network analysis on internal communication graphs (email, Slack, meeting co-attendance) can identify emerging factions before they become visible to management. The principles Zachary demonstrated in 1977 are applied in modern organizational analytics.
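As a rough sketch of how such a communication graph might be built, the example below turns meeting attendance lists into weighted co-attendance edges. The names and meeting records are hypothetical, and a real pipeline would also need to handle privacy, time windows, and noise.

```python
from collections import Counter
from itertools import combinations

# Hypothetical meeting records: each entry lists who attended one meeting.
meetings = [
    ["ana", "ben", "cho"],
    ["ana", "ben"],
    ["dev", "eli", "fay"],
    ["dev", "eli"],
    ["cho", "dev"],
]

def co_attendance_edges(meetings):
    """Count, for every pair of people, how many meetings they shared."""
    weights = Counter()
    for attendees in meetings:
        # Sort so each undirected pair always appears as the same (a, b) key.
        for a, b in combinations(sorted(set(attendees)), 2):
            weights[(a, b)] += 1
    return weights

edge_weights = co_attendance_edges(meetings)
for (a, b), w in sorted(edge_weights.items()):
    print(a, b, w)
```

In this toy data the only tie between the two clusters is the single meeting shared by "cho" and "dev", the same kind of weak bridge that separates the two factions in Zachary's network.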
Historical benchmark results
Karate Club is too small for rigorous benchmarking, but these results from community detection literature are commonly cited.
| Method | NMI | Year | Paper |
|---|---|---|---|
| Min-cut (Zachary) | ~1.0 | 1977 | Zachary |
| Spectral clustering | ~1.0 | 2007 | von Luxburg |
| Louvain | ~0.69 | 2008 | Blondel et al. |
| DeepWalk | ~1.0 | 2014 | Perozzi et al. |
| GCN (2-layer) | ~1.0 | 2016 | Kipf & Welling |
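NMI (Normalized Mutual Information) measures how well a predicted partition agrees with the reference partition, ranging from 0 (no shared information) to 1 (identical up to relabeling). Most methods achieve near-perfect results because the 2-community split is well-separated. The 4-community version is harder.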
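For readers who want to see what the NMI column measures, here is a minimal pure-Python sketch. It normalizes mutual information by the arithmetic mean of the two entropies, which is one of several conventions in use; the papers cited in the table may use a different one.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def nmi(labels_a, labels_b):
    """Normalized mutual information: 2 * I(A; B) / (H(A) + H(B))."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    mutual_info = sum(
        (c / n) * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    h_a, h_b = entropy(labels_a), entropy(labels_b)
    if h_a + h_b == 0:
        return 1.0  # both partitions put everything in one group
    return 2 * mutual_info / (h_a + h_b)

truth = [0, 0, 0, 1, 1, 1]
relabeled = [1, 1, 1, 0, 0, 0]  # same grouping, different label names
score_same = nmi(truth, relabeled)

unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])  # partitions share no information
print(score_same, unrelated)
```

Because NMI ignores label names, a method that finds the right groups but numbers them differently still scores 1.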
Original Paper
An Information Flow Model for Conflict and Fission in Small Groups
Wayne W. Zachary (1977). Journal of Anthropological Research, 33(4), 452-473
Original data source
The Karate Club network is available from many sources. The most common machine-readable version is bundled in NetworkX as networkx.karate_club_graph(). The original paper is available from JSTOR.
@article{zachary1977information,
title={An Information Flow Model for Conflict and Fission in Small Groups},
author={Zachary, Wayne W},
journal={Journal of Anthropological Research},
volume={33},
number={4},
pages={452--473},
year={1977},
publisher={University of New Mexico}
}

BibTeX citation for the Zachary Karate Club dataset.
Which dataset should I use?
Karate Club vs Cora: Karate Club (34 nodes) is for tutorials and visualization only. Cora (2,708 nodes) is the minimum for reproducible benchmarking. Graduate to Cora immediately once your code works.
Karate Club vs CLUSTER: If you want to benchmark community detection, use CLUSTER (12K synthetic graphs with 6 communities each). Karate Club provides one datapoint; CLUSTER provides 12,000.
Karate Club vs Reddit: Reddit (232K nodes) is a production-scale social graph. Karate Club teaches the concept; Reddit tests the scalability.
From tutorial to production
Karate Club is purely pedagogical. Production social network analysis operates on graphs with millions to billions of nodes, uses rich behavioral features (not identity vectors), and handles temporal dynamics (relationships form and dissolve over time). The gap from 34 nodes to production is enormous, but the conceptual foundation is the same: network structure predicts behavior.