What is the Karate Club dataset?

The Zachary Karate Club is a social network of 34 members of a university karate club, described by Wayne Zachary in a 1977 paper. Nodes are members; edges represent friendships (78 undirected ties, which PyTorch Geometric stores as 156 directed edges). After a dispute, the club split into 2 factions; the PyTorch Geometric version labels the nodes with 4 communities instead. It is one of the most widely cited social networks in network science.

How do I load Karate Club in PyTorch Geometric?

Use `from torch_geometric.datasets import KarateClub; dataset = KarateClub()`. The dataset contains a single 34-node graph with node features, edges, and community labels.

Why is such a tiny dataset important?

Karate Club is important historically (it has long been the standard demonstration case for community detection algorithms) and pedagogically (it is small enough to visualize every node and edge). It is the 'Hello World' of graph ML: if you cannot run your code on 34 nodes, something is fundamentally wrong.

Can I use Karate Club for serious benchmarking?

No. With 34 nodes, there is not enough data for meaningful train/test evaluation. Use Karate Club for visualization, debugging, and tutorials only. For benchmarking, use Cora (2.7K nodes) at minimum.

What are the 4 classes in Karate Club?

The 4 classes represent community assignments obtained via modularity-based clustering, following the setup used in the GCN paper by Kipf & Welling. The original Zachary study identified 2 factions (the instructor's group and the administrator's group). The 4-class version provides a finer-grained community structure for node classification experiments.

Karate Club Dataset: The Famous 34-Node Social Network | PyG Guide

Nodes: 34

Edges: 156

Features: 34

Classes: 4

What Karate Club contains

The Zachary Karate Club is a real social network documented by Wayne Zachary at a university karate club and published in 1977. The 34 nodes represent club members. The 78 friendship ties, observed outside of club activities, are stored in PyG as 156 directed edges (each tie appears once in each direction). During Zachary's observation, a dispute between the club instructor (node 0) and the administrator (node 33) caused the club to split into two factions. The PyG version uses 4 community labels obtained via modularity-based clustering for the node classification task.

Node features are 34-dimensional identity vectors (one-hot encoding of node ID). This means the model cannot rely on meaningful features and must learn entirely from graph structure -- making it a pure test of structural learning, albeit on a trivially small graph.

Why Karate Club matters

Karate Club matters for three reasons. First, it demonstrated in 1977 that social network structure alone can predict group behavior. Zachary used maximum flow/minimum cut to predict which faction each member would join, achieving near-perfect accuracy (all but one member was assigned correctly). This was one of the earliest demonstrations that network topology encodes social dynamics.
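The max-flow/min-cut idea can be sketched with NetworkX. This is a simplified illustration, not a reproduction of Zachary's analysis: it assumes unit capacity on every tie, whereas the original study weighted ties by the strength of the relationship, so the resulting split may differ from the published one. The names `instructor_side`, `admin_side`, and `agree` are illustrative.

```python
import networkx as nx

G = nx.karate_club_graph()

# Assumption: every friendship tie has capacity 1 (unweighted simplification).
for u, v in G.edges():
    G[u][v]["capacity"] = 1

# Cut the network between the instructor (node 0) and the administrator (node 33).
cut_value, (instructor_side, admin_side) = nx.minimum_cut(G, 0, 33)

# Compare each side of the cut with the faction recorded in the "club" node attribute.
agree = sum(
    (G.nodes[n]["club"] == "Mr. Hi") == (n in instructor_side)
    for n in G.nodes
)

print(f"Ties crossing the cut: {cut_value}")
print(f"Members matching their recorded faction: {agree} of {G.number_of_nodes()}")
```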

Second, it became the standard illustration for graph algorithms. Countless community detection papers, graph clustering tutorials, and GNN introductions use Karate Club as their visual example. At 34 nodes, you can draw the entire graph, label every node, and trace message passing by hand.

Third, it is the fastest possible sanity check. Loading Karate Club and running one epoch of GCN takes milliseconds. If your code crashes on Karate Club, the bug is in your code, not your data or hardware.

Loading Karate Club in PyG

load_karate.py

from torch_geometric.datasets import KarateClub

dataset = KarateClub()
data = dataset[0]

print(f"Nodes: {data.num_nodes}")        # 34
print(f"Edges: {data.num_edges}")        # 156 (78 undirected ties, stored in both directions)
print(f"Features: {data.num_features}")  # 34
print(f"Classes: {dataset.num_classes}")  # 4

The simplest PyG dataset to load. No download needed -- the data is hardcoded.

Common tasks and visualization

The common tasks are node classification (predict community membership) and community detection (discover the 2 or 4 groups). The real value is visualization: plotting node embeddings after GNN training shows clear cluster separation, demonstrating that GNNs learn meaningful representations from graph structure. Many PyG tutorials start with this exact exercise.
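For the community detection task, a quick baseline is available in NetworkX. The sketch below uses greedy modularity maximization, chosen here only because it is built in; it is not claimed to reproduce the 4 labels shipped with PyG, and the number of groups it finds may differ.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.karate_club_graph()

# Greedy modularity maximization: repeatedly merge the pair of groups
# that most increases modularity until no merge improves it.
communities = greedy_modularity_communities(G)

for i, members in enumerate(communities):
    print(f"Community {i}: {sorted(members)}")
```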

Example: organizational network analysis

Karate Club's scenario -- a group splitting along social lines -- maps directly to organizational dynamics. Companies experience team splits, department reorganizations, and cultural fractures that follow communication network patterns. Network analysis on internal communication graphs (email, Slack, meeting co-attendance) can identify emerging factions before they become visible to management. The principles Zachary demonstrated in 1977 are applied in modern organizational analytics.

Historical benchmark results

Karate Club is too small for rigorous benchmarking, but these results from community detection literature are commonly cited.

Method	NMI	Year	Paper
Min-cut (Zachary)	~1.0	1977	Zachary
Spectral clustering	~1.0	2007	von Luxburg
Louvain	~0.69	2008	Blondel et al.
DeepWalk	~1.0	2014	Perozzi et al.
GCN (2-layer)	~1.0	2016	Kipf & Welling

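NMI (Normalized Mutual Information) measures how closely a predicted clustering agrees with the ground-truth partition, with 1.0 indicating a perfect match and values near 0 indicating no agreement. Most methods achieve near-perfect results because the 2-community split is well-separated. The 4-community version is harder.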
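To make the metric concrete, here is a minimal standard-library sketch of NMI. It assumes the arithmetic-mean normalization, 2·I(A;B) / (H(A) + H(B)); papers differ on the normalization, so values computed this way may not match every published number. The function names are illustrative.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy of a labeling, in nats."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two labelings of the same items."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    return sum(
        (c / n) * log((c * n) / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def nmi(a, b):
    """NMI with arithmetic-mean normalization (an assumed convention)."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0 and h_b == 0:
        return 1.0  # both labelings put everything in one cluster
    return 2 * mutual_information(a, b) / (h_a + h_b)


truth = [0, 0, 1, 1]
same_split_renamed = [1, 1, 0, 0]  # identical partition, different label names
unrelated_split = [0, 1, 0, 1]     # statistically independent of truth

print(nmi(truth, same_split_renamed))
print(nmi(truth, unrelated_split))
```

NMI compares partitions rather than label names, which is why the renamed split scores as a perfect match.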

Original Paper

An Information Flow Model for Conflict and Fission in Small Groups

Wayne W. Zachary (1977). Journal of Anthropological Research, 33(4), 452-473

Original data source

The Karate Club network is available from many sources. The most common machine-readable version is bundled in NetworkX as networkx.karate_club_graph(). The original paper is available from JSTOR.
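Loading the NetworkX copy takes one call. The sketch below also reads the `club` node attribute, which records the two original factions; the variable names are illustrative.

```python
import networkx as nx

G = nx.karate_club_graph()

n_nodes = G.number_of_nodes()
n_edges = G.number_of_edges()  # undirected ties, each counted once

# Each node carries the faction it joined after the split.
factions = {G.nodes[n]["club"] for n in G.nodes}

print(f"Nodes: {n_nodes}, undirected edges: {n_edges}")
print(f"Factions: {sorted(factions)}")
```

NetworkX counts each undirected tie once, so it reports half the edge count that PyG does for the same graph.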

cite_karate.bib

@article{zachary1977information,
  title={An Information Flow Model for Conflict and Fission in Small Groups},
  author={Zachary, Wayne W},
  journal={Journal of Anthropological Research},
  volume={33},
  number={4},
  pages={452--473},
  year={1977},
  publisher={University of New Mexico}
}

BibTeX citation for the Zachary Karate Club dataset.

Which dataset should I use?

Karate Club vs Cora: Karate Club (34 nodes) is for tutorials and visualization only. Cora (2,708 nodes) is the minimum for reproducible benchmarking. Graduate to Cora immediately once your code works.

Karate Club vs CLUSTER: If you want to benchmark community detection, use CLUSTER (12K synthetic graphs with 6 communities each). Karate Club provides one datapoint; CLUSTER provides 12,000.

Karate Club vs Reddit: Reddit (232K nodes) is a production-scale social graph. Karate Club teaches the concept; Reddit tests the scalability.

From tutorial to production

Karate Club is purely pedagogical. Production social network analysis operates on graphs with millions to billions of nodes, uses rich behavioral features (not identity vectors), and handles temporal dynamics (relationships form and dissolve over time). The gap from 34 nodes to production is enormous, but the conceptual foundation is the same: network structure predicts behavior.

Karate Club: 34 Nodes That Launched a Thousand Papers