- Nodes: 232,965
- Edges: 114.6M
- Features: 602
- Classes: 41
What Reddit contains
Reddit is a social graph built from one month of Reddit posts (September 2014). Each of the 232,965 nodes represents a post. An edge connects two posts if the same user commented on both, creating 114,615,892 co-comment links. Node features are 602-dimensional vectors derived from post content (GloVe word embeddings of the post title and comments). The task is to predict which of 41 subreddit communities a post belongs to.
The graph is massive compared to Planetoid benchmarks: 86x more nodes than Cora and over 10,000x more edges. A single full-batch forward pass through a GCN on Reddit requires storing and multiplying matrices that exceed typical GPU memory. This is the defining characteristic that makes Reddit important.
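To make the memory pressure concrete, here is a back-of-envelope calculation (a rough sketch, not a profiler measurement) of what a full-batch pass must hold, assuming float32 features and PyG's int64 `edge_index` storage:

```python
# Back-of-envelope memory estimate for full-batch GCN on Reddit.
NUM_NODES = 232_965
NUM_EDGES = 114_615_892
NUM_FEATURES = 602

GIB = 1024 ** 3

# Node feature matrix X: num_nodes x num_features, float32 (4 bytes)
features_gib = NUM_NODES * NUM_FEATURES * 4 / GIB

# Sparse connectivity as edge_index: 2 x num_edges, int64 (8 bytes)
edge_index_gib = 2 * NUM_EDGES * 8 / GIB

# A dense adjacency matrix would be hopeless: N^2 float32 entries
dense_adj_gib = NUM_NODES ** 2 * 4 / GIB

print(f"features:   {features_gib:.2f} GiB")   # ~0.5 GiB
print(f"edge_index: {edge_index_gib:.2f} GiB") # ~1.7 GiB
print(f"dense adj:  {dense_adj_gib:.0f} GiB")  # ~200 GiB
```

The graph structure and features alone exceed 2 GiB before counting per-layer activations and gradients, which scale with the number of nodes times the hidden dimension.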
Why Reddit matters
Reddit is the dataset that forced the GNN community to solve scalability. When Hamilton, Ying, and Leskovec introduced GraphSAGE in 2017, they used Reddit as the primary benchmark specifically because existing GNN methods could not handle it efficiently. The result was the neighbor sampling paradigm: instead of aggregating all neighbors, sample a fixed number per hop. This single innovation made GNNs practical for large graphs and remains the foundation of production GNN systems.
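The sampling idea itself is simple. Below is a minimal pure-Python sketch of fixed-fanout neighbor sampling (illustrative only, not the GraphSAGE implementation; `sample_khop` and the toy adjacency dict are invented here):

```python
import random

def sample_khop(adj, seeds, fanouts, seed=0):
    """Fixed-fanout sampling: at each hop, keep at most `fanout`
    randomly chosen neighbors per frontier node."""
    rng = random.Random(seed)
    layers = [list(seeds)]
    frontier = list(seeds)
    for fanout in fanouts:
        sampled = []
        for node in frontier:
            nbrs = adj.get(node, [])
            sampled.extend(rng.sample(nbrs, min(fanout, len(nbrs))))
        layers.append(sampled)
        frontier = sampled
    return layers

# Toy graph: node 0 has five neighbors, but we keep at most 2 per hop.
adj = {0: [1, 2, 3, 4, 5], 1: [6, 7], 2: [8], 3: [], 4: [9], 5: []}
layers = sample_khop(adj, seeds=[0], fanouts=[2, 2])
```

The key property: the size of each sampled layer is bounded by the fanout, independent of how many neighbors a node actually has, so per-batch cost no longer depends on the degree distribution of the full graph.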
Beyond scalability, Reddit demonstrates GNNs on social data. The co-comment structure encodes community membership: posts in the same subreddit are commented on by the same users. GNNs learn this community structure implicitly through neighborhood aggregation, achieving 95%+ accuracy.
Loading Reddit in PyG
from torch_geometric.datasets import Reddit
from torch_geometric.loader import NeighborLoader
dataset = Reddit(root='/tmp/Reddit')
data = dataset[0]
print(f"Nodes: {data.num_nodes}") # 232965
print(f"Edges: {data.num_edges}") # 114615892
# Use NeighborLoader for mini-batch training
train_loader = NeighborLoader(
    data,
    num_neighbors=[25, 10],
    batch_size=1024,
    input_nodes=data.train_mask,
)
# Each batch samples up to 25 first-hop and 10 second-hop neighbors per node
NeighborLoader samples subgraphs for mini-batch training, which is essential for fitting Reddit in GPU memory.
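Fixed fanouts also bound how much of the graph each batch touches. A quick worst-case count for the loader configuration above (an upper bound; real batches are smaller because overlapping neighborhoods are deduplicated):

```python
batch_size = 1024
fanouts = [25, 10]  # neighbors kept at hop 1 and hop 2

total = frontier = batch_size
for fanout in fanouts:
    frontier *= fanout   # frontier grows by the fanout each hop
    total += frontier

print(total)  # 1024 * (1 + 25 + 25*10) = 282,624 nodes at most per batch
```

So each batch touches at most ~283K node slots regardless of the 114M edges in the full graph, which is what makes the memory footprint predictable.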
Original Paper
Inductive Representation Learning on Large Graphs
William L. Hamilton, Rex Ying, Jure Leskovec (2017). NeurIPS 2017
Benchmark comparison (inductive split)
| Method | F1-micro | Year | Paper |
|---|---|---|---|
| GCN (full-batch) | 93.4% | 2017 | Kipf & Welling |
| GraphSAGE | 95.4% | 2017 | Hamilton et al. |
| GAT | ~94.5% | 2018 | Velickovic et al. |
| ClusterGCN | 96.6% | 2019 | Chiang et al. |
| GraphSAINT | 97.0% | 2020 | Zeng et al. |
Which large-scale social dataset should I use?
Flickr (89K nodes, 7 classes) is medium-scale with noisy edges -- use it to test robustness to imperfect graph structure. Reddit (232K nodes, 41 classes) is the standard scalability benchmark with clean community structure -- use it to prove your sampling and mini-batching work. Yelp (716K nodes, 100 multi-labels) is larger and tests multi-label classification -- use it for more complex output tasks. Reddit is the most widely cited of the three and the best starting point for scalability research.
Common tasks and benchmarks
The standard task is inductive node classification. The dataset has a time-based split: posts from the first 20 days of the month form the training set, and the remaining days are divided between validation and testing. This inductive setup means the model must generalize to unseen posts, not just classify posts it saw during training.
Benchmarks: GraphSAGE ~95.4%, GCN (full-batch) ~93.4%, GAT ~94.5%, ClusterGCN ~96.6%, GraphSAINT ~97.0%. The accuracy differences are small, but the computational costs vary by 10x. Training time and memory usage are the real metrics that differentiate methods on Reddit.
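Results on Reddit are reported as micro-averaged F1. For a single-label task like this one, micro-F1 reduces to plain accuracy; a small pure-Python sketch (the `micro_f1` helper is invented here for illustration) makes that explicit:

```python
def micro_f1(y_true, y_pred):
    """Micro-F1: pool per-class TP/FP/FN across all classes
    before computing precision and recall."""
    tp = fp = fn = 0
    for c in set(y_true) | set(y_pred):
        tp += sum(1 for t, p in zip(y_true, y_pred) if p == c and t == c)
        fp += sum(1 for t, p in zip(y_true, y_pred) if p == c and t != c)
        fn += sum(1 for t, p in zip(y_true, y_pred) if p != c and t == c)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [0, 1, 1, 2]
y_pred = [0, 1, 2, 2]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In single-label classification every wrong prediction counts as both one false positive and one false negative, so pooled precision and recall both equal accuracy, and so does their harmonic mean. Micro-F1 only diverges from accuracy on multi-label tasks such as Yelp.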
Data source
The Reddit dataset was introduced in the GraphSAGE paper and can be downloaded from Stanford SNAP's GraphSAGE page. PyG downloads the processed version automatically.
BibTeX citation
@inproceedings{hamilton2017inductive,
  title={Inductive Representation Learning on Large Graphs},
  author={Hamilton, William L. and Ying, Rex and Leskovec, Jure},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2017}
}
Cite Hamilton et al. for both the dataset and the GraphSAGE method.
Example: community detection at scale
Social platforms need to classify content into communities, detect emerging topics, and identify cross-community influence. Reddit's structure maps directly to these tasks. A company monitoring brand mentions across social media needs to understand which communities are discussing their product and how sentiment spreads between communities. GNNs trained on co-comment graphs capture exactly this inter-community signal.
From benchmark to production
Reddit's 232K nodes are large by benchmark standards but small by production standards. Facebook has billions of users, Twitter has hundreds of millions of daily active conversations, and enterprise customer graphs routinely exceed 100 million nodes. The sampling strategies proven on Reddit (GraphSAGE, ClusterGCN) are necessary but not sufficient for these scales. Production systems require distributed training, graph partitioning, and efficient serving infrastructure.