
Reddit: The Dataset That Made GNNs Scale

Reddit is a social network of 232,965 posts connected by 114 million co-comment edges. It is where GNN scalability research began: too large for full-batch GCN, perfectly sized to test whether your sampling strategy, mini-batching, and distributed training actually work.

PyTorch Geometric

TL;DR

  • Reddit has 232,965 post nodes, 114,615,892 edges, 602-dimensional features, and 41 subreddit classes. It is the standard large-scale GNN benchmark.
  • This is where full-batch GCN breaks down. The graph requires mini-batching or neighbor sampling (GraphSAGE, ClusterGCN) to fit in GPU memory.
  • GraphSAGE achieves ~95.4% accuracy using inductive neighbor sampling. The high accuracy reflects Reddit's clean community structure, but the engineering challenge is the real test.
  • Reddit represents the scale inflection point: methods that work here can be adapted for production graphs with millions to billions of nodes.
  • KumoRFM handles graphs orders of magnitude larger than Reddit. Its distributed graph transformer processes billion-node enterprise graphs without manual sampling or batching strategies.

232,965 nodes · 114.6M edges · 602 features · 41 classes

What Reddit contains

Reddit is a social graph built from one month of Reddit posts (September 2014). Each of the 232,965 nodes represents a post. An edge connects two posts if the same user commented on both, creating 114,615,892 co-comment links. Node features are 602-dimensional vectors derived from post content (GloVe word embeddings of the post title and comments). The task is to predict which of 41 subreddit communities a post belongs to.

The graph is massive compared to Planetoid benchmarks: 86x more nodes than Cora and over 10,000x more edges. A single full-batch forward pass through a GCN on Reddit requires storing and multiplying matrices that exceed typical GPU memory. This is the defining characteristic that makes Reddit important.
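A back-of-envelope estimate makes the memory pressure concrete. This sketch uses the dataset's published sizes with float32 features and an int64 COO edge index; the 256-dimensional hidden size is an illustrative choice, not a fixed requirement.

```python
# Back-of-envelope memory estimate for full-batch GNN training on Reddit.
# Assumes float32 features/activations and an int64 COO edge index.
NODES, EDGES, FEATS, HIDDEN = 232_965, 114_615_892, 602, 256

def gib(num_bytes):
    """Convert a byte count to GiB."""
    return num_bytes / 2**30

features = NODES * FEATS * 4     # input feature matrix X
edge_index = EDGES * 2 * 8       # COO edge index (2 x E, int64)
hidden = NODES * HIDDEN * 4      # one layer of hidden activations

print(f"features:          {gib(features):.2f} GiB")
print(f"edge index:        {gib(edge_index):.2f} GiB")
print(f"activations/layer: {gib(hidden):.2f} GiB")
# The edge index alone is ~1.7 GiB. Worse, any implementation that
# materializes per-edge messages (E x HIDDEN floats, ~109 GiB at
# hidden=256) is far beyond any single GPU's memory.
```

The node-level tensors look manageable in isolation; it is the edge-proportional buffers created during message passing that blow past GPU memory, which is why sampling bounds the number of edges touched per batch.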

Why Reddit matters

Reddit is the dataset that forced the GNN community to solve scalability. When Hamilton, Ying, and Leskovec introduced GraphSAGE in 2017, they used Reddit as the primary benchmark specifically because existing GNN methods could not handle it efficiently. The result was the neighbor sampling paradigm: instead of aggregating all neighbors, sample a fixed number per hop. This single innovation made GNNs practical for large graphs and remains the foundation of production GNN systems.
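The fanout idea can be sketched in a few lines of plain Python. This is a toy illustration over an adjacency dict, not PyG's or GraphSAGE's actual implementation; the function name and graph are made up for the example.

```python
import random

def sample_neighbors(adj, seeds, fanouts, seed=0):
    """GraphSAGE-style fixed-fanout sampling: at each hop, draw at most
    `fanout` neighbors per frontier node instead of expanding all of them.
    `adj` maps node -> list of neighbors; returns the frontier per hop."""
    rng = random.Random(seed)
    frontiers = [list(seeds)]
    for fanout in fanouts:
        nxt = []
        for v in frontiers[-1]:
            nbrs = adj.get(v, [])
            nxt.extend(rng.sample(nbrs, min(fanout, len(nbrs))))
        frontiers.append(nxt)
    return frontiers

# Toy graph: node 0 has 5 neighbors, but the fanout caps each hop at 2,
# so the cost per seed is bounded regardless of true node degree.
adj = {0: [1, 2, 3, 4, 5], 1: [0, 2], 2: [0], 3: [0], 4: [0], 5: [0]}
hops = sample_neighbors(adj, seeds=[0], fanouts=[2, 2])
print([len(h) for h in hops])  # hop sizes stay bounded by the fanouts
```

With fanouts of [25, 10] (as in the NeighborLoader example below), each seed node touches at most 25 + 25*10 neighbors, no matter how dense the graph is.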

Beyond scalability, Reddit demonstrates GNNs on social data. The co-comment structure encodes community membership: posts in the same subreddit are commented on by the same users. GNNs learn this community structure implicitly through neighborhood aggregation, achieving 95%+ accuracy.
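A minimal mean-aggregation round (the GraphSAGE-mean idea, run on a hypothetical two-community toy graph) shows why neighborhood aggregation recovers community structure:

```python
def mean_aggregate(features, adj):
    """One round of mean-neighbor aggregation: each node's new
    representation is the average of its neighbors' features. Nodes whose
    neighbors share a community end up with similar vectors."""
    out = {}
    for v, nbrs in adj.items():
        if not nbrs:
            out[v] = features[v][:]
            continue
        dim = len(features[v])
        agg = [0.0] * dim
        for u in nbrs:
            for i in range(dim):
                agg[i] += features[u][i]
        out[v] = [x / len(nbrs) for x in agg]
    return out

# Two tight communities, {0, 1} and {2, 3}, with one-hot starting features
feats = {0: [1, 0], 1: [1, 0], 2: [0, 1], 3: [0, 1]}
adj = {0: [1], 1: [0], 2: [3], 3: [2]}
print(mean_aggregate(feats, adj))
# Nodes in the same community keep matching representations
```

On Reddit the same mechanism operates over co-comment edges: posts from the same subreddit share commenters, so their aggregated representations converge, which is what makes the 41-way classification comparatively easy once the graph fits in memory.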

Loading Reddit in PyG

load_reddit.py
from torch_geometric.datasets import Reddit
from torch_geometric.loader import NeighborLoader

dataset = Reddit(root='/tmp/Reddit')
data = dataset[0]

print(f"Nodes: {data.num_nodes}")   # 232965
print(f"Edges: {data.num_edges}")   # 114615892

# Use NeighborLoader for mini-batch training
train_loader = NeighborLoader(
    data, num_neighbors=[25, 10],
    batch_size=1024, input_nodes=data.train_mask,
)
# Each batch samples 25 1-hop and 10 2-hop neighbors per node

NeighborLoader samples subgraphs for mini-batch training. Essential for fitting Reddit in GPU memory.

Original Paper

Inductive Representation Learning on Large Graphs

William L. Hamilton, Rex Ying, Jure Leskovec (2017). NeurIPS 2017

Read paper →

Benchmark comparison (inductive split)

| Method           | F1-micro | Year | Paper             |
| ---------------- | -------- | ---- | ----------------- |
| GCN (full-batch) | 93.4%    | 2017 | Kipf & Welling    |
| GraphSAGE        | 95.4%    | 2017 | Hamilton et al.   |
| GAT              | ~94.5%   | 2018 | Veličković et al. |
| ClusterGCN       | 96.6%    | 2019 | Chiang et al.     |
| GraphSAINT       | 97.0%    | 2020 | Zeng et al.       |

Which large-scale social dataset should I use?

Flickr (89K nodes, 7 classes) is medium-scale with noisy edges -- use it to test robustness to imperfect graph structure. Reddit (232K nodes, 41 classes) is the standard scalability benchmark with clean community structure -- use it to prove your sampling and mini-batching work. Yelp (716K nodes, 100 multi-labels) is larger and tests multi-label classification -- use it for more complex output tasks. Reddit is the most widely cited of the three and the best starting point for scalability research.

Common tasks and benchmarks

The standard task is inductive node classification. The split is time-based: posts from the first 20 days form the training set, and posts from the remaining days of the month are held out for validation and testing. This inductive setup means the model must generalize to unseen posts, not just classify posts it saw during training.
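A time-based split like this can be sketched as a simple day-threshold rule. Only the 20-day training cutoff comes from the source; the validation cutoff and the toy post-to-day mapping below are illustrative assumptions.

```python
def temporal_split(post_days, train_last_day=20, val_last_day=23):
    """Assign each post to train/val/test by its day index, mirroring an
    inductive, time-based split: early days train, later days are held
    out. The val_last_day cutoff here is an illustrative assumption."""
    train, val, test = [], [], []
    for post, day in post_days.items():
        if day <= train_last_day:
            train.append(post)
        elif day <= val_last_day:
            val.append(post)
        else:
            test.append(post)
    return train, val, test

# Hypothetical posts tagged with the day of the month they were created
posts = {"a": 3, "b": 20, "c": 21, "d": 25}
train, val, test = temporal_split(posts)
print(train, val, test)  # ['a', 'b'] ['c'] ['d']
```

Because held-out posts are strictly later in time, a model never sees a test node (or its features) during training, which is what distinguishes this inductive evaluation from the transductive splits used on the Planetoid datasets.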

Benchmarks: GraphSAGE ~95.4%, GCN (full-batch) ~93.4%, GAT ~94.5%, ClusterGCN ~96.6%, GraphSAINT ~97.0%. The accuracy differences are small, but the computational costs vary by 10x. Training time and memory usage are the real metrics that differentiate methods on Reddit.

Data source

The Reddit dataset was introduced in the GraphSAGE paper and can be downloaded from Stanford SNAP's GraphSAGE page. PyG downloads the processed version automatically.

BibTeX citation

reddit.bib
@inproceedings{hamilton2017inductive,
  title={Inductive Representation Learning on Large Graphs},
  author={Hamilton, William L. and Ying, Rex and Leskovec, Jure},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2017}
}

Cite Hamilton et al. for both the dataset and the GraphSAGE method.

Example: community detection at scale

Social platforms need to classify content into communities, detect emerging topics, and identify cross-community influence. Reddit's structure maps directly to these tasks. A company monitoring brand mentions across social media needs to understand which communities are discussing their product and how sentiment spreads between communities. GNNs trained on co-comment graphs capture exactly this inter-community signal.

From benchmark to production

Reddit's 232K nodes are large by benchmark standards but small by production standards. Facebook has billions of users, Twitter has hundreds of millions of daily active conversations, and enterprise customer graphs routinely exceed 100 million nodes. The sampling strategies proven on Reddit (GraphSAGE, ClusterGCN) are necessary but not sufficient for these scales. Production systems require distributed training, graph partitioning, and efficient serving infrastructure.

Frequently asked questions

What is the Reddit dataset?

Reddit is a social network graph where 232,965 nodes represent posts from September 2014. Edges (114M) connect posts that were commented on by the same user. Each post has a 602-dimensional feature vector (word embeddings). The task is to classify posts into 41 subreddit communities.

How do I load the Reddit dataset in PyTorch Geometric?

Use `from torch_geometric.datasets import Reddit; dataset = Reddit(root='/tmp/Reddit')`. The download is ~1.3GB. A GPU with at least 16GB memory is recommended for full-batch training. For memory-constrained setups, use neighbor sampling with NeighborLoader.

Why is Reddit important for GNN research?

Reddit is the standard benchmark for testing GNN scalability. At 232K nodes and 114M edges, it is the first dataset where most GNN implementations require mini-batching or sampling to fit in memory. It was the primary benchmark in the original GraphSAGE paper.

What accuracy should I expect on Reddit?

GraphSAGE achieves ~95.4% accuracy. GCN with full-batch training reaches ~93.4%. GAT achieves ~94.5%. The high baseline across methods reflects Reddit's relatively clean community structure, but the computational cost varies dramatically between approaches.

Can I train on Reddit without a GPU?

Full-batch training on Reddit requires significant memory and is impractical on CPU. However, with PyG's NeighborLoader and mini-batch training, you can train on Reddit using a modest GPU (8GB+) by sampling neighborhoods rather than loading the full graph.

Learn more about graph ML

PyTorch Geometric is the open-source foundation for graph neural networks. Explore more layers, concepts, and production patterns.