111M
Nodes
1.6B
Edges
128
Features
172
Classes
What OGB-Papers100M contains
OGB-Papers100M (ogbn-papers100M) is a citation graph from the Microsoft Academic Graph. The 111,059,956 nodes represent academic papers spanning all scientific disciplines. The 1,615,685,872 edges represent citation links. Each paper has a 128-dimensional word2vec feature vector from its title and abstract. The 172 classes are academic subject areas (spanning CS, physics, biology, medicine, and more).
The numbers are staggering. The graph is 41,000x larger than Cora, 477x larger than Reddit, and 45x larger than OGB-Products. Just storing the edge list requires ~12GB. Storing node features adds ~54GB. The full working memory during training exceeds 400GB. This is not a dataset for laptops.
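The storage figures above follow from simple arithmetic. A minimal sketch, assuming int32 edge indices in COO format (two endpoints per edge) and float32 features:

```python
# Back-of-envelope storage estimates for ogbn-papers100M
# (assumes int32 edge endpoints and float32 features)
NUM_NODES = 111_059_956
NUM_EDGES = 1_615_685_872
FEAT_DIM = 128
GIB = 1024 ** 3

edge_list_bytes = NUM_EDGES * 2 * 4       # two int32 endpoints per edge
feature_bytes = NUM_NODES * FEAT_DIM * 4  # one float32 per dimension

print(f"Edge list: {edge_list_bytes / GIB:.1f} GiB")  # ~12 GiB
print(f"Features:  {feature_bytes / GIB:.1f} GiB")    # ~53 GiB
```

Optimizer state, sampled subgraph buffers, and framework overhead during training push the working set well past these raw-storage numbers.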
Why OGB-Papers100M matters
Most GNN papers claim scalability but test on datasets that fit on a single GPU. OGB-Products (2.4M nodes) is “large” by benchmark standards but tiny by production standards. Papers100M closes this gap. At 111M nodes and 1.6B edges, it approaches the scale of real production graphs at companies like Google, Meta, and LinkedIn.
The engineering challenges are real: How do you partition a billion-edge graph across GPUs? How do you sample neighbors without cross-machine communication bottlenecks? How do you handle checkpointing and fault recovery for multi-day training runs? These questions are invisible on smaller benchmarks but dominate at Papers100M scale.
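The partitioning problem can be made concrete with a toy example (hypothetical random graph, not the real Papers100M data): a naive hash partition of nodes across machines cuts most edges, and every cut edge implies cross-machine traffic during neighbor aggregation.

```python
import random

# Toy illustration: naive hash partitioning of a random graph.
# With k machines, a random edge crosses partitions with
# probability 1 - 1/k, so nearly all aggregation is remote.
random.seed(0)
num_nodes, num_machines = 100_000, 8
edges = [(random.randrange(num_nodes), random.randrange(num_nodes))
         for _ in range(1_000_000)]

partition = lambda v: v % num_machines  # naive hash partitioner
cut = sum(partition(u) != partition(v) for u, v in edges)
print(f"Cut edge fraction: {cut / len(edges):.3f}")  # ~0.875 expected
```

This is why production systems use locality-aware partitioners (e.g. METIS-style edge-cut minimization) rather than hashing: fewer cut edges means less communication per training step.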
Loading OGB-Papers100M
from ogb.nodeproppred import PygNodePropPredDataset

# WARNING: ~57GB download, ~400GB RAM to load
dataset = PygNodePropPredDataset(name='ogbn-papers100M',
                                 root='/data/OGB')
data = dataset[0]
split_idx = dataset.get_idx_split()

print(f"Nodes: {data.num_nodes}")  # 111059956
print(f"Edges: {data.num_edges}")  # 1615685872

# Distributed training with graph partitioning is required at this scale.
# See the PyG Distributed or DistDGL documentation.
Requires significant disk space and RAM. Use memory-mapped files or distributed loading.
Original Paper
Open Graph Benchmark: Datasets for Machine Learning on Graphs
Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, Jure Leskovec (2020). NeurIPS 2020
Benchmark comparison (OGB leaderboard, test accuracy)
| Method | Accuracy | Year | Paper |
|---|---|---|---|
| MLP (no graph) | 47.24% | 2020 | OGB baseline |
| SIGN | 65.68% | 2020 | Frasca et al. |
| GraphSAGE (res_incep) | 67.06% | 2020 | OGB baseline |
| GAMLP | 67.71% | 2022 | Zhang et al. |
| GAMLP+RLU | 68.25% | 2022 | Zhang et al. |
| GIANT-XRT + GAMLP+RLU | 69.67% | 2022 | Chien et al. |
Which OGB dataset should I use?
OGB-Products (2.4M nodes, 61.9M edges) is manageable on a single GPU with sampling -- use it to prove your method scales beyond Reddit. OGB-Papers100M (111M nodes, 1.6B edges) requires distributed multi-GPU infrastructure -- use it only if you are building or validating truly distributed GNN systems. Most researchers should benchmark on OGB-Products. OGB-Papers100M is for infrastructure teams and systems researchers.
Common tasks and benchmarks
Node classification with OGB's time-based split. Due to the extreme scale, few methods have complete results. SIGN (Scalable Inception Graph Networks): ~65.7%. GraphSAGE variants: ~67.1%. Simple MLP (no graph): ~47.2%. The 20-point gap between MLP and graph methods confirms that citation structure is highly informative even at this scale. But the engineering cost of leveraging that structure is substantial.
Data source
OGB-Papers100M is part of the Open Graph Benchmark. Download via the ogb Python package or from the OGB website. Warning: the download is ~57GB and loading requires ~400GB RAM.
BibTeX citation
@inproceedings{hu2020open,
title={Open Graph Benchmark: Datasets for Machine Learning on Graphs},
author={Hu, Weihua and Fey, Matthias and Zitnik, Marinka and Dong, Yuxiao and Ren, Hongyu and Liu, Bowen and Catasta, Michele and Leskovec, Jure},
booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
year={2020}
}
Cite Hu et al. for all OGB datasets. Include the specific dataset name (ogbn-papers100M) in your paper.
Example: scientific literature search
Semantic Scholar indexes 200M+ academic papers. Classifying papers by subject area, identifying citation trends, and recommending relevant papers all operate at Papers100M scale. A GNN that can process the full citation graph provides richer paper representations than text-only models, capturing how a paper fits into the broader scientific landscape through its citation relationships.
From benchmark to production
Papers100M is close to production scale for some applications but still simpler than the most demanding real-world graphs. Production social networks have billions of nodes with dynamic edges (new connections every second). Enterprise knowledge graphs combine hundreds of entity types and relationship types. And real-time serving adds latency constraints that batch training does not face.