
OGB-Papers100M: 111 Million Nodes. The Ultimate GNN Scale Test.

OGB-Papers100M is a citation graph with 111 million papers and 1.6 billion edges. It is the largest standard GNN benchmark by a wide margin -- the dataset that separates methods that claim to scale from methods that actually do.

PyTorch Geometric

TL;DR

  • OGB-Papers100M has 111,059,956 nodes, 1,615,685,872 edges, 128 features, and 172 subject area classes. It is the largest standard GNN benchmark.
  • The graph requires ~400GB RAM to store. Training demands distributed multi-GPU infrastructure with graph partitioning. Single-GPU training is impractical.
  • Engineering dominates at this scale: graph partitioning, distributed training, memory-efficient feature storage, and fault-tolerant pipelines matter more than architecture.
  • Few methods have full results. SIGN achieves ~65.7%, GraphSAGE variants ~67%, and top methods with GIANT embeddings ~69.7%. The benchmark tests infrastructure as much as algorithms.
  • KumoRFM operates at Papers100M scale as standard. Its distributed graph transformer was designed for billion-node enterprise graphs from the start.

At a glance: 111M nodes · 1.6B edges · 128 features · 172 classes

What OGB-Papers100M contains

OGB-Papers100M (ogbn-papers100M) is a citation graph from the Microsoft Academic Graph. The 111,059,956 nodes represent academic papers spanning all scientific disciplines. The 1,615,685,872 edges represent citation links. Each paper has a 128-dimensional word2vec feature vector from its title and abstract. The 172 classes are academic subject areas (spanning CS, physics, biology, medicine, and more).

The numbers are staggering. The graph is 41,000x larger than Cora, 477x larger than Reddit, and 45x larger than OGB-Products. Just storing the edge list requires ~12GB. Storing node features adds ~54GB. The full working memory during training exceeds 400GB. This is not a dataset for laptops.
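The memory figures above follow from simple arithmetic. A minimal sketch, assuming int32 node IDs (111M fits in 32 bits) and float32 features:

```python
# Back-of-the-envelope memory math for Papers100M.
# Assumes int32 node ids and float32 features; GiB = 2**30 bytes.
num_nodes = 111_059_956
num_edges = 1_615_685_872

edge_list_gib = num_edges * 2 * 4 / 2**30   # (src, dst) pairs as int32
feat_gib = num_nodes * 128 * 4 / 2**30      # 128-dim float32 per node

print(f"edge list: {edge_list_gib:.1f} GiB")  # ~12.0 GiB
print(f"features:  {feat_gib:.1f} GiB")       # ~53.0 GiB
```

The remaining working memory during training comes from sampled subgraphs, gradients, optimizer state, and intermediate activations, which is how the total climbs past 400GB.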

Why OGB-Papers100M matters

Most GNN papers claim scalability but test on datasets that fit on a single GPU. OGB-Products (2.4M nodes) is “large” by benchmark standards but tiny by production standards. Papers100M closes this gap. At 111M nodes and 1.6B edges, it approaches the scale of real production graphs at companies like Google, Meta, and LinkedIn.

The engineering challenges are real: How do you partition a billion-edge graph across GPUs? How do you sample neighbors without cross-machine communication bottlenecks? How do you handle checkpointing and fault recovery for multi-day training runs? These questions are invisible on smaller benchmarks but dominate at Papers100M scale.
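To make partitioning concrete, here is a toy sketch of the simplest scheme, hash partitioning by source node. The function name and graph are hypothetical; production systems such as DistDGL and PyG Distributed use METIS-style partitioners that minimize edge cuts instead of a plain modulo hash:

```python
# Sketch: hash-partitioning an edge list across workers.
# Toy example; real partitioners minimize cross-partition edges.

def partition_edges(edges, num_parts):
    """Assign each edge to the partition that owns its source node."""
    parts = [[] for _ in range(num_parts)]
    for src, dst in edges:
        parts[src % num_parts].append((src, dst))  # owner = src id mod parts
    return parts

edges = [(0, 1), (1, 2), (2, 0), (3, 1), (4, 3), (5, 4)]
parts = partition_edges(edges, num_parts=2)
print(parts[0])  # [(0, 1), (2, 0), (4, 3)] -- even source ids
print(parts[1])  # [(1, 2), (3, 1), (5, 4)] -- odd source ids
```

The weakness of hashing is exactly the communication bottleneck mentioned above: a sampled neighborhood scatters across machines, which is why edge-cut-aware partitioners pay off at this scale.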

Loading OGB-Papers100M

load_ogb_papers100m.py
from ogb.nodeproppred import PygNodePropPredDataset

# WARNING: ~57GB download, ~400GB RAM to load
dataset = PygNodePropPredDataset(name='ogbn-papers100M',
                                  root='/data/OGB')
data = dataset[0]
split_idx = dataset.get_idx_split()

print(f"Nodes: {data.num_nodes}")   # 111059956
print(f"Edges: {data.num_edges}")   # 1615685872

# Distributed training with graph partitioning required
# See PyG Distributed or DistDGL documentation

Requires significant disk space and RAM. Use memory-mapped files or distributed loading.
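Since node features dominate the footprint, memory-mapping them lets training touch only the rows a batch actually needs. A minimal sketch with NumPy, using toy sizes and a hypothetical file path (Papers100M would be shape (111_059_956, 128)):

```python
import numpy as np

# Sketch: memory-mapped node features so only accessed rows hit RAM.
# Toy sizes and file path; adapt shape and dtype to the real dataset.
num_nodes, dim = 1000, 128

# One-time conversion: write features to a flat binary file on disk
feats = np.random.rand(num_nodes, dim).astype(np.float32)
feats.tofile('/tmp/feats.bin')

# Training time: open as a memmap; nothing is loaded up front
mm = np.memmap('/tmp/feats.bin', dtype=np.float32,
               mode='r', shape=(num_nodes, dim))
batch_ids = [3, 42, 999]
batch = mm[batch_ids]      # reads only these rows from disk
print(batch.shape)         # (3, 128)
```

The same pattern underlies out-of-core loaders: keep the edge structure in RAM, keep the feature matrix on disk, and fetch rows per minibatch.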

Original Paper

Open Graph Benchmark: Datasets for Machine Learning on Graphs

Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, Jure Leskovec (2020). NeurIPS 2020

Read paper →

Benchmark comparison (OGB leaderboard, test accuracy)

| Method                 | Accuracy | Year | Paper         |
|------------------------|----------|------|---------------|
| MLP (no graph)         | 47.24%   | 2020 | OGB baseline  |
| SIGN                   | 65.68%   | 2020 | Frasca et al. |
| GraphSAGE (res_incep)  | 67.06%   | 2020 | OGB baseline  |
| GAMLP                  | 67.71%   | 2022 | Zhang et al.  |
| GAMLP+RLU              | 68.25%   | 2022 | Zhang et al.  |
| GIANT-XRT + GAMLP+RLU  | 69.67%   | 2022 | Chien et al.  |

Which OGB dataset should I use?

OGB-Products (2.4M nodes, 61.9M edges) is manageable on a single GPU with sampling -- use it to prove your method scales beyond Reddit. OGB-Papers100M (111M nodes, 1.6B edges) requires distributed multi-GPU infrastructure -- use it only if you are building or validating truly distributed GNN systems. Most researchers should benchmark on OGB-Products. OGB-Papers100M is for infrastructure teams and systems researchers.
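The sampling that makes OGB-Products tractable on one GPU is fixed-fanout neighbor sampling over a CSR adjacency, the core of GraphSAGE-style minibatching. A self-contained sketch on a toy graph (the function and graph are illustrative, not a library API):

```python
import random

# Sketch: fixed-fanout neighbor sampling on a CSR adjacency.
# Toy 4-node graph: node 0 -> {1, 2}, node 1 -> {0, 2, 3},
# node 2 -> {1}, node 3 -> {} (directed, for illustration).
indptr  = [0, 2, 5, 6, 6]
indices = [1, 2, 0, 2, 3, 1]

def sample_neighbors(node, fanout, rng):
    """Return at most `fanout` neighbors of `node`, chosen uniformly."""
    nbrs = indices[indptr[node]:indptr[node + 1]]
    if len(nbrs) <= fanout:
        return list(nbrs)
    return rng.sample(nbrs, fanout)

rng = random.Random(0)
print(sample_neighbors(1, fanout=2, rng=rng))  # 2 of node 1's 3 neighbors
print(sample_neighbors(3, fanout=2, rng=rng))  # [] -- node 3 has none
```

At Papers100M scale the same operation must run against a partitioned graph, which is where cross-machine neighbor fetches become the bottleneck.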

Common tasks and benchmarks

Node classification with OGB's time-based split. Due to the extreme scale, few methods have complete results. SIGN (Scalable Inception Graph Networks): ~65.7%. GraphSAGE variants: ~67.1%. Simple MLP (no graph): ~47.2%. The 20-point gap between MLP and graph methods confirms that citation structure is highly informative even at this scale. But the engineering cost of leveraging that structure is substantial.
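SIGN's strong showing here comes from sidestepping message passing at training time: diffused features A^k X are precomputed once offline, so training reduces to an MLP over their concatenation. A toy NumPy sketch of that preprocessing (Papers100M would run this pass out-of-core or distributed):

```python
import numpy as np

# Sketch of SIGN-style preprocessing: precompute A^k X offline so
# training becomes a plain MLP over the concatenated hop features.
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=np.float64)
deg = A.sum(axis=1)
A_norm = A / deg[:, None]           # row-normalized adjacency

X = np.eye(3)                        # toy node features
hops = [X]
for _ in range(2):                   # K = 2 diffusion steps
    hops.append(A_norm @ hops[-1])   # one extra hop per step

features = np.concatenate(hops, axis=1)  # MLP input: [X | AX | A^2 X]
print(features.shape)                # (3, 9)
```

Because the expensive graph operation happens once rather than per epoch, the training loop itself needs no neighbor sampling or cross-machine communication, which is exactly why SIGN was among the first methods with full Papers100M results.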

Data source

OGB-Papers100M is part of the Open Graph Benchmark. Download via the ogb Python package or from the OGB website. Warning: the download is ~57GB and loading requires ~400GB RAM.

BibTeX citation

ogb_papers100m.bib
@inproceedings{hu2020open,
  title={Open Graph Benchmark: Datasets for Machine Learning on Graphs},
  author={Hu, Weihua and Fey, Matthias and Zitnik, Marinka and Dong, Yuxiao and Ren, Hongyu and Liu, Bowen and Catasta, Michele and Leskovec, Jure},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2020}
}

Cite Hu et al. for all OGB datasets. Include the specific dataset name (ogbn-papers100M) in your paper.

Example: scientific literature search

Semantic Scholar indexes 200M+ academic papers. Classifying papers by subject area, identifying citation trends, and recommending relevant papers all operate at Papers100M scale. A GNN that can process the full citation graph provides richer paper representations than text-only models, capturing how a paper fits into the broader scientific landscape through its citation relationships.

From benchmark to production

Papers100M is close to production scale for some applications but still simpler than the most demanding real-world graphs. Production social networks have billions of nodes with dynamic edges (new connections every second). Enterprise knowledge graphs combine hundreds of entity types and relationship types. And real-time serving adds latency constraints that batch training does not face.

Frequently asked questions

What is OGB-Papers100M?

OGB-Papers100M (ogbn-papers100M) is a citation graph of 111,059,956 papers with 1,615,685,872 edges. Each paper has a 128-dimensional word2vec feature vector and one of 172 subject area labels. It is the largest standard GNN benchmark and tests true production-scale graph processing.

Can I train on OGB-Papers100M with a single GPU?

Not practically. The graph requires ~400GB of memory to store. Training requires distributed multi-GPU setups with graph partitioning (DistDGL, PyG's distributed module), or aggressive subsampling with NeighborLoader at very small batch sizes. Most research papers use 4-8 GPUs minimum.

How do I load OGB-Papers100M?

Use `from ogb.nodeproppred import PygNodePropPredDataset; dataset = PygNodePropPredDataset(name='ogbn-papers100M')`. The download is ~57GB. Loading requires significant RAM (400GB+ recommended). Use memory-mapped files or distributed loading for constrained environments.

What results are expected on OGB-Papers100M?

Due to its extreme size, few methods have been fully evaluated. SIGN achieves ~65.7% accuracy. GraphSAGE variants achieve ~67%. Top methods with GIANT embeddings reach ~69.7%. The challenge is primarily engineering (fitting the graph in memory, efficient distributed training) rather than architecture.

Why does OGB-Papers100M exist?

Most GNN benchmarks are too small to expose real scalability challenges. Even OGB-Products (2.4M nodes) fits on a single GPU with sampling. Papers100M requires truly distributed infrastructure, closing the gap between benchmark scale and production graph sizes at companies like Google and Meta.

Learn more about graph ML

PyTorch Geometric is the open-source foundation for graph neural networks. Explore more layers, concepts, and production patterns.