
PubMed: The Largest Planetoid Dataset and a Bridge to Real-Scale Graphs

PubMed is a citation network of 19,717 medical papers about diabetes. It is 7x larger than Cora and introduces the first hints of scale -- large enough that training choices start to matter, small enough that you can still iterate quickly.


TL;DR

  • PubMed contains 19,717 papers (nodes) connected by 88,648 citations (edges). Features are 500-dimensional TF-IDF vectors. Papers belong to one of 3 diabetes categories.
  • At 7x the size of Cora, PubMed is where scalability starts to matter. Mini-batching and neighbor sampling become worthwhile optimizations.
  • GCN and GAT both achieve ~79% accuracy. The performance gap between methods narrows on PubMed's 3-class task, making it less useful for differentiating architectures.
  • PubMed bridges toy benchmarks (Cora) and production-scale graphs (OGB). Use it to validate that your training pipeline handles medium-sized graphs before scaling further.

19,717 nodes · 88,648 edges · 500 features · 3 classes

What PubMed contains

PubMed is a citation network drawn from the PubMed biomedical literature database. The 19,717 nodes are papers about diabetes, connected by 88,648 citation edges. Each paper has a 500-dimensional TF-IDF weighted word vector as features. The classification task has 3 categories: Diabetes Mellitus Type 1, Diabetes Mellitus Type 2, and Diabetes Mellitus Experimental.

With only 3 classes and relatively dense connections (average degree ~4.5), PubMed is easier to classify than CiteSeer. Its value lies in its size: at nearly 20,000 nodes and 89,000 edges, it tests whether your code and training pipeline handle graphs that no longer fit in a single matrix multiplication without memory concerns.
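The arithmetic behind those claims is easy to verify. A quick back-of-envelope check (assuming float32 features, using the counts quoted above):

```python
# Numbers from the dataset statistics above; PyG counts edges as directed.
num_nodes, num_feats, num_edges = 19_717, 500, 88_648

feature_mb = num_nodes * num_feats * 4 / 1e6  # float32 = 4 bytes per value
avg_degree = num_edges / num_nodes

print(f"Feature matrix: {feature_mb:.1f} MB")  # ~39.4 MB
print(f"Average degree: {avg_degree:.1f}")     # ~4.5
```

A ~39 MB feature matrix still fits comfortably in memory, but intermediate activations during full-batch training multiply that footprint, which is why training choices start to matter at this size.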

Why PubMed matters

PubMed sits at an important inflection point. Cora (2,708 nodes) and CiteSeer (3,327 nodes) are small enough that full-batch GCN training is instant. PubMed's 19,717 nodes are still manageable on a CPU, but you start to notice training time. This is where practitioners first encounter the scalability questions that dominate production GNN work: Should I use mini-batching? How many neighbors should I sample per layer? Is full-batch gradient descent still feasible?

The medical domain also matters. PubMed demonstrates that citation networks exist in specialized domains (not just CS), and that the graph learning approach generalizes across subject areas.

Loading PubMed in PyG

load_pubmed.py
from torch_geometric.datasets import Planetoid

dataset = Planetoid(root='/tmp/PubMed', name='PubMed')
data = dataset[0]

print(f"Nodes: {data.num_nodes}")        # 19717
print(f"Edges: {data.num_edges}")        # 88648
print(f"Features: {data.num_features}")  # 500
print(f"Classes: {dataset.num_classes}") # 3
print(f"Train nodes: {data.train_mask.sum()}")  # 60

Standard Planetoid split: 60 training nodes (20 per class), 500 validation, 1,000 test.

Original Paper

Collective Classification in Network Data

Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, Tina Eliassi-Rad (2008). AI Magazine, 29(3), 93-106


Benchmark comparison (standard Planetoid split)

Method         | Accuracy | Year | Paper
MLP (no graph) | ~71.4%   | --   | Baseline
GCN            | 79.0%    | 2017 | Kipf & Welling
GAT            | 79.0%    | 2018 | Veličković et al.
APPNP          | 80.1%    | 2019 | Klicpera et al.
GCNII          | 80.3%    | 2020 | Chen et al.

Which Planetoid dataset should I use?

Cora (2,708 nodes) is the fastest sanity check. CiteSeer (3,327 nodes) tests sparse-graph robustness. PubMed (19,717 nodes) is 7x larger and tests whether your pipeline scales. PubMed has only 3 classes, so it is less useful for differentiating architectures and more useful for validating scalability. If full-batch training on PubMed is too slow, you need sampling or mini-batching -- and you will definitely need them for Reddit or OGB.

Common tasks and benchmarks

The standard task is transductive node classification. With only 60 labeled training nodes (20 per class) out of 19,717 total, PubMed is an extreme semi-supervised setting: 0.3% of nodes are labeled. This makes graph structure critical -- the model must propagate label information through citation links to reach the vast majority of unlabeled nodes.

Benchmark results: GCN ~79.0%, GAT ~79.0%, APPNP ~80.1%, GCNII ~80.3%. The narrow performance range reflects the simplicity of the 3-class task rather than model equivalence.

Data source

The PubMed Diabetes dataset is available from the LINQS group at UC Santa Cruz. The Planetoid version used by PyG is downloaded automatically.

BibTeX citation

pubmed.bib
@article{sen2008collective,
  title={Collective Classification in Network Data},
  author={Sen, Prithviraj and Namata, Galileo and Bilgic, Mustafa and Getoor, Lise and Gallagher, Brian and Eliassi-Rad, Tina},
  journal={AI Magazine},
  volume={29},
  number={3},
  pages={93--106},
  year={2008}
}

@inproceedings{yang2016revisiting,
  title={Revisiting Semi-Supervised Learning with Graph Embeddings},
  author={Yang, Zhilin and Cohen, William and Salakhutdinov, Ruslan},
  booktitle={ICML},
  year={2016}
}

Cite Sen et al. for the dataset, Yang et al. for the Planetoid split.

Example: medical literature discovery

PubMed's structure directly mirrors a real business problem: automated literature categorization for pharmaceutical companies. Drug researchers need to monitor thousands of new papers monthly, classifying them by therapeutic area, mechanism of action, or relevance to specific drug programs. A GNN trained on citation structure can classify new papers as they appear, using their citation connections to established literature.

From benchmark to production

PubMed at 20K nodes is a preview of real medical knowledge graphs that contain millions of entities: papers, genes, proteins, drugs, diseases, and clinical trials, all interconnected. These production graphs are heterogeneous (multiple entity types), temporal (new papers daily), and massive (PubMed itself contains 35M+ articles).

Frequently asked questions

What is the PubMed dataset in PyTorch Geometric?

PubMed is a citation network of 19,717 diabetes-related scientific papers from the PubMed database. Each paper is represented by a 500-dimensional TF-IDF feature vector and classified into one of 3 categories: Diabetes Mellitus Type 1, Diabetes Mellitus Type 2, or Diabetes Mellitus Experimental. Edges represent citation links.

How is PubMed different from Cora and CiteSeer?

PubMed is 7x larger than Cora (19,717 vs 2,708 nodes) with 8x more edges (88,648 vs 10,556). It has lower-dimensional features (500 vs 1,433) and fewer classes (3 vs 7). Its larger size makes it a better test of scalability while still fitting in CPU memory.

How do I load PubMed in PyTorch Geometric?

Use `from torch_geometric.datasets import Planetoid; dataset = Planetoid(root='/tmp/PubMed', name='PubMed')`. The API is identical to Cora and CiteSeer.

What accuracy should I expect on PubMed?

GCN achieves ~79.0%, GAT ~79.0% (they tie at this scale), and APPNP ~80.1%. PubMed is easier to classify than CiteSeer because it has only 3 classes and denser connections, but the larger scale introduces computational considerations not present in Cora.

Is PubMed large enough for meaningful benchmarks?

PubMed is a useful middle ground: large enough to test basic scalability (mini-batching, sampling) but small enough to train quickly. For true large-scale benchmarks, use Reddit (232K nodes), OGB-Products (2.4M nodes), or OGB-Papers100M (111M nodes).

Learn more about graph ML

PyTorch Geometric is the open-source foundation for graph neural networks. Explore more layers, concepts, and production patterns.