19,717 Nodes · 88,648 Edges · 500 Features · 3 Classes
What PubMed contains
PubMed is a citation network drawn from the PubMed biomedical literature database. The 19,717 nodes are papers about diabetes, connected by 88,648 citation edges. Each paper has a 500-dimensional TF-IDF weighted word vector as features. The classification task has 3 categories: Diabetes Mellitus Type 1, Diabetes Mellitus Type 2, and Diabetes Mellitus Experimental.
With only 3 classes and relatively dense connections (average degree ~4.5), PubMed is easier to classify than CiteSeer. Its value lies in its size: at nearly 20,000 nodes and 89,000 edges, it tests whether your code and training pipeline handle graphs large enough that full-batch operations begin to raise memory concerns.
Why PubMed matters
PubMed sits at an important inflection point. Cora (2,708 nodes) and CiteSeer (3,327 nodes) are small enough that full-batch GCN training is instant. PubMed's 19,717 nodes are still manageable on a CPU, but you start to notice training time. This is where practitioners first encounter the scalability questions that dominate production GNN work: Should I use mini-batching? How many neighbors should I sample per layer? Is full-batch gradient descent still feasible?
The medical domain also matters. PubMed demonstrates that citation networks exist in specialized domains (not just CS), and that the graph learning approach generalizes across subject areas.
Loading PubMed in PyG
from torch_geometric.datasets import Planetoid
dataset = Planetoid(root='/tmp/PubMed', name='PubMed')
data = dataset[0]
print(f"Nodes: {data.num_nodes}") # 19717
print(f"Edges: {data.num_edges}") # 88648
print(f"Features: {data.num_features}") # 500
print(f"Classes: {dataset.num_classes}") # 3
print(f"Train nodes: {data.train_mask.sum()}") # 60
Standard Planetoid split: 60 training nodes (20 per class), 500 validation, 1,000 test.
Original Paper
Collective Classification in Network Data
Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, Tina Eliassi-Rad (2008). AI Magazine, 29(3), 93-106
Benchmark comparison (standard Planetoid split)
| Method | Accuracy | Year | Paper |
|---|---|---|---|
| MLP (no graph) | ~71.4% | -- | Baseline |
| GCN | 79.0% | 2017 | Kipf & Welling |
| GAT | 79.0% | 2018 | Veličković et al. |
| APPNP | 80.1% | 2019 | Klicpera et al. |
| GCNII | 80.3% | 2020 | Chen et al. |
Which Planetoid dataset should I use?
Cora (2,708 nodes) is the fastest sanity check. CiteSeer (3,327 nodes) tests sparse-graph robustness. PubMed (19,717 nodes) is 7x larger and tests whether your pipeline scales. PubMed has only 3 classes, so it is less useful for differentiating architectures and more useful for validating scalability. If full-batch training on PubMed is too slow, you need sampling or mini-batching -- and you will definitely need them for Reddit or OGB.
Common tasks and benchmarks
The standard task is transductive node classification. With only 60 labeled training nodes (20 per class) out of 19,717 total, PubMed is an extreme semi-supervised setting: 0.3% of nodes are labeled. This makes graph structure critical -- the model must propagate label information through citation links to reach the vast majority of unlabeled nodes.
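The arithmetic behind the "0.3% labeled" claim is worth seeing once, since it also reveals how many nodes fall outside all three masks in the standard split:

```python
# Split sizes from the standard Planetoid split of PubMed.
train, val, test, total = 60, 500, 1000, 19717

label_rate = train / total
print(f"label rate: {label_rate:.2%}")              # ~0.30% of nodes labeled
print(f"nodes outside all masks: {total - train - val - test}")
```

The large remainder of unmasked nodes still participates in message passing during training; only the loss is restricted to the 60 labeled nodes.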
Benchmark results: GCN ~79.0%, GAT ~79.0%, APPNP ~80.1%, GCNII ~80.3%. The narrow performance range reflects the simplicity of the 3-class task rather than model equivalence.
Data source
The PubMed Diabetes dataset is available from the LINQS group at UC Santa Cruz. The Planetoid version used by PyG is downloaded automatically.
BibTeX citation
@article{sen2008collective,
title={Collective Classification in Network Data},
author={Sen, Prithviraj and Namata, Galileo and Bilgic, Mustafa and Getoor, Lise and Gallagher, Brian and Eliassi-Rad, Tina},
journal={AI Magazine},
volume={29},
number={3},
pages={93--106},
year={2008}
}
@inproceedings{yang2016revisiting,
title={Revisiting Semi-Supervised Learning with Graph Embeddings},
author={Yang, Zhilin and Cohen, William and Salakhutdinov, Ruslan},
booktitle={ICML},
year={2016}
}
Cite Sen et al. for the dataset, Yang et al. for the Planetoid split.
Example: medical literature discovery
PubMed's structure directly mirrors a real business problem: automated literature categorization for pharmaceutical companies. Drug researchers need to monitor thousands of new papers monthly, classifying them by therapeutic area, mechanism of action, or relevance to specific drug programs. A GNN trained on citation structure can classify new papers as they appear, using their citation connections to established literature.
From benchmark to production
PubMed at 20K nodes is a preview of real medical knowledge graphs that contain millions of entities: papers, genes, proteins, drugs, diseases, and clinical trials, all interconnected. These production graphs are heterogeneous (multiple entity types), temporal (new papers daily), and massive (PubMed itself contains 35M+ articles).