2,708 Nodes · 10,556 Edges · 1,433 Features · 7 Classes
What Cora contains
Cora is a citation network collected from machine learning papers. Each of the 2,708 nodes represents a paper. Each of the 10,556 edges represents a citation link (paper A cites paper B). The features are bag-of-words vectors: 1,433 binary values indicating which words from a fixed dictionary appear in the paper. The task is to classify each paper into one of 7 categories: Case-Based, Genetic Algorithms, Neural Networks, Probabilistic Methods, Reinforcement Learning, Rule Learning, or Theory.
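The binary bag-of-words encoding is easy to illustrate in miniature. The toy vocabulary and "abstract" below are invented for illustration; Cora's actual dictionary has 1,433 entries:

```python
# Toy illustration of Cora-style binary bag-of-words features.
# The vocabulary and abstract are made up; Cora's real dictionary
# has 1,433 words, so each node carries a 1,433-dimensional vector.
vocab = ["network", "gradient", "genetic", "policy", "bayes"]

def bow_vector(text, vocab):
    """Return a binary vector: 1 if the vocab word appears in the text."""
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in vocab]

abstract = "We train a neural network by gradient descent"
print(bow_vector(abstract, vocab))  # [1, 1, 0, 0, 0]
```

Note that word counts are discarded: a word appearing ten times and a word appearing once both map to 1.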
Why Cora matters
Cora occupies a unique position in graph ML research. It was used in the original GCN paper (Kipf & Welling, 2016) and has appeared in the vast majority of GNN papers published since. Its importance is practical rather than scientific: Cora is a sanity check. If your new GNN layer cannot beat a basic GCN on Cora, something is wrong with your implementation before you worry about novelty.
The dataset also demonstrates why graph structure matters. A logistic regression model using only node features achieves ~61% accuracy. Adding citation structure via GCN jumps to ~81%. That 20-point gap is one of the clearest demonstrations of the value of graph learning.
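The mechanism behind that gap can be reproduced in miniature. On a synthetic homophilous graph (all sizes and parameters below are illustrative, not Cora's), averaging each node's features with its same-class neighbors denoises them, lifting even a trivial classifier's accuracy:

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_class, n_neighbors, noise = 200, 10, 2.0

# Two classes with 1-D features: class means +1 vs -1, heavy noise.
y = np.repeat([0, 1], n_per_class)
x = np.where(y == 0, 1.0, -1.0) + rng.normal(0, noise, 2 * n_per_class)

# Homophilous neighbor lists: each node "cites" same-class nodes only.
neighbors = [rng.choice(np.flatnonzero(y == y[i]), n_neighbors)
             for i in range(len(y))]

# One round of mean aggregation over neighbors -- a crude stand-in
# for a message-passing layer.
x_smooth = np.array([np.mean(np.append(x[nb], x[i]))
                     for i, nb in enumerate(neighbors)])

# Trivial classifier: predict class 0 when the feature is positive.
acc = lambda f: np.mean((f > 0) == (y == 0))
print(f"features only: {acc(x):.2f}, with neighbors: {acc(x_smooth):.2f}")
```

Averaging over ten same-class neighbors shrinks the noise by roughly a factor of sqrt(11), so the smoothed features are far more separable. Real graphs are not perfectly homophilous, which is why the gain on Cora is 20 points rather than 30.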
Loading Cora in PyG
```python
from torch_geometric.datasets import Planetoid

dataset = Planetoid(root='/tmp/Cora', name='Cora')
data = dataset[0]  # Single graph

print(f"Nodes: {data.num_nodes}")               # 2708
print(f"Edges: {data.num_edges}")               # 10556
print(f"Features: {data.num_features}")         # 1433
print(f"Classes: {dataset.num_classes}")        # 7
print(f"Train nodes: {data.train_mask.sum()}")  # 140
```

The Planetoid loader downloads Cora automatically and provides standard train/val/test splits.
Original Paper
Automating the Construction of Internet Portals with Machine Learning
Andrew McCallum, Kamal Nigam, Jason Rennie, Kristie Seymore (2000). Information Retrieval, 3(2), 127-163
Benchmark comparison (standard Planetoid split)
| Method | Accuracy | Year | Paper |
|---|---|---|---|
| MLP (no graph) | ~61.0% | -- | Baseline |
| GCN | 81.5% | 2017 | Kipf & Welling |
| GAT | 83.0% | 2018 | Velickovic et al. |
| APPNP | 83.3% | 2019 | Klicpera et al. |
| GCNII | 85.5% | 2020 | Chen et al. |
| GPSConv | ~83.2% | 2022 | Rampasek et al. |
Which Planetoid dataset should I use?
The three Planetoid datasets form a natural progression. Cora (2,708 nodes, 7 classes) is the densest and easiest -- use it as a sanity check. CiteSeer (3,327 nodes, 6 classes) is sparser (avg degree 2.7 vs 3.9), making it harder; use it to test robustness to limited neighborhood information. PubMed (19,717 nodes, 3 classes) is 7x larger but has only 3 classes; use it to validate that your training pipeline scales beyond toy size. If your GNN beats GCN on Cora but not on CiteSeer, it likely over-relies on dense graph structure. If it fails on PubMed, check your scalability.
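The density comparison follows directly from the published node and edge counts. PyG stores each undirected edge as two directed edges, so average degree is simply num_edges / num_nodes (the statistics below are the standard Planetoid numbers as reported by PyG):

```python
# Standard Planetoid statistics (PyG counts each undirected edge twice).
stats = {
    "Cora":     {"nodes": 2708,  "edges": 10556},
    "CiteSeer": {"nodes": 3327,  "edges": 9104},
    "PubMed":   {"nodes": 19717, "edges": 88648},
}

for name, s in stats.items():
    # edges / nodes = average degree, since edges are already doubled.
    print(f"{name}: avg degree {s['edges'] / s['nodes']:.1f}")
```

This recovers the 3.9 (Cora) vs 2.7 (CiteSeer) figures above, and shows PubMed is actually the densest of the three at ~4.5.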
Common tasks and benchmarks
The standard task on Cora is transductive semi-supervised node classification. Only 140 nodes (20 per class) are labeled for training, 500 for validation, and 1,000 for testing. The model must use the full graph structure during training but only predicts labels for test nodes.
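The propagation the model performs at each layer is symmetric-normalized neighbor averaging followed by a linear transform (the GCN rule from Kipf & Welling). It can be sketched in plain NumPy on a toy graph; the graph, sizes, and random weights below are illustrative, not Cora's, and nothing is trained:

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, n_feats, n_hidden = 5, 4, 3

# Toy symmetric adjacency matrix (a 5-node cycle), plus self-loops.
A = np.array([[0, 1, 0, 0, 1],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [1, 0, 0, 1, 0]], dtype=float)
A_hat = A + np.eye(n_nodes)

# Symmetric normalization D^-1/2 (A + I) D^-1/2.
d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

H = rng.normal(size=(n_nodes, n_feats))   # node features
W = rng.normal(size=(n_feats, n_hidden))  # (untrained) layer weights

# One GCN layer: aggregate normalized neighbors, transform, ReLU.
H_next = np.maximum(A_norm @ H @ W, 0)
print(H_next.shape)  # (5, 3)
```

Because `A_norm` mixes every node's features with its neighbors', two stacked layers give each node a view of its 2-hop neighborhood, which is what lets labeled nodes inform the 1,000 unlabeled test nodes.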
Benchmark results on the standard split: GCN ~81.5%, GAT ~83.0%, APPNP ~83.3%, GCNII ~85.5%. The differences are small because Cora is a relatively easy dataset. The real differentiator between architectures shows on harder, larger benchmarks.
Data source
The original Cora dataset can be downloaded from the LINQS group at UC Santa Cruz. The Planetoid version used by PyG is hosted by the PyG team and downloaded automatically.
BibTeX citation
```bibtex
@article{mccallum2000automating,
  title={Automating the Construction of Internet Portals with Machine Learning},
  author={McCallum, Andrew and Nigam, Kamal and Rennie, Jason and Seymore, Kristie},
  journal={Information Retrieval},
  volume={3},
  number={2},
  pages={127--163},
  year={2000},
  publisher={Springer}
}

@inproceedings{yang2016revisiting,
  title={Revisiting Semi-Supervised Learning with Graph Embeddings},
  author={Yang, Zhilin and Cohen, William and Salakhutdinov, Ruslan},
  booktitle={ICML},
  year={2016}
}
```

Cite McCallum et al. for the dataset, Yang et al. for the Planetoid split used by PyG.
Example: mapping to a business problem
Cora's structure (entities connected by relationships, classified by type) maps directly to enterprise problems. Replace papers with customers, citations with transactions, and categories with churn risk segments. The same message-passing principle applies: a customer's behavior is best predicted by looking at who they interact with, not just their individual features.
A bank classifying accounts by risk level is solving the same structural problem as classifying Cora papers by topic. The difference is scale (millions of accounts vs. 2,708 papers) and complexity (dozens of relationship types vs. one citation type).
From benchmark to production
Cora teaches the fundamental insight: neighborhood structure is predictive. But production graphs differ in three critical ways. First, real graphs have millions to billions of nodes, not 2,708. Second, real data is heterogeneous: customers, products, transactions, and merchants are different entity types with different features. Third, real data is temporal: a transaction from yesterday matters more than one from last year.