
Cora: The Dataset Every GNN Paper Starts With

Cora is a citation network of 2,708 machine learning papers. It is the MNIST of graph neural networks: small enough to iterate fast, established enough that every method reports results on it, and simple enough that it teaches the fundamentals without distracting complexity.


TL;DR

  • Cora contains 2,708 papers (nodes) connected by 10,556 citation links (edges). Each paper has a 1,433-dimensional bag-of-words feature vector and belongs to one of 7 CS subcategories.
  • A 2-layer GCN achieves ~81.5% accuracy; GAT reaches ~83%. The standard Planetoid split uses 140 training nodes (20 per class), making Cora a semi-supervised learning benchmark.
  • Cora trains in seconds on a CPU. It is the fastest way to validate that your GNN code works before scaling to larger datasets.
  • The dataset is too small and homogeneous to support production conclusions. If your model works on Cora but not on real data, the gap is usually heterogeneity, scale, or temporal dynamics.
  • KumoRFM handles production citation-style problems (recommendation, fraud, churn) on graphs orders of magnitude larger than Cora, with multiple node and edge types, automatically.

Cora at a glance: 2,708 nodes · 10,556 edges · 1,433 features · 7 classes

What Cora contains

Cora is a citation network collected from machine learning papers. Each of the 2,708 nodes represents a paper. Each of the 10,556 edges represents a citation link (paper A cites paper B). The features are bag-of-words vectors: 1,433 binary values indicating which words from a fixed dictionary appear in the paper. The task is to classify each paper into one of 7 categories: Case-Based, Genetic Algorithms, Neural Networks, Probabilistic Methods, Reinforcement Learning, Rule Learning, or Theory.
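
The bag-of-words encoding is easy to picture with a toy vocabulary. The words below are made up for illustration; Cora's real dictionary has 1,433 stemmed terms:

```python
# Toy illustration of Cora-style bag-of-words features.
# Cora's real dictionary has 1,433 words; this sketch uses 6.
vocab = ["network", "bayes", "gradient", "genetic", "policy", "theorem"]

def bag_of_words(text, vocab):
    """Binary vector: 1 if the vocabulary word appears in the text, else 0."""
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in vocab]

abstract = "We train a neural network by gradient descent"
print(bag_of_words(abstract, vocab))  # [1, 0, 1, 0, 0, 0]
```

Each of Cora's 2,708 feature vectors is built the same way, just over a much larger dictionary.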

Why Cora matters

Cora occupies a unique position in graph ML research. It was used in the original GCN paper (Kipf & Welling, ICLR 2017) and has since appeared in virtually every GNN paper on node classification. Its importance is practical rather than scientific: Cora is a sanity check. If your new GNN layer cannot beat a basic GCN on Cora, something is wrong with your implementation before you worry about novelty.

The dataset also demonstrates why graph structure matters. A model using only node features (an MLP or logistic regression) achieves ~61% accuracy; adding citation structure via a GCN jumps to ~81.5%. That roughly 20-point gap is one of the clearest demonstrations of graph learning's value in the literature.
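
The mechanism behind that jump can be sketched in plain Python. One round of GCN-style mean aggregation mixes each paper's features with its neighbors', so a paper with ambiguous features inherits the signal of the papers citing it. The toy graph and 2-dimensional features below are illustrative, not from Cora:

```python
# One round of mean-aggregation message passing on a toy citation graph.
# Node features here are 2-dim stand-ins for Cora's 1,433-dim vectors.
features = {
    "A": [1.0, 0.0],
    "B": [1.0, 0.0],
    "C": [0.0, 0.0],  # ambiguous on its own features
    "D": [0.0, 1.0],
}
edges = [("A", "C"), ("B", "C"), ("C", "D")]  # treated as undirected

def propagate(features, edges):
    """Average each node's features with its neighbors' (GCN-style, no weights)."""
    neighbors = {n: [] for n in features}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    out = {}
    for node, feat in features.items():
        stack = [feat] + [features[m] for m in neighbors[node]]
        out[node] = [sum(col) / len(stack) for col in zip(*stack)]
    return out

print(propagate(features, edges)["C"])  # [0.5, 0.25] -- C now leans toward A and B
```

A real GCN layer adds learned weight matrices and degree normalization on top of this aggregation, but the core mechanism is the same.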

Loading Cora in PyG

load_cora.py
from torch_geometric.datasets import Planetoid

dataset = Planetoid(root='/tmp/Cora', name='Cora')
data = dataset[0]  # Single graph

print(f"Nodes: {data.num_nodes}")        # 2708
print(f"Edges: {data.num_edges}")        # 10556
print(f"Features: {data.num_features}")  # 1433
print(f"Classes: {dataset.num_classes}") # 7
print(f"Train nodes: {data.train_mask.sum()}")  # 140

The Planetoid loader downloads Cora automatically and provides standard train/val/test splits.

Original Paper

Automating the Construction of Internet Portals with Machine Learning

Andrew McCallum, Kamal Nigam, Jason Rennie, Kristie Seymore (2000). Information Retrieval, 3(2), 127-163


Benchmark comparison (standard Planetoid split)

Method          Accuracy  Year  Paper
MLP (no graph)  ~61.0%    --    Baseline
GCN             81.5%     2017  Kipf & Welling
GAT             83.0%     2018  Veličković et al.
APPNP           83.3%     2019  Klicpera et al.
GCNII           85.5%     2020  Chen et al.
GPSConv         ~83.2%    2022  Rampášek et al.

Which Planetoid dataset should I use?

The three Planetoid datasets form a natural progression. Cora (2,708 nodes, 7 classes) is the densest and easiest -- use it as a sanity check. CiteSeer (3,327 nodes, 6 classes) is sparser (avg degree 2.7 vs 3.9), making it harder; use it to test robustness to limited neighborhood information. PubMed (19,717 nodes, 3 classes) is 7x larger but has only 3 classes; use it to validate that your training pipeline scales beyond toy size. If your GNN beats GCN on Cora but not on CiteSeer, it likely over-relies on dense graph structure. If it fails on PubMed, check your scalability.
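
Those density figures follow from simple arithmetic over node and edge counts. A quick check, using this article's counts for Cora and PyG's standard counts for CiteSeer and PubMed (PyG counts each undirected citation twice, once per direction):

```python
# Average degree = directed edge count / node count.
# Cora's counts are from this article; CiteSeer's and PubMed's are
# the standard figures reported by PyG's Planetoid loader.
datasets = {
    "Cora":     (2708, 10556),
    "CiteSeer": (3327, 9104),
    "PubMed":   (19717, 88648),
}
for name, (nodes, edges) in datasets.items():
    print(f"{name}: avg degree {edges / nodes:.1f}")
# Cora: avg degree 3.9, CiteSeer: avg degree 2.7, PubMed: avg degree 4.5
```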

Common tasks and benchmarks

The standard task on Cora is transductive semi-supervised node classification. Only 140 nodes (20 per class) are labeled for training, 500 for validation, and 1,000 for testing. The model must use the full graph structure during training but only predicts labels for test nodes.
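
The training mask can be reproduced in miniature: exactly 20 labeled nodes per class, every other node unlabeled during training. The random labels below are illustrative; PyG ships Cora's exact original masks:

```python
import random

# Sketch of a Planetoid-style semi-supervised split: 20 labeled nodes
# per class for training. Toy random labels stand in for Cora's real ones.
random.seed(0)
num_nodes, num_classes, per_class = 2708, 7, 20
labels = [random.randrange(num_classes) for _ in range(num_nodes)]

train_mask = [False] * num_nodes
taken = {c: 0 for c in range(num_classes)}
for i, y in enumerate(labels):
    if taken[y] < per_class:
        train_mask[i] = True
        taken[y] += 1

print(sum(train_mask))  # 140 training nodes = 20 per class x 7 classes
```

With only ~5% of nodes labeled, the model has to lean on the graph structure, which is exactly what makes the benchmark semi-supervised.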

Benchmark results on the standard split: GCN ~81.5%, GAT ~83.0%, APPNP ~83.3%, GCNII ~85.5%. The differences are small because Cora is a relatively easy dataset; the real differentiation between architectures shows up on harder, larger benchmarks.

Data source

The original Cora dataset can be downloaded from the LINQS group at UC Santa Cruz. The Planetoid version used by PyG is hosted by the PyG team and downloaded automatically.

BibTeX citation

cora.bib
@article{mccallum2000automating,
  title={Automating the Construction of Internet Portals with Machine Learning},
  author={McCallum, Andrew and Nigam, Kamal and Rennie, Jason and Seymore, Kristie},
  journal={Information Retrieval},
  volume={3},
  number={2},
  pages={127--163},
  year={2000},
  publisher={Springer}
}

@inproceedings{yang2016revisiting,
  title={Revisiting Semi-Supervised Learning with Graph Embeddings},
  author={Yang, Zhilin and Cohen, William and Salakhutdinov, Ruslan},
  booktitle={ICML},
  year={2016}
}

Cite McCallum et al. for the dataset, Yang et al. for the Planetoid split used by PyG.

Example: mapping to a business problem

Cora's structure (entities connected by relationships, classified by type) maps directly to enterprise problems. Replace papers with customers, citations with transactions, and categories with churn risk segments. The same message-passing principle applies: a customer's behavior is best predicted by looking at who they interact with, not just their individual features.

A bank classifying accounts by risk level is solving the same structural problem as classifying Cora papers by topic. The difference is scale (millions of accounts vs. 2,708 papers) and complexity (dozens of relationship types vs. one citation type).

From benchmark to production

Cora teaches the fundamental insight: neighborhood structure is predictive. But production graphs differ in three critical ways. First, real graphs have millions to billions of nodes, not 2,708. Second, real data is heterogeneous: customers, products, transactions, and merchants are distinct entity types with distinct features. Third, real data is temporal: a transaction from yesterday matters more than one from last year.
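
The contrast is easy to see in plain data structures. Cora is a single homogeneous edge list, while a production graph keys its edges by (source type, relation, destination type) and attaches timestamps — roughly the shape PyG's HeteroData exposes. The schema and values below are hypothetical:

```python
# Cora-style homogeneous graph: one node type (paper), one edge type (cites).
cora_edges = [(0, 1), (0, 2), (1, 2)]

# Production-style heterogeneous, temporal graph (hypothetical schema):
# edges are keyed by (source type, relation, destination type), and each
# edge carries a timestamp so recent interactions can be weighted higher.
hetero_edges = {
    ("customer", "made", "transaction"): [(17, 901, "2024-06-01")],
    ("transaction", "at", "merchant"): [(901, 55, "2024-06-01")],
    ("customer", "owns", "account"): [(17, 3, "2021-02-14")],
}

print(len(hetero_edges))  # 3 relation types vs. Cora's single citation type
```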

Frequently asked questions

What is the Cora dataset?

Cora is a citation network of 2,708 machine learning papers classified into 7 categories. Each paper is represented by a 1,433-dimensional bag-of-words feature vector. Edges represent citation links between papers. It is the most widely used benchmark for evaluating graph neural networks on node classification.

How do I load the Cora dataset in PyTorch Geometric?

Use `from torch_geometric.datasets import Planetoid; dataset = Planetoid(root='/tmp/Cora', name='Cora')`. The dataset object contains a single graph accessible via `dataset[0]` with attributes `x` (node features), `edge_index` (edges), `y` (labels), and `train_mask`/`val_mask`/`test_mask` for the standard split.

What is a good accuracy on Cora?

A 2-layer GCN achieves ~81.5% test accuracy on the standard split. GATConv reaches ~83%. State-of-the-art methods (graph transformers, label propagation combinations) can exceed 85%. If your model scores below 78%, something is likely misconfigured.

Why is Cora so popular as a GNN benchmark?

Cora is popular because it is small enough to train in seconds on a CPU, has a well-established standard split for reproducible comparisons, demonstrates clear benefits of graph structure over flat features, and has been used since the earliest GNN papers (Kipf & Welling, ICLR 2017), creating a long baseline of published results.

What are the limitations of Cora as a benchmark?

Cora has only 2,708 nodes, making it orders of magnitude smaller than real-world graphs. It is homogeneous (one node type, one edge type), has no temporal information, and uses a fixed train/test split that can overfit to specific architectures. Results on Cora often do not predict performance on production-scale heterogeneous graphs.

Learn more about graph ML

PyTorch Geometric is the open-source foundation for graph neural networks. Explore more layers, concepts, and production patterns.