
Cora: The Dataset Every GNN Paper Starts With

Cora is a citation network of 2,708 machine learning papers. It is the MNIST of graph neural networks: small enough to iterate fast, established enough that every method reports results on it, and simple enough that it teaches the fundamentals without distracting complexity.


TL;DR

  • Cora contains 2,708 papers (nodes) connected by 10,556 citation links (edges). Each paper has a 1,433-dimensional bag-of-words feature vector and belongs to one of 7 CS subcategories.
  • A 2-layer GCN achieves ~81.5% accuracy; GAT reaches ~83%. The standard Planetoid split uses 140 training nodes (20 per class), making Cora a semi-supervised learning benchmark.
  • Cora trains in seconds on a CPU. It is the fastest way to validate that your GNN code works before scaling to larger datasets.
  • The dataset is too small and homogeneous to support production conclusions. If your model works on Cora but not on real data, the gap is usually heterogeneity, scale, or temporal dynamics.
  • KumoRFM handles production citation-style problems (recommendation, fraud, churn) on graphs orders of magnitude larger than Cora, with multiple node and edge types, automatically.

Cora at a glance: 2,708 nodes · 10,556 edges · 1,433 features · 7 classes

What Cora contains

Cora is a citation network collected from machine learning papers. Each of the 2,708 nodes represents a paper. Each of the 10,556 edges represents a citation link (paper A cites paper B). The features are bag-of-words vectors: 1,433 binary values indicating which words from a fixed dictionary appear in the paper. The task is to classify each paper into one of 7 categories: Case-Based, Genetic Algorithms, Neural Networks, Probabilistic Methods, Reinforcement Learning, Rule Learning, or Theory.
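
The bag-of-words encoding is easy to picture with a toy vocabulary. The words below are made up for illustration; Cora's real dictionary has 1,433 stemmed terms:

```python
# Toy illustration of Cora-style bag-of-words features.
# Cora's real dictionary has 1,433 words; this sketch uses 6.
vocab = ["network", "bayes", "gradient", "genetic", "policy", "theorem"]

def bag_of_words(text, vocab):
    """Binary vector: 1 if the vocabulary word appears in the text, else 0."""
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in vocab]

abstract = "We train a neural network by gradient descent"
print(bag_of_words(abstract, vocab))  # [1, 0, 1, 0, 0, 0]
```

Each of Cora's 2,708 feature vectors is built the same way, just over a much larger dictionary.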

Why Cora matters

Cora occupies a unique position in graph ML research. It was used in the original GCN paper (Kipf & Welling, ICLR 2017) and has since appeared in virtually every GNN paper on node classification. Its importance is practical rather than scientific: Cora is a sanity check. If your new GNN layer cannot beat a basic GCN on Cora, something is wrong with your implementation before you worry about novelty.

The dataset also demonstrates why graph structure matters. A model using only node features (an MLP or logistic regression) achieves ~61% accuracy; adding citation structure via a GCN jumps to ~81.5%. That roughly 20-point gap is one of the clearest demonstrations of graph learning's value in the literature.
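
The mechanism behind that jump can be sketched in plain Python. One round of GCN-style mean aggregation mixes each paper's features with its neighbors', so a paper with ambiguous features inherits the signal of the papers citing it. The toy graph and 2-dimensional features below are illustrative, not from Cora:

```python
# One round of mean-aggregation message passing on a toy citation graph.
# Node features here are 2-dim stand-ins for Cora's 1,433-dim vectors.
features = {
    "A": [1.0, 0.0],
    "B": [1.0, 0.0],
    "C": [0.0, 0.0],  # ambiguous on its own features
    "D": [0.0, 1.0],
}
edges = [("A", "C"), ("B", "C"), ("C", "D")]  # treated as undirected

def propagate(features, edges):
    """Average each node's features with its neighbors' (GCN-style, no weights)."""
    neighbors = {n: [] for n in features}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    out = {}
    for node, feat in features.items():
        stack = [feat] + [features[m] for m in neighbors[node]]
        out[node] = [sum(col) / len(stack) for col in zip(*stack)]
    return out

print(propagate(features, edges)["C"])  # [0.5, 0.25] -- C now leans toward A and B
```

A real GCN layer adds learned weight matrices and degree normalization on top of this aggregation, but the core mechanism is the same.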

Loading Cora in PyG

load_cora.py
from torch_geometric.datasets import Planetoid

dataset = Planetoid(root='/tmp/Cora', name='Cora')
data = dataset[0]  # Single graph

print(f"Nodes: {data.num_nodes}")        # 2708
print(f"Edges: {data.num_edges}")        # 10556
print(f"Features: {data.num_features}")  # 1433
print(f"Classes: {dataset.num_classes}") # 7
print(f"Train nodes: {data.train_mask.sum()}")  # 140

The Planetoid loader downloads Cora automatically and provides standard train/val/test splits.

Original Paper

Automating the Construction of Internet Portals with Machine Learning

Andrew McCallum, Kamal Nigam, Jason Rennie, Kristie Seymore (2000). Information Retrieval, 3(2), 127-163


Benchmark comparison (standard Planetoid split)

Method          Accuracy  Year  Paper
MLP (no graph)  ~61.0%    --    Baseline
GCN             81.5%     2017  Kipf & Welling
GAT             83.0%     2018  Veličković et al.
APPNP           83.3%     2019  Klicpera et al.
GCNII           85.5%     2020  Chen et al.
GPSConv         ~83.2%    2022  Rampášek et al.

Which Planetoid dataset should I use?

The three Planetoid datasets form a natural progression. Cora (2,708 nodes, 7 classes) is the densest and easiest -- use it as a sanity check. CiteSeer (3,327 nodes, 6 classes) is sparser (avg degree 2.7 vs 3.9), making it harder; use it to test robustness to limited neighborhood information. PubMed (19,717 nodes, 3 classes) is 7x larger but has only 3 classes; use it to validate that your training pipeline scales beyond toy size. If your GNN beats GCN on Cora but not on CiteSeer, it likely over-relies on dense graph structure. If it fails on PubMed, check your scalability.
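
Those density figures follow from simple arithmetic over node and edge counts. A quick check, using this article's counts for Cora and PyG's standard counts for CiteSeer and PubMed (PyG counts each undirected citation twice, once per direction):

```python
# Average degree = directed edge count / node count.
# Cora's counts are from this article; CiteSeer's and PubMed's are
# the standard figures reported by PyG's Planetoid loader.
datasets = {
    "Cora":     (2708, 10556),
    "CiteSeer": (3327, 9104),
    "PubMed":   (19717, 88648),
}
for name, (nodes, edges) in datasets.items():
    print(f"{name}: avg degree {edges / nodes:.1f}")
# Cora: avg degree 3.9, CiteSeer: avg degree 2.7, PubMed: avg degree 4.5
```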

Common tasks and benchmarks

The standard task on Cora is transductive semi-supervised node classification. Only 140 nodes (20 per class) are labeled for training, 500 for validation, and 1,000 for testing. The model must use the full graph structure during training but only predicts labels for test nodes.
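
The training mask can be reproduced in miniature: exactly 20 labeled nodes per class, every other node unlabeled during training. The random labels below are illustrative; PyG ships Cora's exact original masks:

```python
import random

# Sketch of a Planetoid-style semi-supervised split: 20 labeled nodes
# per class for training. Toy random labels stand in for Cora's real ones.
random.seed(0)
num_nodes, num_classes, per_class = 2708, 7, 20
labels = [random.randrange(num_classes) for _ in range(num_nodes)]

train_mask = [False] * num_nodes
taken = {c: 0 for c in range(num_classes)}
for i, y in enumerate(labels):
    if taken[y] < per_class:
        train_mask[i] = True
        taken[y] += 1

print(sum(train_mask))  # 140 training nodes = 20 per class x 7 classes
```

With only ~5% of nodes labeled, the model has to lean on the graph structure, which is exactly what makes the benchmark semi-supervised.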

Benchmark results on the standard split: GCN ~81.5%, GAT ~83.0%, APPNP ~83.3%, GCNII ~85.5%. The differences are small because Cora is a relatively easy dataset; the real differentiation between architectures shows up on harder, larger benchmarks.

Data source

The original Cora dataset can be downloaded from the LINQS group at UC Santa Cruz. The Planetoid version used by PyG is hosted by the PyG team and downloaded automatically.

BibTeX citation

cora.bib
@article{mccallum2000automating,
  title={Automating the Construction of Internet Portals with Machine Learning},
  author={McCallum, Andrew and Nigam, Kamal and Rennie, Jason and Seymore, Kristie},
  journal={Information Retrieval},
  volume={3},
  number={2},
  pages={127--163},
  year={2000},
  publisher={Springer}
}

@inproceedings{yang2016revisiting,
  title={Revisiting Semi-Supervised Learning with Graph Embeddings},
  author={Yang, Zhilin and Cohen, William and Salakhutdinov, Ruslan},
  booktitle={ICML},
  year={2016}
}

Cite McCallum et al. for the dataset, Yang et al. for the Planetoid split used by PyG.

Example: mapping to a business problem

Cora's structure (entities connected by relationships, classified by type) maps directly to enterprise problems. Replace papers with customers, citations with transactions, and categories with churn risk segments. The same message-passing principle applies: a customer's behavior is best predicted by looking at who they interact with, not just their individual features.

A bank classifying accounts by risk level is solving the same structural problem as classifying Cora papers by topic. The difference is scale (millions of accounts vs. 2,708 papers) and complexity (dozens of relationship types vs. one citation type).

From benchmark to production

Cora teaches the fundamental insight: neighborhood structure is predictive. But production graphs differ in three critical ways. First, real graphs have millions to billions of nodes, not 2,708. Second, real data is heterogeneous: customers, products, transactions, and merchants are distinct entity types with distinct features. Third, real data is temporal: a transaction from yesterday matters more than one from last year.
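
The contrast is easy to see in plain data structures. Cora is a single homogeneous edge list, while a production graph keys its edges by (source type, relation, destination type) and attaches timestamps — roughly the shape PyG's HeteroData exposes. The schema and values below are hypothetical:

```python
# Cora-style homogeneous graph: one node type (paper), one edge type (cites).
cora_edges = [(0, 1), (0, 2), (1, 2)]

# Production-style heterogeneous, temporal graph (hypothetical schema):
# edges are keyed by (source type, relation, destination type), and each
# edge carries a timestamp so recent interactions can be weighted higher.
hetero_edges = {
    ("customer", "made", "transaction"): [(17, 901, "2024-06-01")],
    ("transaction", "at", "merchant"): [(901, 55, "2024-06-01")],
    ("customer", "owns", "account"): [(17, 3, "2021-02-14")],
}

print(len(hetero_edges))  # 3 relation types vs. Cora's single citation type
```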

Frequently asked questions

What is the Cora dataset?

Cora is a citation network of 2,708 machine learning papers classified into 7 categories. Each paper is represented by a 1,433-dimensional bag-of-words feature vector. Edges represent citation links between papers. It is the most widely used benchmark for evaluating graph neural networks on node classification.

How do I load the Cora dataset in PyTorch Geometric?

Use `from torch_geometric.datasets import Planetoid; dataset = Planetoid(root='/tmp/Cora', name='Cora')`. The dataset object contains a single graph accessible via `dataset[0]` with attributes `x` (node features), `edge_index` (edges), `y` (labels), and `train_mask`/`val_mask`/`test_mask` for the standard split.

What is a good accuracy on Cora?

A 2-layer GCN achieves ~81.5% test accuracy on the standard split. GATConv reaches ~83%. State-of-the-art methods (graph transformers, label propagation combinations) can exceed 85%. If your model scores below 78%, something is likely misconfigured.

Why is Cora so popular as a GNN benchmark?

Cora is popular because it is small enough to train in seconds on a CPU, has a well-established standard split for reproducible comparisons, demonstrates clear benefits of graph structure over flat features, and has been used since the earliest GNN papers (Kipf & Welling, ICLR 2017), creating a long baseline of published results.

What are the limitations of Cora as a benchmark?

Cora has only 2,708 nodes, making it orders of magnitude smaller than real-world graphs. It is homogeneous (one node type, one edge type), has no temporal information, and uses a fixed train/test split that can overfit to specific architectures. Results on Cora often do not predict performance on production-scale heterogeneous graphs.

Learn more about graph ML

PyTorch Geometric is the open-source foundation for graph neural networks. Explore more layers, concepts, and production patterns.