A citation network is a graph where academic papers are nodes and citations are directed edges. Paper A citing paper B creates an edge from A to B. Each paper node carries features derived from its content (bag-of-words or language model embeddings) and a label indicating its research topic. The task is node classification: predict each paper's topic from its content and citation context.
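Concretely, such a dataset reduces to three arrays. A minimal sketch (the sizes, edges, and labels below are toy illustrations, not a real dataset):

```python
import numpy as np

# Toy citation graph: 4 papers. Each row of edge_index is (citing, cited),
# i.e. a directed edge from A to B means paper A cites paper B.
num_papers = 4
edge_index = np.array([
    [0, 1],  # paper 0 cites paper 1
    [2, 1],  # paper 2 cites paper 1
    [3, 2],  # paper 3 cites paper 2
])
# Content-derived features (e.g. bag-of-words counts) and topic labels.
features = np.zeros((num_papers, 8))
labels = np.array([0, 0, 1, 1])

# The node-classification task: predict labels[i] from features and edge_index.
```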
Why citation networks demonstrate GNN value
The key property of citation networks is homophily: papers that cite each other tend to be in the same field. In Cora, 81% of edges connect papers with the same label. This means a GNN that aggregates neighbor information naturally receives confirming evidence about the node's class.
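Edge homophily is straightforward to measure: the fraction of edges whose two endpoints carry the same label. A sketch on a toy graph (the function name and data are ours, not from any library):

```python
import numpy as np

def edge_homophily(edge_index, labels):
    """Fraction of edges whose endpoints share a label."""
    src, dst = edge_index[:, 0], edge_index[:, 1]
    return float(np.mean(labels[src] == labels[dst]))

# Toy graph: 5 papers, two topics, 4 citation edges.
labels = np.array([0, 0, 0, 1, 1])
edges = np.array([[0, 1], [1, 2], [3, 4], [0, 3]])
print(edge_homophily(edges, labels))  # 0.75: 3 of 4 edges are same-class
```

On Cora this statistic comes out near 0.81, matching the figure above.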
A logistic regression on paper features alone achieves about 59% accuracy on Cora; a 2-layer GCN that also uses citation structure reaches 81.5%. That 22.5-percentage-point gain comes entirely from graph structure: knowing which papers a paper cites and which papers cite it.
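The 2-layer GCN can be sketched in a few lines of NumPy using the standard symmetric normalization from Kipf and Welling, logits = Â·ReLU(Â·X·W0)·W1 with Â = D^(-1/2)(A+I)D^(-1/2). The graph and weights below are random toys, not the trained Cora model:

```python
import numpy as np

def gcn_norm(A):
    """Symmetrically normalized adjacency with self-loops: D^-1/2 (A+I) D^-1/2."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def two_layer_gcn(A, X, W0, W1):
    A_norm = gcn_norm(A)
    H = np.maximum(A_norm @ X @ W0, 0.0)  # layer 1 + ReLU
    return A_norm @ H @ W1                # layer 2: class logits

# Toy graph: 5 papers; citations treated as undirected for aggregation.
A = np.zeros((5, 5))
for u, v in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    A[u, v] = A[v, u] = 1.0
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))    # 8-dim toy features
W0 = rng.standard_normal((8, 16))  # hidden width 16 (illustrative)
W1 = rng.standard_normal((16, 7))  # 7 output classes, as in Cora
logits = two_layer_gcn(A, X, W0, W1)
print(logits.shape)  # (5, 7): one score per class per paper
```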
Standard datasets
- Cora: 2,708 papers, 5,429 edges, 7 classes (Case-Based, Genetic Algorithms, Neural Networks, Probabilistic Methods, Reinforcement Learning, Rule Learning, Theory). Features: 1,433-dim binary word vectors.
- CiteSeer: 3,327 papers, 4,732 edges, 6 classes. Similar structure to Cora but slightly harder (lower homophily).
- PubMed: 19,717 papers, 44,338 edges, 3 classes (Diabetes Mellitus Experimental, Type 1, Type 2). Features: 500-dim TF-IDF vectors.
- ogbn-arxiv: 169,343 papers, 1.2M edges, 40 classes. From OGB, realistic scale with temporal split.
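One structural trait all four datasets share is sparsity. Treating each graph as undirected, the average degree 2E/N stays low (edge counts as listed above; the ogbn-arxiv figure is the exact count behind the rounded 1.2M):

```python
# Average degree = 2 * edges / nodes, taking the undirected view.
datasets = {
    "Cora":       {"nodes": 2_708,   "edges": 5_429},
    "CiteSeer":   {"nodes": 3_327,   "edges": 4_732},
    "PubMed":     {"nodes": 19_717,  "edges": 44_338},
    "ogbn-arxiv": {"nodes": 169_343, "edges": 1_166_243},
}
for name, d in datasets.items():
    print(f"{name}: average degree {2 * d['edges'] / d['nodes']:.1f}")
```

Every graph averages fewer than ~14 neighbors per node, which is part of why shallow 2-layer models with small receptive fields perform well here.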
What GNNs learn on citation networks
After 2 layers of message passing on a citation network:
- Layer 1: Each paper absorbs the topic signals from papers it cites and papers that cite it. A neural networks paper cited by 5 other neural networks papers gets a strong topic signal.
- Layer 2: Each paper absorbs 2-hop context: the topics of papers cited by its citations. This captures broader field relationships.
The result: even papers with ambiguous content (a paper about “learning” that could be RL or neural networks) get classified correctly based on their citation neighborhood.
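The growing receptive field can be observed directly with mean aggregation on a toy path graph (a sketch, not a trained model): give each node a one-hot "identity" signal and watch which entries become nonzero after each layer.

```python
import numpy as np

def mean_aggregate(A, H):
    """One round of message passing: average each node with its neighbors."""
    A_hat = A + np.eye(A.shape[0])
    return (A_hat / A_hat.sum(axis=1, keepdims=True)) @ H

# Path graph 0-1-2-3; edges treated as undirected.
A = np.zeros((4, 4))
for u, v in [(0, 1), (1, 2), (2, 3)]:
    A[u, v] = A[v, u] = 1.0

H0 = np.eye(4)                    # node i starts knowing only itself
H1 = mean_aggregate(A, H0)        # layer 1: 1-hop information
H2 = mean_aggregate(A, H1)        # layer 2: 2-hop information
print(H1[0] > 0)  # node 0 sees nodes 0 and 1
print(H2[0] > 0)  # node 0 now also sees node 2, two hops away
```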
Limitations as benchmarks
Citation networks have important limitations for evaluating GNNs:
- Too small: Cora has 2,708 nodes. Accuracy varies by 1-2 percentage points across random seeds, making it hard to distinguish methods whose true gap is of the same order.
- High homophily: 81% same-class edges means even simple label propagation works well. Does not test performance on heterophilous graphs.
- Transductive evaluation: All nodes are visible during training (only labels are masked). This does not reflect production settings where new nodes arrive constantly.
- Not temporal: Standard splits ignore publication time, but citation is inherently temporal (a paper can only cite older papers). Temporal splits give different, more realistic results.
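A temporal split is easy to sketch: train on papers published before a cutoff year and evaluate on later ones (the years below are synthetic, purely for illustration):

```python
import numpy as np

# Synthetic publication years for six papers (illustrative only).
years = np.array([2015, 2016, 2017, 2018, 2019, 2020])
cutoff = 2018

train_mask = years < cutoff    # older papers: labels visible during training
test_mask = years >= cutoff    # newer papers: topics to be predicted
print(train_mask.sum(), test_mask.sum())  # 3 3
```

This mirrors what ogbn-arxiv does: its official split trains on papers up to a cutoff year and tests on those published after it.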
For serious GNN evaluation, use OGB (ogbn-arxiv, ogbn-products) or RelBench, which address all four limitations.