
Citation Networks: Academic Papers as Nodes, Citations as Edges

Cora, CiteSeer, and PubMed are where nearly every GNN paper starts. Citation networks are the MNIST of graph ML: simple enough for quick experiments, structured enough to demonstrate why graph learning works.

TL;DR

  • Citation networks are directed graphs: papers are nodes (features: word vectors), citations are edges (paper A cites paper B). The task is classifying papers into research topics.
  • Graph structure helps because papers that cite each other tend to share topics. A 2-layer GCN on Cora achieves 81.5% accuracy, vs 59% using only node features (no graph).
  • Classic datasets: Cora (2,708 nodes, 7 classes), CiteSeer (3,327 nodes, 6 classes), PubMed (19,717 nodes, 3 classes). Large-scale: ogbn-arxiv (169K nodes), ogbn-papers100M (111M nodes).
  • Homophily is high: 81% of Cora edges connect same-class nodes. This is why simple GNNs (GCN, GAT) work well. Heterophilous graphs require different architectures.
  • Limitations as benchmarks: small size, high homophily, and transductive setup make citation networks unrepresentative of production graph tasks. Use OGB and RelBench for serious evaluation.

A citation network is a graph where academic papers are nodes and citations are directed edges. Paper A citing paper B creates an edge from A to B. Each paper node carries features derived from its content (bag-of-words or language model embeddings) and a label indicating its research topic. The task is node classification: predict each paper's topic from its content and citation context.
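
As a concrete sketch, here is a toy four-paper citation network in plain Python (paper names, the 3-word vocabulary, and labels are made up for illustration); real datasets like Cora store the same three ingredients, edges, features, and labels, at larger scale:

```python
# Hypothetical 4-paper citation graph: directed edges (citer -> cited),
# bag-of-words features, and topic labels.
papers = ["A", "B", "C", "D"]
edges = [("A", "B"), ("A", "C"), ("D", "B")]  # A cites B and C; D cites B
features = {                                  # toy 3-word vocabulary
    "A": [1, 0, 1],
    "B": [1, 1, 0],
    "C": [0, 0, 1],
    "D": [1, 1, 1],
}
labels = {"A": "neural_networks", "B": "neural_networks",
          "C": "theory", "D": "neural_networks"}

# In-degree = how often a paper is cited; B is cited twice.
in_degree = {p: sum(1 for _, dst in edges if dst == p) for p in papers}
print(in_degree)  # {'A': 0, 'B': 2, 'C': 1, 'D': 0}
```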

Why citation networks demonstrate GNN value

The key property of citation networks is homophily: papers that cite each other tend to be in the same field. In Cora, 81% of edges connect papers with the same label. This means a GNN that aggregates neighbor information naturally receives confirming evidence about the node's class.
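
The edge homophily ratio behind that 81% figure is simple to compute: the fraction of edges whose endpoints share a label. A minimal sketch on a made-up graph:

```python
def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints have the same label."""
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)

# Toy graph with hypothetical labels: 3 of 5 edges connect same-class nodes.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
labels = {0: "nn", 1: "nn", 2: "nn", 3: "theory"}
print(edge_homophily(edges, labels))  # 0.6
```

Running the same computation over Cora's edges and labels yields roughly 0.81.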

A logistic regression on paper features alone achieves 59% accuracy on Cora. Adding citation structure via a 2-layer GCN raises this to 81.5%. The 22.5-percentage-point improvement comes entirely from graph structure: knowing what a paper cites and what cites it.
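
The propagation step that makes the difference can be sketched in a few lines of numpy. This follows the GCN rule of Kipf and Welling, H' = D^{-1/2}(A + I)D^{-1/2} H, on a toy symmetrized 4-node graph (the learned weight matrix and nonlinearity are omitted for clarity):

```python
import numpy as np

# Symmetrized toy citation graph and 2-dim node features (made up).
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [0.0, 1.0]])

A_hat = A + np.eye(4)                 # add self-loops
d = A_hat.sum(axis=1)                 # degrees including self-loop
D_inv_sqrt = np.diag(d ** -0.5)
H_new = D_inv_sqrt @ A_hat @ D_inv_sqrt @ H  # each node mixes neighbor features
```

After one step, each row of `H_new` is a degree-normalized average of the node's own features and its neighbors', which is what lets topic signal flow along citations.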

Standard datasets

  • Cora: 2,708 papers, 5,429 edges, 7 classes (Case-Based, Genetic Algorithms, Neural Networks, Probabilistic Methods, Reinforcement Learning, Rule Learning, Theory). Features: 1,433-dim binary word vectors.
  • CiteSeer: 3,327 papers, 4,732 edges, 6 classes. Similar structure to Cora but slightly harder (lower homophily).
  • PubMed: 19,717 papers, 44,338 edges, 3 classes (Diabetes Type 1, Type 2, Experimental). TF-IDF features.
  • ogbn-arxiv: 169,343 papers, 1.2M edges, 40 classes. From OGB, realistic scale with temporal split.

What GNNs learn on citation networks

After 2 layers of message passing on a citation network:

  • Layer 1: Each paper absorbs the topic signals from papers it cites and papers that cite it. A neural networks paper cited by 5 other neural networks papers gets a strong topic signal.
  • Layer 2: Each paper absorbs 2-hop context: the topics of papers cited by its citations. This captures broader field relationships.

The result: even papers with ambiguous content (a paper about “learning” that could be RL or neural networks) get classified correctly based on their citation neighborhood.
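
A crude stand-in for that neighborhood effect is majority voting over a paper's citation neighbors; one message-passing layer does something smoother (weighted feature averaging), but the intuition is the same. Paper names and labels here are hypothetical:

```python
from collections import Counter

def predict_by_neighborhood(node, edges, known_labels):
    """Classify a paper by majority vote over its citation neighbors
    (both papers it cites and papers that cite it)."""
    neighbors = [v for u, v in edges if u == node] + \
                [u for u, v in edges if v == node]
    votes = Counter(known_labels[n] for n in neighbors if n in known_labels)
    return votes.most_common(1)[0][0]

# Ambiguous paper "P" cites a, b, d and is cited by c.
edges = [("P", "a"), ("P", "b"), ("c", "P"), ("P", "d")]
known = {"a": "rl", "b": "neural_networks",
         "c": "neural_networks", "d": "neural_networks"}
print(predict_by_neighborhood("P", edges, known))  # neural_networks
```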

Limitations as benchmarks

Citation networks have important limitations for evaluating GNNs:

  • Too small: Cora has 2,708 nodes. Variance across random seeds is 1-2%, making it hard to distinguish methods.
  • High homophily: 81% same-class edges means even simple label propagation works well. Does not test performance on heterophilous graphs.
  • Transductive evaluation: All nodes are visible during training (only labels are masked). This does not reflect production settings where new nodes arrive constantly.
  • Not temporal: Standard splits are random, but citations are temporal (you can only cite older papers). Temporal splits give different, more realistic results.
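
The temporal-split point is easy to operationalize: instead of masking random nodes, train on papers published before a cutoff year and evaluate on later ones, as ogbn-arxiv does. A minimal sketch with hypothetical paper IDs and years:

```python
# Hypothetical publication years; cutoff chosen for illustration.
years = {"p1": 2015, "p2": 2016, "p3": 2018, "p4": 2019, "p5": 2020}
cutoff = 2018

train_papers = [p for p, y in years.items() if y < cutoff]
test_papers = [p for p, y in years.items() if y >= cutoff]
print(train_papers, test_papers)  # ['p1', 'p2'] ['p3', 'p4', 'p5']
```

Because citations only point backward in time, this split never lets the model see edges from the future, which is why it gives more realistic numbers than a random split.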

For serious GNN evaluation, use OGB (ogbn-arxiv, ogbn-products) or RelBench, which address all four limitations.

Frequently asked questions

What is a citation network?

A citation network is a directed graph where nodes represent academic papers and edges represent citations (paper A cites paper B). Node features are typically bag-of-words or language model embeddings of the paper's abstract. The classic GNN benchmark task is classifying each paper into its research topic using both its content and citation structure.

Why are citation networks so popular for GNN research?

Three reasons: (1) They are real-world graphs with meaningful structure. (2) They are small enough for quick experimentation (Cora has 2,708 nodes). (3) The task (paper classification) is intuitive and demonstrates the value of graph structure: papers that cite each other tend to be in the same field.

What are the standard citation network datasets?

Cora (2,708 papers, 7 classes), CiteSeer (3,327 papers, 6 classes), and PubMed (19,717 papers, 3 classes) are the classic small benchmarks. ogbn-arxiv (169,343 papers, 40 classes) and ogbn-papers100M (111M papers) are large-scale alternatives from Open Graph Benchmark.

Learn more about graph ML

PyTorch Geometric is the open-source foundation for graph neural networks. Explore more layers, concepts, and production patterns.