3,327 Nodes · 9,104 Edges · 3,703 Features · 6 Classes
What CiteSeer contains
CiteSeer is a citation network of scientific papers from the CiteSeer digital library. Each of the 3,327 nodes is a paper. Each of the 9,104 edges is a citation link. Node features are 3,703-dimensional bag-of-words vectors derived from paper text. The task is to classify papers into 6 categories: Agents, AI, DB, IR, ML, and HCI.
Compared to Cora, CiteSeer has higher-dimensional features but a sparser graph. The average node degree is about 2.7 (vs 3.9 in Cora), meaning each paper has fewer citation connections to learn from. This sparsity is the defining challenge.
Why CiteSeer matters
CiteSeer fills a specific gap in the benchmark ecosystem. Cora is dense enough that even simple aggregation works well. CiteSeer's sparsity forces models to extract more from less. This tests two capabilities: how well the model uses node features when graph structure is limited, and whether attention mechanisms (GATConv, TransformerConv) can identify the few high-value connections in a sparse neighborhood.
Real-world graphs are often sparse. A fraud detection graph has millions of legitimate transactions for every suspicious one. A recommendation graph has far more items than any user has interacted with. CiteSeer's sparsity, while mild, points toward these real challenges.
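The attention idea can be seen in miniature: with only two or three neighbors, a single softmax decides how much each citation contributes. Below is a toy numpy sketch of softmax-weighted neighbor aggregation; the vectors are made up and the dot-product scoring is a simplification (GATConv learns its scoring function rather than using a plain dot product):

```python
import numpy as np

# Toy attention aggregation for one paper with only two cited
# neighbors -- the sparse-CiteSeer regime. All vectors are random
# stand-ins; this illustrates the softmax-weighting idea behind
# attention layers, not GATConv's actual implementation.
rng = np.random.default_rng(0)
h_self = rng.normal(size=4)          # the node's own feature vector
h_nb = rng.normal(size=(2, 4))       # its two neighbors

# Attention logits via dot-product scoring (GAT uses a learned function)
logits = h_nb @ h_self
weights = np.exp(logits) / np.exp(logits).sum()  # softmax over neighbors

# Weighted aggregation: the higher-scoring neighbor dominates
h_out = weights @ h_nb
```

With few neighbors, each attention weight carries more responsibility, which is one intuition for why attention-based models tend to outperform uniform aggregation on sparse graphs.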
Loading CiteSeer in PyG
from torch_geometric.datasets import Planetoid
dataset = Planetoid(root='/tmp/CiteSeer', name='CiteSeer')
data = dataset[0]
print(f"Nodes: {data.num_nodes}") # 3327
print(f"Edges: {data.num_edges}") # 9104
print(f"Features: {data.num_features}") # 3703
print(f"Classes: {dataset.num_classes}") # 6
print(f"Avg degree: {data.num_edges / data.num_nodes:.1f}") # ~2.7
Same Planetoid API as Cora. The standard split uses 120 training nodes (20 per class).
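As a sanity check on the sparsity claim, the average degree is plain arithmetic on the counts printed above. Note that PyG's `num_edges` counts each undirected citation in both directions, so 9,104 directed edges correspond to 4,552 citation links, and `num_edges / num_nodes` still gives the average undirected degree:

```python
# Values as printed by the snippet above
num_nodes, num_edges = 3327, 9104

# Each undirected citation appears twice in edge_index, so this ratio
# equals the average undirected node degree
avg_degree = num_edges / num_nodes
print(f"{avg_degree:.2f}")  # prints 2.74
```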
Original Paper
CiteSeer: An Automatic Citation Indexing System
C. Lee Giles, Kurt D. Bollacker, Steve Lawrence (1998). ACM DL '98
Benchmark comparison (standard Planetoid split)
| Method | Accuracy | Year | Paper |
|---|---|---|---|
| MLP (no graph) | ~58.0% | -- | Baseline |
| GCN | 70.3% | 2017 | Kipf & Welling |
| GAT | 72.5% | 2018 | Veličković et al. |
| APPNP | 71.8% | 2019 | Klicpera et al. |
| GCNII | 73.4% | 2020 | Chen et al. |
Which Planetoid dataset should I use?
Cora (2,708 nodes, avg degree 3.9) is the easiest -- use it to confirm your code works. CiteSeer (3,327 nodes, avg degree 2.7) is sparser, dropping GCN from 81% to 70%; use it to test robustness to sparse neighborhoods. PubMed (19,717 nodes, avg degree 4.5) is the largest and has only 3 classes; use it to test scalability. Run all three together: if your method improves on Cora but degrades on CiteSeer, it may depend too heavily on graph density.
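The comparison above can be tabulated directly. This sketch uses the published Planetoid statistics rather than a live download, so it mirrors what `Planetoid(root=..., name=...)` would report without fetching anything:

```python
# Published Planetoid stats: (nodes, directed edges, classes)
stats = {
    'Cora':     (2708, 10556, 7),
    'CiteSeer': (3327, 9104, 6),
    'PubMed':   (19717, 88648, 3),
}

for name, (n, e, c) in stats.items():
    # e / n is the average undirected degree (each edge stored twice)
    print(f"{name}: {n} nodes, avg degree {e / n:.1f}, {c} classes")
```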
Common tasks and benchmarks
Like Cora, the standard task is transductive semi-supervised node classification with 120 labeled training nodes (20 per class), 500 validation nodes, and 1,000 test nodes. Benchmark results: GCN ~70.3%, GAT ~72.5%, APPNP ~71.8%, GCNII ~73.4%. The 10-point gap between CiteSeer and Cora scores across all methods confirms that sparsity is the primary difficulty, not model architecture.
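The transductive setup can be sketched with a masked loss: the model sees the whole graph at every step, but gradients flow only from the 120 labeled training nodes. A minimal PyTorch sketch with random stand-in tensors (no actual GNN, and the mask here simply takes the first 120 nodes rather than the true 20-per-class selection):

```python
import torch
import torch.nn.functional as F

num_nodes, num_classes = 3327, 6

# Stand-ins for model output and ground-truth labels
logits = torch.randn(num_nodes, num_classes)
labels = torch.randint(0, num_classes, (num_nodes,))

# Transductive split: only 120 nodes contribute to the training loss
train_mask = torch.zeros(num_nodes, dtype=torch.bool)
train_mask[:120] = True  # placeholder for the real 20-per-class mask

loss = F.cross_entropy(logits[train_mask], labels[train_mask])
```

In a real run, `train_mask` comes directly from `data.train_mask` in the Planetoid object, alongside `val_mask` (500 nodes) and `test_mask` (1,000 nodes).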
Data source
The original CiteSeer dataset is available from the LINQS group at UC Santa Cruz. The Planetoid version used by PyG is downloaded automatically.
BibTeX citation
@inproceedings{giles1998citeseer,
title={CiteSeer: An Automatic Citation Indexing System},
author={Giles, C. Lee and Bollacker, Kurt D. and Lawrence, Steve},
booktitle={Proceedings of the Third ACM Conference on Digital Libraries},
pages={89--98},
year={1998}
}
@inproceedings{yang2016revisiting,
title={Revisiting Semi-Supervised Learning with Graph Embeddings},
author={Yang, Zhilin and Cohen, William and Salakhutdinov, Ruslan},
booktitle={ICML},
year={2016}
}
Cite Giles et al. for the dataset, Yang et al. for the Planetoid split.
Example: sparse graphs in enterprise
Most enterprise graphs resemble CiteSeer more than Cora. Consider a B2B SaaS platform where companies are nodes and business relationships are edges. Most companies have only a handful of connections. A recommendation system must predict which other companies a given customer might want to work with, using sparse relationship data plus rich company features -- the same tradeoff CiteSeer presents.
From benchmark to production
CiteSeer teaches that graph sparsity degrades GNN performance. In production, this problem is amplified: new users have zero connections (cold start), new products have no purchase history, and new accounts have no transaction graph. Handling sparsity requires architectures that balance feature-based and structure-based learning.
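One concrete way to expose that feature/structure balance is APPNP-style personalized-PageRank propagation, where a teleport parameter alpha sets how much each node trusts its own feature-based prediction versus its propagated neighborhood. A numpy sketch on a toy 3-node path graph (alpha and the iteration count are illustrative choices, not tuned values):

```python
import numpy as np

# Symmetrically normalized propagation matrix for a 3-node path a-b-c
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
A_hat = A + np.eye(3)                        # add self-loops
d_inv_sqrt = np.diag(A_hat.sum(axis=1) ** -0.5)
P = d_inv_sqrt @ A_hat @ d_inv_sqrt

# H0 stands in for per-node feature-based predictions
H0 = np.eye(3)

# APPNP iteration: H <- (1 - alpha) * P @ H + alpha * H0.
# alpha near 1 keeps the feature-based prediction (good for cold-start
# nodes with no edges); alpha near 0 leans on graph structure.
alpha = 0.2
H = H0.copy()
for _ in range(10):
    H = (1 - alpha) * P @ H + alpha * H0
```

Because the teleport term re-injects `H0` at every step, a node with zero edges still keeps a usable prediction, which is exactly the cold-start property the paragraph above calls for.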