18,333 Nodes · 163,788 Edges · 6,805 Features · 15 Classes
What Coauthor CS contains
Coauthor CS is an academic collaboration network from the Microsoft Academic Graph. Each of the 18,333 nodes represents a computer science researcher. An edge connects two researchers if they co-authored at least one paper. Node features are 6,805-dimensional keyword vectors representing the research topics an author has published on. The 15 classes correspond to CS research fields (AI, databases, theory, systems, etc.).
The dataset exhibits strong homophily: co-authors usually work in the same field. An AI researcher co-authors with other AI researchers, a database researcher with other database researchers. This makes neighborhood aggregation highly informative -- a node's co-authors are excellent predictors of its own field.
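This homophily can be quantified as edge homophily: the fraction of edges whose endpoints share a class label. A minimal sketch (using a hypothetical toy graph in PyG's COO `edge_index` format rather than the full Coauthor CS graph):

```python
import torch

# Edge homophily: fraction of edges whose endpoints share a class label.
# edge_index is in PyG's COO format; y holds one class label per node.
def edge_homophily(edge_index: torch.Tensor, y: torch.Tensor) -> float:
    src, dst = edge_index
    return (y[src] == y[dst]).float().mean().item()

# Toy 4-node example: nodes 0,1 in class 0; nodes 2,3 in class 1.
edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 3]])
y = torch.tensor([0, 0, 1, 1])
print(edge_homophily(edge_index, y))  # 0.75: 3 of 4 directed edges are intra-class
```

On the real dataset, passing `data.edge_index` and `data.y` would give the corresponding score; values near 1.0 indicate strongly homophilous graphs like Coauthor CS.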
Why Coauthor CS matters
Coauthor CS represents a different kind of graph than citations or co-purchases. It models professional relationships between people. This structure appears throughout enterprise settings: employee collaboration networks, business partner ecosystems, advisor-client relationships, and cross-departmental project teams. GNNs that work on Coauthor CS can be adapted for organizational analytics: predicting team performance, identifying key collaborators, or detecting organizational silos.
The high-dimensional features (6,805) also make Coauthor CS useful for studying the interaction between features and structure. Models can achieve reasonable accuracy using features alone (the keywords are very informative). Adding graph structure improves accuracy further, but the margin is smaller than on Cora where features are less descriptive.
Loading Coauthor CS in PyG
from torch_geometric.datasets import Coauthor
dataset = Coauthor(root='/tmp/Coauthor', name='CS')
data = dataset[0]
print(f"Nodes: {data.num_nodes}") # 18333
print(f"Edges: {data.num_edges}") # 163788
print(f"Features: {data.num_features}") # 6805
print(f"Classes: {dataset.num_classes}") # 15
print(f"Avg degree: {data.num_edges / data.num_nodes:.1f}") # ~8.9
No standard split is provided. Create random or stratified train/val/test masks.
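A sketch of the per-class random split used by Shchur et al. (20 training and 30 validation nodes per class, the remainder as test). The function name and signature are illustrative; in PyG you would pass `data.y` and attach the returned masks to the `Data` object:

```python
import torch

# Per-class random split: 20 train / 30 val nodes per class, rest test.
# y is the node-label tensor (e.g. data.y from the Coauthor CS Data object).
def per_class_split(y: torch.Tensor, num_train: int = 20, num_val: int = 30,
                    seed: int = 0):
    g = torch.Generator().manual_seed(seed)
    n = y.size(0)
    train = torch.zeros(n, dtype=torch.bool)
    val = torch.zeros(n, dtype=torch.bool)
    for c in y.unique():
        idx = (y == c).nonzero(as_tuple=True)[0]
        idx = idx[torch.randperm(idx.numel(), generator=g)]  # shuffle class
        train[idx[:num_train]] = True
        val[idx[num_train:num_train + num_val]] = True
    test = ~(train | val)
    return train, val, test

# Toy check: 3 classes with 100 nodes each.
y = torch.arange(3).repeat_interleave(100)
train_mask, val_mask, test_mask = per_class_split(y)
print(train_mask.sum().item(), val_mask.sum().item(), test_mask.sum().item())  # 60 90 150
```

Usage on the real dataset: `data.train_mask, data.val_mask, data.test_mask = per_class_split(data.y)`. Fixing the seed matters here, since results on Coauthor CS vary noticeably across split draws.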
Original Paper
Pitfalls of Graph Neural Network Evaluation
Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, Stephan Günnemann (2018). NeurIPS 2018 Workshop on Relational Representation Learning.
Benchmark comparison (random 20/30/50 splits)
| Method | Accuracy | Year | Paper |
|---|---|---|---|
| MLP (no graph) | ~88.5% | -- | Baseline |
| GCN | ~91.1% | 2017 | Kipf & Welling |
| GAT | ~92.3% | 2018 | Veličković et al. |
| GraphSAGE | ~91.5% | 2017 | Hamilton et al. |
| GCNII | ~93.0% | 2020 | Chen et al. |
Which Coauthor dataset should I use?
Coauthor CS (18,333 nodes, 15 classes) has more classes and sparser structure (avg degree ~8.9) -- use it to differentiate architectures. Coauthor Physics (34,493 nodes, 5 classes) is larger and denser (avg degree ~14.4) but with fewer classes, making it easier. For comparison with citation networks, Coauthor CS is most directly comparable to PubMed (similar scale, both academic graphs) but models people rather than papers.
Common tasks and benchmarks
Node classification is the primary task: predict each author's research field. Without a standard split, results vary by partition strategy. With a typical 20/30/50 train/val/test split, GCN achieves ~91-93%, GAT ~92-93%, and GraphSAGE ~91-93%. The high baseline across all methods reflects the strong homophily in collaboration networks.
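As a sketch of the models benchmarked above (not the exact benchmark setup, which would use PyG's `GCNConv` and the sparse `edge_index`), a two-layer GCN forward pass over a dense symmetrically normalized adjacency looks like this; all sizes are toy stand-ins for Coauthor CS:

```python
import torch

# Two-layer GCN forward pass on a dense adjacency matrix (sketch only;
# production code would use torch_geometric.nn.GCNConv with sparse edges).
def gcn_forward(x, adj, w1, w2):
    # Symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2}
    a_hat = adj + torch.eye(adj.size(0))
    d_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)
    a_norm = d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)
    h = torch.relu(a_norm @ x @ w1)  # layer 1: aggregate, transform, ReLU
    return a_norm @ h @ w2           # layer 2: class logits

n, f, hid, c = 5, 8, 16, 3           # toy: 5 nodes, 8 features, 3 classes
x = torch.randn(n, f)
adj = (torch.rand(n, n) > 0.5).float()
adj = ((adj + adj.T) > 0).float().fill_diagonal_(0)  # symmetric, no self-loops
w1, w2 = torch.randn(f, hid), torch.randn(hid, c)
logits = gcn_forward(x, adj, w1, w2)
print(logits.shape)  # torch.Size([5, 3]): one logit vector per node
```

Training then minimizes cross-entropy over `logits[train_mask]` only, while the graph structure lets label information from training nodes propagate to their unlabeled co-authors.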
Link prediction is also natural: predict future co-authorships. This maps to collaboration recommendation -- suggesting potential co-authors based on research overlap and network proximity.
Data source
The Coauthor datasets were introduced by Shchur et al. (2018) and are derived from the Microsoft Academic Graph. PyG downloads the processed version automatically.
BibTeX citation
@article{shchur2018pitfalls,
title={Pitfalls of Graph Neural Network Evaluation},
author={Shchur, Oleksandr and Mumme, Maximilian and Bojchevski, Aleksandar and G{\"u}nnemann, Stephan},
journal={arXiv preprint arXiv:1811.05868},
year={2018}
}
Cite Shchur et al. for the Coauthor CS benchmark dataset.
Example: organizational analytics
A large enterprise wants to understand its internal collaboration patterns. Employees are nodes, joint projects are edges, and skills are features. The task: classify employees into functional groups and identify cross-functional collaborators who bridge organizational silos. This is structurally identical to Coauthor CS, with employees replacing researchers and business units replacing research fields.
From benchmark to production
Production collaboration networks are dynamic (people join, leave, change roles), heterogeneous (employees, contractors, partners, clients), and multi-relational (co-authored, reviewed, managed, mentored). Coauthor CS is a static, homogeneous snapshot. Bridging this gap requires temporal modeling, heterogeneous message passing, and continuous updating as the graph evolves.