34,493 Nodes · 495,924 Edges · 8,415 Features · 5 Classes
What Coauthor Physics contains
Coauthor Physics is a collaboration network from the Microsoft Academic Graph covering physics research. Each of the 34,493 nodes is a physics researcher. The 495,924 edges connect co-authors. Node features are 8,415-dimensional keyword vectors summarizing each author's publication topics. The 5 classes represent major physics subfields.
The physics community is more tightly connected than CS: the average degree is ~14.4, about 60% higher than Coauthor CS (~8.9). Physics papers often have large author lists (especially in experimental physics), creating dense collaboration clusters. This density benefits GNN aggregation -- each author's neighborhood contains rich information about their research field.
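The density figures above follow directly from the published counts. PyG stores each undirected co-authorship as two directed edges, so average degree is simply `num_edges / num_nodes`. The Coauthor CS counts used for comparison (18,333 nodes, 163,788 edges) come from the PyG dataset documentation:

```python
# Average degree from the dataset statistics.
# PyG counts each undirected edge twice, so num_edges / num_nodes
# already gives the average undirected degree.
physics_nodes, physics_edges = 34_493, 495_924
cs_nodes, cs_edges = 18_333, 163_788  # Coauthor CS, per the PyG docs

physics_deg = physics_edges / physics_nodes
cs_deg = cs_edges / cs_nodes

print(f"Coauthor Physics avg degree: {physics_deg:.1f}")  # ~14.4
print(f"Coauthor CS avg degree:      {cs_deg:.1f}")       # ~8.9
```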
Why Coauthor Physics matters
Coauthor Physics is valuable not for its difficulty (it is easy) but for what it teaches about the relationship between graph density, feature richness, and GNN performance. When both features and structure are informative, even simple GCNs achieve excellent results. This raises a practical question: when is the graph necessary? On Coauthor Physics, a feature-only model (MLP) already achieves ~90%. The graph adds 3-5 points. Is that gap worth the complexity of graph-based training?
For production systems, this tradeoff is critical. If features alone get you 90% of the way, graph-based methods must justify their additional complexity with clear accuracy or capability gains. Coauthor Physics provides a controlled setting to study this tradeoff.
Loading Coauthor Physics in PyG
```python
from torch_geometric.datasets import Coauthor

dataset = Coauthor(root='/tmp/Coauthor', name='Physics')
data = dataset[0]

print(f"Nodes: {data.num_nodes}")        # 34493
print(f"Edges: {data.num_edges}")        # 495924
print(f"Features: {data.num_features}")  # 8415
print(f"Classes: {dataset.num_classes}") # 5
```

Same Coauthor API as Coauthor CS. No standard split -- create your own partition.
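Since there is no standard split, one common choice is the protocol from Shchur et al. (2018): 20 labeled nodes per class for training, 30 per class for validation, and the rest for testing. A minimal NumPy sketch (convert the resulting masks to `torch.bool` tensors before attaching them to `data`):

```python
import numpy as np

def make_split(y, num_classes, train_per_class=20, val_per_class=30, seed=0):
    """Random per-class split in the style of Shchur et al. (2018):
    train_per_class nodes per class for training, val_per_class for
    validation, and all remaining nodes for testing."""
    rng = np.random.default_rng(seed)
    train_idx, val_idx = [], []
    for c in range(num_classes):
        idx = rng.permutation(np.where(y == c)[0])
        train_idx.extend(idx[:train_per_class])
        val_idx.extend(idx[train_per_class:train_per_class + val_per_class])
    train_mask = np.zeros(len(y), dtype=bool)
    val_mask = np.zeros(len(y), dtype=bool)
    train_mask[train_idx] = True
    val_mask[val_idx] = True
    test_mask = ~(train_mask | val_mask)  # everything not in train/val
    return train_mask, val_mask, test_mask
```

For Coauthor Physics, `y = data.y.numpy()` and `num_classes = 5`; wrap the masks with `torch.from_numpy(...)` and assign them as `data.train_mask`, `data.val_mask`, and `data.test_mask`.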
Original Paper
Pitfalls of Graph Neural Network Evaluation
Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, Stephan Günnemann (2018). NeurIPS 2018 Workshop on Relational Representation Learning.
Benchmark comparison (random splits)
| Method | Accuracy | Year | Paper |
|---|---|---|---|
| MLP (no graph) | ~90.5% | -- | Baseline |
| GCN | ~93.6% | 2017 | Kipf & Welling |
| GAT | ~94.2% | 2018 | Veličković et al. |
| GraphSAGE | ~93.1% | 2017 | Hamilton et al. |
| GCNII | ~95.3% | 2020 | Chen et al. |
Which Coauthor dataset should I use?
Coauthor CS (18,333 nodes, 15 classes, avg degree ~8.9) is sparser and has more classes, making it better for differentiating GNN architectures. Coauthor Physics (34,493 nodes, 5 classes, avg degree ~14.4) is denser, larger, and easier -- use it to test efficiency on high-dimensional features (8,415 dims). If you want a challenge, use CS. If you want a scalability and efficiency test, use Physics.
Common tasks and benchmarks
Node classification with researcher-created splits. GCN achieves ~93-95%, GAT ~94-95%, MLP (no graph) ~90-92%. The small gap between graph and non-graph methods reflects the high feature quality. Coauthor Physics is most useful for testing training efficiency: the 8,415-dimensional features create large weight matrices, and the 34K nodes push memory usage higher than Planetoid datasets.
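To see why the 8,415-dimensional features dominate memory, a back-of-envelope estimate (assuming float32 storage and an illustrative hidden size of 64, which is not prescribed by the dataset):

```python
# Back-of-envelope memory estimate (float32 = 4 bytes per value).
# Hidden size 64 is a hypothetical choice for illustration.
num_nodes, num_features, hidden = 34_493, 8_415, 64

feature_matrix_mb = num_nodes * num_features * 4 / 1024**2
weight_matrix_mb = num_features * hidden * 4 / 1024**2

print(f"Node feature matrix: {feature_matrix_mb:.0f} MB")  # ~1107 MB
print(f"First-layer weights: {weight_matrix_mb:.1f} MB")   # ~2.1 MB
```

The dense feature matrix alone is over a gigabyte, which is why full-batch training on Coauthor Physics stresses GPU memory far more than any Planetoid dataset.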
Data source
The Coauthor datasets were introduced by Shchur et al. (2018) and are derived from the Microsoft Academic Graph. PyG downloads the processed version automatically.
BibTeX citation
@article{shchur2018pitfalls,
title={Pitfalls of Graph Neural Network Evaluation},
author={Shchur, Oleksandr and Mumme, Maximilian and Bojchevski, Aleksandar and G{\"u}nnemann, Stephan},
journal={arXiv preprint arXiv:1811.05868},
year={2018}
}

Cite Shchur et al. for the Coauthor Physics benchmark dataset.
Example: research talent mapping
A research lab wants to identify potential hires by mapping the physics collaboration landscape. Which researchers work at the intersection of multiple subfields? Who collaborates across traditional boundaries? GNN embeddings on Coauthor Physics can answer these questions: researchers whose embeddings place them between field clusters are likely interdisciplinary collaborators. Companies like Google DeepMind and Meta AI use similar graph-based approaches for talent intelligence.
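One simple heuristic for "between clusters" is to compare each researcher's embedding distance to the nearest and second-nearest class centroid: a ratio near 1.0 means the node sits roughly midway between two subfields. This is an illustrative sketch, not a method from the paper; `emb` is assumed to be any node-embedding matrix (e.g. a trained GNN's penultimate-layer output):

```python
import numpy as np

def interdisciplinarity_scores(emb, labels, num_classes):
    """Score how 'between clusters' each embedding is, as the ratio of
    distance to the nearest vs. second-nearest class centroid.
    Scores near 1.0 suggest a node between two subfield clusters;
    scores near 0.0 suggest a node deep inside one cluster."""
    centroids = np.stack(
        [emb[labels == c].mean(axis=0) for c in range(num_classes)]
    )
    # Pairwise node-to-centroid Euclidean distances, shape (N, num_classes).
    dists = np.linalg.norm(emb[:, None, :] - centroids[None, :, :], axis=-1)
    dists.sort(axis=1)
    return dists[:, 0] / dists[:, 1]
```

Ranking researchers by this score surfaces candidates whose work straddles subfield boundaries; in practice one would also inspect their actual keyword vectors before drawing conclusions.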
From benchmark to production
Production research networks include papers, institutions, funding agencies, and research topics as additional node types. Collaboration is temporal: a co-authorship from 2024 is more relevant than one from 2010. And the task expands beyond classification to link prediction (who will collaborate next?), influence propagation (which ideas will spread?), and anomaly detection (which collaborations are unusual?).