What is the Coauthor Physics dataset?

Coauthor Physics is an academic collaboration network of 34,493 physics researchers. Edges (495,924) connect co-authors. Features are 8,415-dimensional keyword vectors. The task is to classify authors into 5 physics subfields.

How does Coauthor Physics compare to Coauthor CS?

Physics is larger (34K vs 18K nodes, 496K vs 164K edges) with higher-dimensional features (8,415 vs 6,805) but fewer classes (5 vs 15). The physics community is more tightly connected, with higher average degree (~14.4 vs ~8.9).

What accuracy should I expect on Coauthor Physics?

GCN achieves ~93-95% accuracy. The combination of high-dimensional features, dense graph structure, and only 5 classes makes this one of the easier benchmarks. It is more useful for testing scalability than for differentiating architectures.

Why does Physics have higher accuracy than CS?

Fewer classes (5 vs 15) means an easier classification task. The denser graph (avg degree 14.4 vs 8.9) provides more neighborhood information. And the higher-dimensional features (8,415) give richer author profiles. All three factors contribute to higher accuracy.

Coauthor Physics Dataset: Physics Collaboration Network | PyG Guide

How do I load Coauthor Physics in PyTorch Geometric?

Use `from torch_geometric.datasets import Coauthor; dataset = Coauthor(root='/tmp/Coauthor', name='Physics')`. Same API as Coauthor CS.

Nodes	Edges	Features	Classes
34,493	495,924	8,415	5

What Coauthor Physics contains

Coauthor Physics is a collaboration network from the Microsoft Academic Graph covering physics research. Each of the 34,493 nodes is a physics researcher. The 495,924 edges connect co-authors. Node features are 8,415-dimensional keyword vectors summarizing each author's publication topics. The 5 classes represent major physics subfields.

The physics community is more tightly connected than CS: the average degree is ~14.4, nearly double Coauthor CS (~8.9). Physics papers often have large author lists (especially in experimental physics), creating dense collaboration clusters. This density benefits GNN aggregation -- each author's neighborhood contains rich information about their research field.
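The average-degree figures follow directly from the node and edge counts. A minimal sketch of that arithmetic, using the Physics counts quoted above and the rounded Coauthor CS figures from this guide (the function name is illustrative). It assumes the edge count is taken exactly as reported; if each undirected co-authorship is stored in both directions, as is common for undirected graphs in PyG, the ratio equals the average number of co-authors per researcher.

```python
# Counts quoted in this guide. The CS edge count is the rounded "164K" figure.
STATS = {
    "Physics": {"nodes": 34_493, "edges": 495_924},
    "CS": {"nodes": 18_333, "edges": 164_000},
}


def average_degree(num_nodes, num_edges):
    """Edges per node, using the edge count exactly as reported."""
    return num_edges / num_nodes


for name, s in STATS.items():
    print(f"{name}: average degree ~{average_degree(s['nodes'], s['edges']):.1f}")
```

With a loaded dataset, the same ratio is `data.num_edges / data.num_nodes`.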

Why Coauthor Physics matters

Coauthor Physics is valuable not for its difficulty (it is easy) but for what it teaches about the relationship between graph density, feature richness, and GNN performance. When both features and structure are informative, even simple GCNs achieve excellent results. This raises a practical question: when is the graph necessary? On Coauthor Physics, a feature-only model (MLP) already achieves ~90%. The graph adds 3-5 points. Is that gap worth the complexity of graph-based training?

For production systems, this tradeoff is critical. If features alone already reach ~90% accuracy, graph-based methods must justify their additional complexity with clear accuracy or capability gains. Coauthor Physics provides a controlled setting to study this tradeoff.

Loading Coauthor Physics in PyG

load_coauthor_physics.py

from torch_geometric.datasets import Coauthor

dataset = Coauthor(root='/tmp/Coauthor', name='Physics')
data = dataset[0]

print(f"Nodes: {data.num_nodes}")        # 34493
print(f"Edges: {data.num_edges}")        # 495924
print(f"Features: {data.num_features}")  # 8415
print(f"Classes: {dataset.num_classes}") # 5

The API is the same as for Coauthor CS. The dataset has no standard train/validation/test split, so you need to create your own partition.
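Because the dataset ships without a split, one option is to sample a fixed number of nodes per class for training and validation and keep the rest for testing. The sketch below is written in plain Python so it runs without PyG installed. The function name, the per-class sizes (20 and 30), and the toy labels are assumptions chosen for illustration, not something this guide prescribes.

```python
import random
from collections import defaultdict


def per_class_split(labels, num_train_per_class=20, num_val_per_class=30, seed=0):
    """Return (train, val, test) index lists from a sequence of integer labels."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible

    # Group node indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    cut = num_train_per_class + num_val_per_class
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        train += indices[:num_train_per_class]
        val += indices[num_train_per_class:cut]
        test += indices[cut:]  # everything left over is test data
    return train, val, test


# Toy labels standing in for data.y.tolist(): 5 classes with 100 nodes each.
toy_labels = [i % 5 for i in range(500)]
train_idx, val_idx, test_idx = per_class_split(toy_labels)
print(len(train_idx), len(val_idx), len(test_idx))  # 100 150 250
```

With the real dataset, pass `data.y.tolist()` as `labels` and convert the index lists to boolean masks. PyG also provides a `RandomNodeSplit` transform in `torch_geometric.transforms` that may cover the same need. Whichever route you take, report the seed and the split sizes so results are comparable.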

Original Paper

Pitfalls of Graph Neural Network Evaluation

Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, Stephan Günnemann (2018). NeurIPS 2018 Workshop on Relational Representation Learning.


Benchmark comparison (random splits)

Method	Accuracy	Year	Paper
MLP (no graph)	~90.5%	--	Baseline
GCN	~93.6%	2017	Kipf & Welling
GAT	~94.2%	2018	Veličković et al.
GraphSAGE	~93.1%	2017	Hamilton et al.
GCNII	~95.3%	2020	Chen et al.
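To make the graph's contribution concrete, the table's numbers can be expressed as absolute gains over the MLP baseline and as the share of the baseline's errors that each method removes. This is a small sketch that uses only the approximate accuracies listed above; the function name is illustrative, and since the inputs are approximate, so are the outputs.

```python
# Approximate accuracies (%) copied from the benchmark table.
ACCURACY = {
    "MLP (no graph)": 90.5,
    "GCN": 93.6,
    "GAT": 94.2,
    "GraphSAGE": 93.1,
    "GCNII": 95.3,
}


def gain_over_baseline(acc, baseline):
    """Return (absolute gain in points, fraction of baseline error removed)."""
    absolute = acc - baseline
    relative = absolute / (100.0 - baseline)
    return absolute, relative


baseline = ACCURACY["MLP (no graph)"]
for method, acc in ACCURACY.items():
    if method == "MLP (no graph)":
        continue
    absolute, relative = gain_over_baseline(acc, baseline)
    print(f"{method}: +{absolute:.1f} points, {relative:.0%} of MLP errors removed")
```

Viewed this way, a gain of a few points is less trivial than it looks: GCN's roughly 3-point improvement removes about a third of the errors the MLP makes.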

Which Coauthor dataset should I use?

Coauthor CS (18,333 nodes, 15 classes, avg degree ~8.9) is sparser and has more classes, making it better for differentiating GNN architectures. Coauthor Physics (34,493 nodes, 5 classes, avg degree ~14.4) is denser, larger, and easier -- use it to test efficiency on high-dimensional features (8,415 dims). If you want a challenge, use CS. If you want a scalability and efficiency test, use Physics.

Common tasks and benchmarks

The standard task is node classification on researcher-created splits. GCN achieves ~93-95%, GAT ~94-95%, and an MLP (no graph) ~90-92%. The small gap between graph and non-graph methods reflects the high feature quality. Coauthor Physics is most useful for testing training efficiency: the 8,415-dimensional features create large weight matrices, and the 34K nodes push memory usage higher than Planetoid datasets.
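A back-of-the-envelope calculation shows where the cost comes from. The sketch below assumes a hidden size of 64 and a dense float32 feature matrix. Both are illustrative assumptions, not settings taken from this guide, and the helper names are hypothetical.

```python
NUM_NODES = 34_493
NUM_FEATURES = 8_415


def first_layer_params(num_features, hidden_dim, bias=True):
    """Parameter count of one linear map from input features to hidden units."""
    return num_features * hidden_dim + (hidden_dim if bias else 0)


def dense_feature_bytes(num_nodes, num_features, bytes_per_value=4):
    """Memory for a dense node-feature matrix (4 bytes per value = float32)."""
    return num_nodes * num_features * bytes_per_value


hidden_dim = 64  # assumed hidden size, for illustration only
params = first_layer_params(NUM_FEATURES, hidden_dim)
feature_gb = dense_feature_bytes(NUM_NODES, NUM_FEATURES) / 1e9

print(f"First-layer parameters: {params:,}")           # 538,624
print(f"Dense float32 features: {feature_gb:.2f} GB")  # 1.16 GB
```

Under these assumptions the feature matrix alone takes more than a gigabyte, before counting activations, gradients, or optimizer state. That is why the dataset is a reasonable test of memory efficiency despite its modest node count.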

Data source

The Coauthor datasets were introduced by Shchur et al. (2018) and are derived from the Microsoft Academic Graph. PyG downloads the processed version automatically.

BibTeX citation

coauthor_physics.bib

@article{shchur2018pitfalls,
  title={Pitfalls of Graph Neural Network Evaluation},
  author={Shchur, Oleksandr and Mumme, Maximilian and Bojchevski, Aleksandar and G{\"u}nnemann, Stephan},
  journal={arXiv preprint arXiv:1811.05868},
  year={2018}
}

Cite Shchur et al. for the Coauthor Physics benchmark dataset.

Example: research talent mapping

A research lab wants to identify potential hires by mapping the physics collaboration landscape. Which researchers work at the intersection of multiple subfields? Who collaborates across traditional boundaries? GNN embeddings on Coauthor Physics can answer these questions: researchers whose embeddings place them between field clusters are likely interdisciplinary collaborators. Companies like Google DeepMind and Meta AI use similar graph-based approaches for talent intelligence.
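One simple way to approximate "between field clusters" is to look at how uncertain a trained classifier is about each researcher's subfield. The sketch below flags researchers whose predicted class distribution has high entropy. This is a stand-in technique chosen for illustration, not the method described above: the threshold, the function names, and the toy probabilities are all assumptions, and distances in embedding space would be an equally reasonable choice.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of a probability vector, scaled to the range [0, 1]."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))


def flag_interdisciplinary(prob_rows, threshold=0.6):
    """Indices of rows whose normalized entropy is at least `threshold`."""
    return [i for i, row in enumerate(prob_rows)
            if normalized_entropy(row) >= threshold]


# Toy softmax outputs over 5 subfields (hypothetical values, not real data).
toy_probs = [
    [0.96, 0.01, 0.01, 0.01, 0.01],  # clearly one subfield
    [0.45, 0.45, 0.04, 0.03, 0.03],  # split between two subfields
    [0.20, 0.20, 0.20, 0.20, 0.20],  # no clear subfield
]
print(flag_interdisciplinary(toy_probs))  # [1, 2]
```

High entropy can also mean the model is simply uncertain, for example for authors with few publications, so flagged researchers are candidates for review rather than confirmed interdisciplinary collaborators.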

From benchmark to production

Production research networks include papers, institutions, funding agencies, and research topics as additional node types. Collaboration is temporal: a co-authorship from 2024 is more relevant than one from 2010. And the task expands beyond classification to link prediction (who will collaborate next?), influence propagation (which ideas will spread?), and anomaly detection (which collaborations are unusual?).
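The temporal point can be sketched as an edge weight that decays with the age of a co-authorship. Exponential decay with a 5-year half-life is an assumption chosen here for illustration. The function name and parameters are hypothetical, and a production system would tune the decay or learn it from data.

```python
def recency_weight(year, current_year, half_life_years=5.0):
    """Exponential decay: the weight halves every `half_life_years`."""
    age = max(current_year - year, 0)  # treat future-dated edges as current
    return 0.5 ** (age / half_life_years)


for year in (2024, 2019, 2010):
    weight = recency_weight(year, current_year=2024)
    print(f"co-authorship from {year}: weight {weight:.3f}")
```

Weights like these could be passed as edge weights to a GNN layer that supports them, so that recent collaborations contribute more to each author's aggregated neighborhood.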
