Coauthor Physics: A Dense, Feature-Rich Collaboration Network

Coauthor Physics maps 34,493 physics researchers and their collaboration patterns. With 8,415-dimensional features and a tightly connected graph, it shows what happens when GNNs have abundant information: accuracies soar past 93%, shifting the focus from accuracy to efficiency.

TL;DR

  1. Coauthor Physics has 34,493 author nodes, 495,924 co-authorship edges, 8,415 keyword features, and 5 physics subfield labels.
  2. The dense graph (avg degree ~14.4) and rich features make this an easy classification task. GCN achieves 93-95%. The challenge is efficiency, not accuracy.
  3. With 8,415 features per node, this is one of the highest-dimensional standard GNN benchmarks. Feature reduction vs. full-dimensional training is a meaningful design choice.
  4. Physics collaboration is tightly knit: the graph is more densely interconnected than Coauthor CS (avg degree ~14.4 vs. ~8.9). This high connectivity benefits simple aggregation methods.

What Coauthor Physics contains

Coauthor Physics is a collaboration network from the Microsoft Academic Graph covering physics research. Each of the 34,493 nodes is a physics researcher. The 495,924 edges connect co-authors; this is the count PyG reports, with each undirected co-authorship stored once in each direction. Node features are 8,415-dimensional keyword vectors summarizing each author's publication topics. The 5 classes represent major physics subfields.

The physics community is more tightly connected than CS: the average degree is ~14.4, about 1.6 times that of Coauthor CS (~8.9). Physics papers often have large author lists (especially in experimental physics), creating dense collaboration clusters. This density benefits GNN aggregation -- each author's neighborhood contains rich information about their research field.
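The average-degree figures follow directly from the node and edge counts. Here is a quick check, assuming the edge counts are the directed totals that PyG reports. The Coauthor CS edge count (163,788) is taken from the PyG dataset documentation rather than from this article, so treat it as an assumption.

```python
# Average degree = directed edge count / node count, because each undirected
# co-authorship is stored as two directed edges.
DATASETS = {
    "Physics": {"nodes": 34_493, "edges": 495_924},
    "CS": {"nodes": 18_333, "edges": 163_788},  # assumed from the PyG docs
}


def average_degree(nodes: int, edges: int) -> float:
    """Mean number of co-authors per researcher."""
    return edges / nodes


physics_deg = average_degree(**DATASETS["Physics"])
cs_deg = average_degree(**DATASETS["CS"])
degree_ratio = physics_deg / cs_deg

print(f"Physics: {physics_deg:.1f}")   # Physics: 14.4
print(f"CS: {cs_deg:.1f}")             # CS: 8.9
print(f"Ratio: {degree_ratio:.2f}")    # Ratio: 1.61
```

On a loaded graph, the same number should come out of `data.num_edges / data.num_nodes`.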

Why Coauthor Physics matters

Coauthor Physics is valuable not for its difficulty (it is easy) but for what it teaches about the relationship between graph density, feature richness, and GNN performance. When both features and structure are informative, even simple GCNs achieve excellent results. This raises a practical question: when is the graph necessary? On Coauthor Physics, a feature-only model (MLP) already achieves ~90%. The graph adds 3-5 points. Is that gap worth the complexity of graph-based training?

For production systems, this tradeoff is critical. If features alone already reach about 90% accuracy, graph-based methods must justify their additional complexity with clear accuracy or capability gains. Coauthor Physics provides a controlled setting to study this tradeoff.

Loading Coauthor Physics in PyG

load_coauthor_physics.py
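from torch_geometric.datasets import Coauthor

# Downloads the processed dataset into `root` on first use, then loads it from disk.
dataset = Coauthor(root='/tmp/Coauthor', name='Physics')
data = dataset[0]  # the dataset holds a single graph

print(f"Nodes: {data.num_nodes}")        # 34493
print(f"Edges: {data.num_edges}")        # 495924 (each co-authorship stored in both directions)
print(f"Features: {data.num_features}")  # 8415
print(f"Classes: {dataset.num_classes}") # 5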

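This is the same `Coauthor` class used for Coauthor CS; only the `name` argument changes. The dataset ships with no standard train/validation/test split, so you need to create your own partition.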
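Because no split is provided, one common approach is to sample a fixed number of labeled nodes per class. The sketch below is a plain-Python illustration rather than a PyG API: `per_class_split` is a hypothetical helper, and its defaults (20 training and 30 validation nodes per class, the rest for testing) mirror the protocol described by Shchur et al. but are meant to be adjusted.

```python
import random
from collections import defaultdict


def per_class_split(labels, num_train_per_class=20, num_val_per_class=30, seed=0):
    """Return (train, val, test) index lists from a sequence of integer labels."""
    rng = random.Random(seed)  # local RNG so the split is reproducible

    # Group node indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        cut = num_train_per_class + num_val_per_class
        train += indices[:num_train_per_class]
        val += indices[num_train_per_class:cut]
        test += indices[cut:]  # everything left over is test data
    return train, val, test


# Toy labels standing in for `data.y.tolist()`: 5 classes, 100 nodes each.
toy_labels = [i % 5 for i in range(500)]
train_idx, val_idx, test_idx = per_class_split(toy_labels)
print(len(train_idx), len(val_idx), len(test_idx))  # 100 150 250
```

With the real dataset you would pass `data.y.tolist()` and turn each index list into a boolean mask on `data`. PyG also provides a `RandomNodeSplit` transform that can generate comparable masks; check its documentation for the exact options.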

Original Paper

Pitfalls of Graph Neural Network Evaluation

Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, Stephan Günnemann (2018). NeurIPS 2018 Workshop on Relational Representation Learning.

Benchmark comparison (random splits)

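| Method | Accuracy | Year | Paper |
| --- | --- | --- | --- |
| MLP (no graph) | ~90.5% | -- | Baseline |
| GCN | ~93.6% | 2017 | Kipf & Welling |
| GAT | ~94.2% | 2018 | Veličković et al. |
| GraphSAGE | ~93.1% | 2017 | Hamilton et al. |
| GCNII | ~95.3% | 2020 | Chen et al. |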
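A gain of a few accuracy points looks small, but it reads differently as a share of the baseline's errors. The sketch below recomputes the gap from the table above. The accuracies are the approximate figures listed there, so the derived percentages are only indicative, and `relative_error_reduction` is a name chosen for this illustration.

```python
# Approximate accuracies (in percent) copied from the comparison table.
MLP_ACCURACY = 90.5
GNN_ACCURACY = {"GCN": 93.6, "GAT": 94.2, "GraphSAGE": 93.1, "GCNII": 95.3}


def relative_error_reduction(baseline_acc: float, model_acc: float) -> float:
    """Fraction of the baseline's errors that the model eliminates."""
    baseline_err = 100.0 - baseline_acc
    model_err = 100.0 - model_acc
    return (baseline_err - model_err) / baseline_err


for name, acc in GNN_ACCURACY.items():
    gain = acc - MLP_ACCURACY
    reduction = relative_error_reduction(MLP_ACCURACY, acc)
    print(f"{name}: +{gain:.1f} points, {reduction:.0%} fewer errors than the MLP")
```

On these numbers the graph models remove roughly 27% to 51% of the MLP's errors. Whether that justifies graph-based training depends on how costly each misclassification is in your application.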

Which Coauthor dataset should I use?

Coauthor CS (18,333 nodes, 15 classes, avg degree ~8.9) is sparser and has more classes, making it better for differentiating GNN architectures. Coauthor Physics (34,493 nodes, 5 classes, avg degree ~14.4) is denser, larger, and easier -- use it to test efficiency on high-dimensional features (8,415 dims). If you want a challenge, use CS. If you want a scalability and efficiency test, use Physics.

Common tasks and benchmarks

The standard task is node classification on researcher-created splits. GCN achieves ~93-95%, GAT ~94-95%, and an MLP (no graph) ~90-92%. The small gap between graph and non-graph methods reflects the high feature quality. Coauthor Physics is most useful for testing training efficiency: with 8,415 input dimensions the first-layer weight matrix is large, and the 34K nodes push memory usage well above that of the Planetoid datasets.
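To put the efficiency point in numbers, here is a back-of-the-envelope estimate. It assumes the features are held as a dense float32 matrix and uses a hypothetical hidden size of 64. The Cora figures (2,708 nodes, 1,433 features) are included only as a Planetoid reference point and do not come from this article.

```python
BYTES_PER_FLOAT32 = 4
HIDDEN_DIM = 64  # hypothetical hidden size, chosen for illustration


def dense_feature_bytes(num_nodes: int, num_features: int) -> int:
    """Memory needed for a dense float32 node-feature matrix."""
    return num_nodes * num_features * BYTES_PER_FLOAT32


def first_layer_params(num_features: int, hidden_dim: int) -> int:
    """Weights in the first linear/GCN layer (bias excluded)."""
    return num_features * hidden_dim


physics_bytes = dense_feature_bytes(34_493, 8_415)
cora_bytes = dense_feature_bytes(2_708, 1_433)  # assumed Planetoid reference

print(f"Physics features: {physics_bytes / 1e9:.2f} GB")  # Physics features: 1.16 GB
print(f"Cora features: {cora_bytes / 1e6:.1f} MB")        # Cora features: 15.5 MB
print(f"Physics first-layer weights: {first_layer_params(8_415, HIDDEN_DIM):,}")  # 538,560
print(f"Cora first-layer weights: {first_layer_params(1_433, HIDDEN_DIM):,}")     # 91,712
```

The feature matrix, not the model, dominates memory here. This is why reducing the feature dimension or keeping the features sparse is worth considering before reaching for a larger GPU.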

Data source

The Coauthor datasets were introduced by Shchur et al. (2018) and are derived from the Microsoft Academic Graph. PyG downloads the processed version automatically.

BibTeX citation

coauthor_physics.bib
@article{shchur2018pitfalls,
  title={Pitfalls of Graph Neural Network Evaluation},
  author={Shchur, Oleksandr and Mumme, Maximilian and Bojchevski, Aleksandar and G{\"u}nnemann, Stephan},
  journal={arXiv preprint arXiv:1811.05868},
  year={2018}
}

Cite Shchur et al. for the Coauthor Physics benchmark dataset.

Example: research talent mapping

A research lab wants to identify potential hires by mapping the physics collaboration landscape. Which researchers work at the intersection of multiple subfields? Who collaborates across traditional boundaries? GNN embeddings on Coauthor Physics can help answer these questions: researchers whose embeddings place them between field clusters are likely interdisciplinary collaborators. Similar graph-based approaches can be applied to talent intelligence on larger, proprietary research networks.
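One simple proxy for "sitting between field clusters" is the entropy of a model's predicted subfield distribution: a specialist gets a peaked distribution, a cross-field collaborator a flatter one. This swaps in prediction entropy for a true embedding-space analysis, so read it as a rough sketch. The author names and logits below are made up for illustration.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Shannon entropy scaled to [0, 1]: 0 = one subfield, 1 = evenly spread."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))


# Hypothetical model outputs over the 5 physics subfields.
author_logits = {
    "specialist": [8.0, 0.0, 0.0, 0.0, 0.0],
    "bridge": [3.0, 3.0, 0.0, 0.0, 0.0],
}

scores = {name: normalized_entropy(softmax(l)) for name, l in author_logits.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # ['bridge', 'specialist']
```

High-entropy authors are candidates for manual review, not confirmed interdisciplinary researchers. A flat distribution can also mean the model is simply uncertain.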

From benchmark to production

Production research networks include papers, institutions, funding agencies, and research topics as additional node types. Collaboration is temporal: a co-authorship from 2024 is more relevant than one from 2010. And the task expands beyond classification to link prediction (who will collaborate next?), influence propagation (which ideas will spread?), and anomaly detection (which collaborations are unusual?).
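One way to encode the recency idea is to weight each co-authorship edge by an exponential decay on its age. This is a minimal sketch under stated assumptions: the 5-year half-life, the 2024 reference year, and the `recency_weight` helper are illustrative choices, not part of the Coauthor dataset, which carries no timestamps.

```python
def recency_weight(year: int, reference_year: int = 2024,
                   half_life_years: float = 5.0) -> float:
    """Edge weight that halves every `half_life_years` of age."""
    age = max(reference_year - year, 0)  # future-dated edges count as current
    return 0.5 ** (age / half_life_years)


# Hypothetical timestamped co-authorships: (author_u, author_v, year).
coauthorships = [("a", "b", 2024), ("a", "c", 2019), ("a", "d", 2010)]

for u, v, year in coauthorships:
    print(f"{u}-{v} ({year}): weight {recency_weight(year):.3f}")
```

The resulting weights could be passed as edge weights to layers that accept them. The half-life is a tuning parameter that should be validated against the downstream task.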

Frequently asked questions

What is the Coauthor Physics dataset?

Coauthor Physics is an academic collaboration network of 34,493 physics researchers. Edges (495,924) connect co-authors. Features are 8,415-dimensional keyword vectors. The task is to classify authors into 5 physics subfields.

How does Coauthor Physics compare to Coauthor CS?

Physics is larger (34K vs 18K nodes, 496K vs 164K edges) with higher-dimensional features (8,415 vs 6,805) but fewer classes (5 vs 15). The physics community is more tightly connected, with higher average degree (~14.4 vs ~8.9).

How do I load Coauthor Physics in PyTorch Geometric?

Use `from torch_geometric.datasets import Coauthor; dataset = Coauthor(root='/tmp/Coauthor', name='Physics')`. Same API as Coauthor CS.

What accuracy should I expect on Coauthor Physics?

GCN achieves ~93-95% accuracy. The combination of high-dimensional features, dense graph structure, and only 5 classes makes this one of the easier benchmarks. It is more useful for testing scalability than for differentiating architectures.

Why does Physics have higher accuracy than CS?

Fewer classes (5 vs 15) means an easier classification task. The denser graph (avg degree 14.4 vs 8.9) provides more neighborhood information. And the higher-dimensional features (8,415) give richer author profiles. All three factors contribute to higher accuracy.
