
Coauthor CS: Learning Research Fields from Collaboration Patterns

Coauthor CS maps the collaboration structure of 18,333 computer science researchers. Co-authorship is one of the strongest signals of shared research interest, and GNNs exploit this structure to predict research fields with over 91% accuracy.


TL;DR

  • Coauthor CS has 18,333 author nodes, 163,788 co-authorship edges, 6,805 keyword features, and 15 CS research field labels.
  • Co-authorship is a strong signal: researchers who publish together almost always work in the same field. This high homophily makes the task relatively easy for GNNs (91%+ accuracy).
  • The 6,805-dimensional features are unusually rich, providing detailed research topic profiles that help even without graph structure.
  • Coauthor networks model professional relationships -- the same structure that drives enterprise org graphs, partner networks, and collaboration platforms.

18,333 nodes | 163,788 edges | 6,805 features | 15 classes

What Coauthor CS contains

Coauthor CS is an academic collaboration network from the Microsoft Academic Graph. Each of the 18,333 nodes represents a computer science researcher. An edge connects two researchers if they co-authored at least one paper. Node features are 6,805-dimensional keyword vectors representing the research topics an author has published on. The 15 classes correspond to CS research fields (AI, databases, theory, systems, etc.).

The dataset exhibits strong homophily: co-authors usually work in the same field. An AI researcher co-authors with other AI researchers, a database researcher with other database researchers. This makes neighborhood aggregation highly informative -- a node's co-authors are excellent predictors of its own field.
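One common way to quantify this is the edge homophily ratio: the fraction of edges whose two endpoints share a label. The sketch below is a minimal pure-Python illustration on a hypothetical five-node toy graph (the names `edge_homophily`, `toy_labels`, and `toy_edges` are made up for this example); it is not a measurement of Coauthor CS itself.

```python
def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints share the same label."""
    if not edges:
        raise ValueError("need at least one edge")
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)


# Hypothetical toy graph: nodes 0-2 are "AI" (label 0), nodes 3-4 are "databases" (label 1).
toy_labels = [0, 0, 0, 1, 1]
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (2, 3)]  # one cross-field edge: (2, 3)

ratio = edge_homophily(toy_edges, toy_labels)
print(f"Edge homophily: {ratio:.2f}")  # 4 of 5 edges are within-field -> 0.80
```

On the real dataset, the same function should work with `data.edge_index.t().tolist()` and `data.y.tolist()`. PyG also provides `torch_geometric.utils.homophily`, which is likely the more convenient option for large graphs.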

Why Coauthor CS matters

Coauthor CS represents a different kind of graph than citations or co-purchases: it models professional relationships between people. This structure appears throughout the enterprise, in employee collaboration networks, business partner ecosystems, advisor-client relationships, and cross-departmental project teams. GNNs that work on Coauthor CS can be adapted to organizational analytics, such as predicting team performance, identifying key collaborators, or detecting organizational silos.

The high-dimensional features (6,805) also make Coauthor CS useful for studying how features and structure interact. Because the keywords are very informative, models reach reasonable accuracy from features alone: the MLP baseline in the benchmark comparison scores ~88.5% with no graph input. Adding graph structure improves accuracy further (GCN reaches ~91.1%), but the margin is smaller than on Cora, where features are less descriptive.
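To make the role of structure concrete, here is a minimal sketch of one round of mean neighborhood aggregation on a hypothetical three-node path graph. It is a deliberate simplification: a real GCN layer adds learned weights and degree-based normalization, and the names `mean_aggregate`, `toy_features`, and `toy_neighbors` are invented for illustration.

```python
def mean_aggregate(features, neighbors):
    """Average each node's own feature vector with those of its neighbors."""
    out = []
    for node, own in enumerate(features):
        group = [own] + [features[n] for n in neighbors[node]]
        out.append([sum(vec[d] for vec in group) / len(group) for d in range(len(own))])
    return out


# Hypothetical keyword profiles over 3 keywords; node 2 has no recorded keywords.
toy_features = [
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0],
]
toy_neighbors = {0: [1], 1: [0, 2], 2: [1]}  # path graph 0 - 1 - 2

smoothed = mean_aggregate(toy_features, toy_neighbors)
print(smoothed[2])  # [0.5, 0.5, 0.0]: node 2 inherits signal from its co-author
```

The node with an empty profile picks up its co-author's keywords after aggregation. When features are already informative, as the keyword vectors here are, this extra signal matters less, which is consistent with the smaller margin between feature-only and graph-based models.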

Loading Coauthor CS in PyG

load_coauthor_cs.py
from torch_geometric.datasets import Coauthor

dataset = Coauthor(root='/tmp/Coauthor', name='CS')
data = dataset[0]

print(f"Nodes: {data.num_nodes}")        # 18333
print(f"Edges: {data.num_edges}")        # 163788
print(f"Features: {data.num_features}")  # 6805
print(f"Classes: {dataset.num_classes}") # 15
print(f"Avg degree: {data.num_edges / data.num_nodes:.1f}")  # ~8.9

No standard split provided. Create random or stratified train/val/test masks.
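A stratified split can be sketched with the standard library alone. The example below assumes that "20/30/50" means percentages of each class for train/val/test; that reading is an assumption, and the function name `stratified_split` and its defaults are hypothetical rather than part of PyG.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_frac=0.2, val_frac=0.3, seed=0):
    """Return (train, val, test) index lists with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        n_train = int(len(members) * train_frac)
        n_val = int(len(members) * val_frac)
        train += members[:n_train]
        val += members[n_train:n_train + n_val]
        test += members[n_train + n_val:]  # everything left over goes to test
    return train, val, test


# Hypothetical labels: 3 classes with 10 nodes each.
toy_labels = [c for c in range(3) for _ in range(10)]
train_idx, val_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(val_idx), len(test_idx))  # 6 9 15
```

With the real dataset, pass `data.y.tolist()` as `labels`, then build boolean masks, for example `mask = torch.zeros(data.num_nodes, dtype=torch.bool); mask[train_idx] = True`. PyG's `torch_geometric.transforms.RandomNodeSplit` is an alternative that writes the masks directly. Whichever route you take, fix the seed and report it alongside results.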

Original Paper

Pitfalls of Graph Neural Network Evaluation

Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, Stephan Günnemann (2018). NeurIPS 2018 Workshop on Relational Representation Learning.


Benchmark comparison (random 20/30/50 splits)

Method | Accuracy | Year | Paper
--- | --- | --- | ---
MLP (no graph) | ~88.5% | -- | Baseline
GCN | ~91.1% | 2017 | Kipf & Welling
GAT | ~92.3% | 2018 | Velickovic et al.
GraphSAGE | ~91.5% | 2017 | Hamilton et al.
GCNII | ~93.0% | 2020 | Chen et al.

Which Coauthor dataset should I use?

Coauthor CS (18,333 nodes, 15 classes) has more classes and sparser structure (avg degree ~8.9), so of the two it leaves more room for architectures to separate. Coauthor Physics (34,493 nodes, 5 classes) is larger and denser (avg degree ~14.4) but has fewer classes, which makes it easier. Among citation networks, Coauthor CS is most directly comparable to PubMed (similar scale, both academic graphs), but it models people rather than papers.
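The average degrees quoted above follow directly from the node and edge counts. In this small check, the Coauthor Physics edge count (495,924) does not appear in this article; it is taken from the PyG dataset statistics and should be treated as an outside figure.

```python
# Counts as reported for the PyG Coauthor datasets.
# The Physics edge count is an outside figure, not stated in this article.
stats = {
    "Coauthor CS": {"nodes": 18_333, "edges": 163_788, "classes": 15},
    "Coauthor Physics": {"nodes": 34_493, "edges": 495_924, "classes": 5},
}

avg_degree = {name: s["edges"] / s["nodes"] for name, s in stats.items()}
for name, degree in avg_degree.items():
    print(f"{name}: avg degree {degree:.1f}")
# Coauthor CS: avg degree 8.9
# Coauthor Physics: avg degree 14.4
```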

Common tasks and benchmarks

Node classification is the primary task: predict each author's research field. Without a standard split, results vary by partition strategy. With a typical 20/30/50 train/val/test split, GCN achieves ~91-93%, GAT ~92-93%, and GraphSAGE ~91-93%. The high baseline across all methods reflects the strong homophily in collaboration networks.

Link prediction is also natural: predict future co-authorships. This maps to collaboration recommendation -- suggesting potential co-authors based on research overlap and network proximity.
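A simple non-learned baseline for this task scores each pair of researchers who have not yet collaborated by how much their co-author sets overlap. The sketch below uses the Jaccard coefficient on a hypothetical toy graph; it illustrates only the "network proximity" half of the idea and is not a GNN-based link predictor. The names `jaccard_scores` and `toy_graph` are invented for this example.

```python
from itertools import combinations


def jaccard_scores(adjacency):
    """Score every non-adjacent node pair by the Jaccard overlap of their neighbor sets."""
    scores = {}
    for u, v in combinations(sorted(adjacency), 2):
        if v in adjacency[u]:
            continue  # already co-authors, nothing to recommend
        union = adjacency[u] | adjacency[v]
        if union:
            scores[(u, v)] = len(adjacency[u] & adjacency[v]) / len(union)
    return scores


# Hypothetical co-authorship graph stored as neighbor sets.
toy_graph = {
    "ana": {"bo", "cy"},
    "bo": {"ana", "cy", "dee"},
    "cy": {"ana", "bo"},
    "dee": {"bo", "eli"},
    "eli": {"dee"},
}

scores = jaccard_scores(toy_graph)
ranked = sorted(scores.items(), key=lambda item: -item[1])
for pair, score in ranked:
    print(pair, round(score, 2))
```

Pairs with a shared co-author, such as ("ana", "dee"), score 1/3 here, while pairs with no common neighbor score 0. A learned model would typically combine this kind of structural signal with the keyword features.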

Data source

The Coauthor datasets were introduced by Shchur et al. (2018) and are derived from the Microsoft Academic Graph. PyG downloads the processed version automatically.

BibTeX citation

coauthor_cs.bib
@article{shchur2018pitfalls,
  title={Pitfalls of Graph Neural Network Evaluation},
  author={Shchur, Oleksandr and Mumme, Maximilian and Bojchevski, Aleksandar and G{\"u}nnemann, Stephan},
  journal={arXiv preprint arXiv:1811.05868},
  year={2018}
}

Cite Shchur et al. for the Coauthor CS benchmark dataset.

Example: organizational analytics

A large enterprise wants to understand its internal collaboration patterns. Employees are nodes, joint projects are edges, and skills are features. The task: classify employees into functional groups and identify cross-functional collaborators who bridge organizational silos. This is structurally identical to Coauthor CS, with employees replacing researchers and business units replacing research fields.
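As a rough illustration of the "bridging" part of this task, the sketch below computes, for each employee, the share of their collaborators who sit in a different business unit. The graph, the unit assignments, and the function name `cross_group_fraction` are all hypothetical; in practice the groups could come from known org data or from a trained classifier's predictions.

```python
def cross_group_fraction(adjacency, group):
    """For each node, the share of its neighbors that belong to a different group."""
    fractions = {}
    for node, neighbors in adjacency.items():
        if not neighbors:
            fractions[node] = 0.0  # isolated nodes bridge nothing
            continue
        outside = sum(1 for n in neighbors if group[n] != group[node])
        fractions[node] = outside / len(neighbors)
    return fractions


# Hypothetical joint-project graph and business-unit assignments.
toy_projects = {
    "ana": {"bo", "cy"},
    "bo": {"ana", "cy", "dee"},
    "cy": {"ana", "bo"},
    "dee": {"bo", "eli"},
    "eli": {"dee"},
}
toy_units = {"ana": "eng", "bo": "eng", "cy": "eng", "dee": "sales", "eli": "sales"}

bridges = cross_group_fraction(toy_projects, toy_units)
top_bridge = max(bridges, key=bridges.get)
print(top_bridge, bridges[top_bridge])  # dee 0.5
```

Employees with a high fraction are candidates for cross-functional collaborators; a unit whose members all score near zero may indicate a silo.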

From benchmark to production

Production collaboration networks are dynamic (people join, leave, change roles), heterogeneous (employees, contractors, partners, clients), and multi-relational (co-authored, reviewed, managed, mentored). Coauthor CS is a static, homogeneous snapshot. Bridging this gap requires temporal modeling, heterogeneous message passing, and continuous updating as the graph evolves.
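One way to picture the gap is to store interactions as typed, timestamped events and derive static snapshots from them on demand. The schema below is purely hypothetical (the `Interaction` class, its fields, and the `snapshot` helper are assumptions for illustration) and says nothing about how a production system would actually model time or edge types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    """One timestamped, typed edge between two people (hypothetical schema)."""
    source: str
    target: str
    relation: str  # e.g. "co_authored", "reviewed", "mentored"
    year: int


def snapshot(interactions, relation, up_to_year):
    """Edges of one relation type observed up to and including `up_to_year`."""
    return [
        (i.source, i.target)
        for i in interactions
        if i.relation == relation and i.year <= up_to_year
    ]


# Hypothetical event log mixing relation types and years.
log = [
    Interaction("ana", "bo", "co_authored", 2021),
    Interaction("bo", "cy", "reviewed", 2022),
    Interaction("ana", "cy", "co_authored", 2023),
    Interaction("dee", "ana", "mentored", 2023),
]

edges_2022 = snapshot(log, "co_authored", 2022)
edges_2023 = snapshot(log, "co_authored", 2023)
print(edges_2022)  # [('ana', 'bo')]
print(edges_2023)  # [('ana', 'bo'), ('ana', 'cy')]
```

Coauthor CS corresponds to a single such snapshot of a single relation type. Keeping the underlying event log is what makes temporal and multi-relational modeling possible later.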

Frequently asked questions

What is the Coauthor CS dataset?

Coauthor CS is an academic collaboration network where 18,333 nodes represent CS authors. Edges (163,788) connect authors who co-authored a paper. Features are 6,805-dimensional keyword vectors representing research topics. The task is to classify authors into 15 research fields.

How do I load Coauthor CS in PyTorch Geometric?

Use `from torch_geometric.datasets import Coauthor; dataset = Coauthor(root='/tmp/Coauthor', name='CS')`. The dataset has no standard split -- create your own train/val/test partition.

How does Coauthor CS compare to citation networks?

Coauthor networks connect people (authors), while citation networks connect documents (papers). Coauthor edges represent collaboration, a stronger signal of shared research interest than citations. The feature dimensionality is also much higher (6,805 vs 1,433 in Cora).

Why are the features 6,805-dimensional?

Features are keyword vectors from the authors' papers. The high dimensionality reflects the large vocabulary of CS research topics. Each dimension indicates whether an author has published papers using that keyword, creating a rich profile of research interests.
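The description above can be sketched as a binary bag-of-keywords encoding. This is an illustration of the idea with a hypothetical five-word vocabulary, not the dataset's actual preprocessing pipeline; `keyword_vector` and `vocabulary` are made-up names.

```python
def keyword_vector(author_keywords, vocabulary):
    """Binary vector with a 1 for every vocabulary keyword the author has used."""
    position = {keyword: i for i, keyword in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for keyword in author_keywords:
        if keyword in position:  # keywords outside the vocabulary are ignored
            vector[position[keyword]] = 1
    return vector


# Hypothetical 5-keyword vocabulary; the real dataset has 6,805 dimensions.
vocabulary = ["graph", "query", "kernel", "proof", "compiler"]
profile = keyword_vector({"graph", "kernel", "robotics"}, vocabulary)
print(profile)  # [1, 0, 1, 0, 0]
```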

What is a good accuracy on Coauthor CS?

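GCN achieves ~91-93% accuracy depending on the split, and GAT and GraphSAGE achieve similar results. The high baseline reflects strong homophily: co-authors tend to work in the same field. Because the methods land within a few points of one another, small accuracy differences should be read with caution; the dataset is often more useful for testing efficiency than for ranking architectures on accuracy alone.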
