PROTEINS at a glance
Graphs: 1,113 | Avg nodes: ~39 | Node features: 3 | Classes: 2
What PROTEINS contains
PROTEINS is a dataset of 1,113 protein tertiary structure graphs. Each graph represents a protein, with nodes as secondary structure elements (helices, sheets, turns) and edges connecting elements that are spatially close or sequentially adjacent. Nodes have 3-dimensional features encoding structural properties. The binary classification task distinguishes enzymes (proteins that catalyze chemical reactions) from non-enzymes.
Graphs average 39 nodes and 146 edges, making them moderately sized. The 1,113 graphs are enough for 10-fold cross-validation with reasonable statistical power, addressing the high-variance problem that plagues smaller datasets like MUTAG (188 graphs).
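To make the cross-validation point concrete, here is a minimal pure-Python sketch of how 1,113 graphs split into 10 near-equal folds. In practice you would use a library utility such as scikit-learn's `StratifiedKFold` (which also balances class labels per fold); this sketch only illustrates the index bookkeeping.

```python
import random

def kfold_indices(n, k=10, seed=0):
    """Shuffle graph indices and split them into k near-equal folds,
    as used for 10-fold cross-validation on PROTEINS (n = 1113)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # distribute the remainder so fold sizes differ by at most one
    base, extra = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(idx[start:start + size])
        start += size
    return folds

folds = kfold_indices(1113)
print([len(f) for f in folds])  # three folds of 112, seven of 111
```

With ~111 test graphs per fold, a single misclassified graph moves fold accuracy by under 1%, which is why PROTEINS gives tighter error bars than MUTAG's ~19 test graphs per fold.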
Why PROTEINS matters
PROTEINS fills the gap between MUTAG (too small for reliable results) and production molecular datasets (too complex for learning purposes). With 1,113 graphs and a binary task, it is the right size for meaningful benchmarking of graph classification architectures without requiring specialized molecular ML knowledge.
The binary enzyme/non-enzyme task also tests a fundamental biological question: can you predict what a protein does from how it folds? This structure-function relationship is central to computational biology. GNNs learn that certain structural motifs (like specific arrangements of helices near a binding pocket) are predictive of enzyme function.
Loading PROTEINS in PyG
from torch_geometric.datasets import TUDataset
from torch_geometric.loader import DataLoader
dataset = TUDataset(root='/tmp/PROTEINS', name='PROTEINS')
print(f"Graphs: {len(dataset)}") # 1113
print(f"Features: {dataset.num_features}") # 3
print(f"Classes: {dataset.num_classes}") # 2
loader = DataLoader(dataset, batch_size=64, shuffle=True)
Standard TUDataset API. Larger batch sizes work well because the graphs are moderately sized.
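Under the hood, PyG's `DataLoader` collates a batch of graphs into one large disjoint graph and records which graph each node belongs to in a `batch` vector. A pure-Python sketch of that assignment (the real loader builds a tensor and also offsets edge indices, which is omitted here):

```python
def make_batch_vector(num_nodes_per_graph):
    """Mimic PyG's graph batching: concatenate graphs into one
    disjoint graph and record each node's graph of origin."""
    batch = []
    for graph_id, n in enumerate(num_nodes_per_graph):
        batch.extend([graph_id] * n)
    return batch

# three PROTEINS-sized graphs collated into one batch
batch = make_batch_vector([30, 45, 39])
print(len(batch))            # 114 nodes in the merged graph
print(batch[29], batch[30])  # 0 1 -- boundary between graph 0 and graph 1
```

Because message passing never crosses the disjoint components, batching this way is exact, not an approximation, and it is why batch size 64 costs roughly 64 × 39 ≈ 2,500 nodes per forward pass.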
Original Paper
Protein Function Prediction via Graph Kernels
Karsten M. Borgwardt, Cheng Soon Ong, Stefan Schönauer, S.V.N. Vishwanathan, Alex J. Smola, Hans-Peter Kriegel (2005). Bioinformatics, 21(Suppl 1), i47-i56
Benchmark comparison (10-fold cross-validation)
| Method | Accuracy | Year | Paper |
|---|---|---|---|
| WL kernel | ~75.0% | 2011 | Shervashidze et al. |
| GCN + mean pool | ~72.7% | 2017 | Kipf & Welling |
| GraphSAGE | ~72.5% | 2017 | Hamilton et al. |
| GIN | 76.2% | 2019 | Xu et al. |
| PNA | ~76.5% | 2020 | Corso et al. |
Which graph classification dataset should I use?
MUTAG (188 graphs, 2 classes) is tiny and fast -- a sanity check only. ENZYMES (600 graphs, 6 classes) is the hardest of the three, with typical accuracies around 50%, and is best for testing expressiveness. PROTEINS (1,113 graphs, 2 classes) is the largest and most statistically reliable for binary graph classification. Use PROTEINS as your primary benchmark for graph classification, ENZYMES to stress-test expressiveness, and MUTAG only as a quick debug check.
Common tasks and benchmarks
Binary graph classification with 10-fold cross-validation. GCN with global mean pooling: ~72-75%. GIN: ~73-76%. PNA: ~74-77%. GraphSAGE: ~72-75%. The narrow spread between methods suggests that the structural features are the bottleneck, not the GNN architecture. More expressive readout functions (Set2Set, attention-based pooling) sometimes help more than changing the message-passing layer.
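Since the readout can matter as much as the message-passing layer here, it helps to see what the baseline readout actually computes. Below is a pure-Python sketch of global mean pooling over a batched graph (the same operation as `torch_geometric.nn.global_mean_pool`, written out without tensors for illustration):

```python
def global_mean_pool(x, batch, num_graphs):
    """Average node features per graph -- the simplest readout:
    one fixed-size vector per graph, regardless of graph size."""
    dim = len(x[0])
    sums = [[0.0] * dim for _ in range(num_graphs)]
    counts = [0] * num_graphs
    for feats, g in zip(x, batch):
        counts[g] += 1
        for j, v in enumerate(feats):
            sums[g][j] += v
    return [[s / c for s in row] for row, c in zip(sums, counts)]

# two tiny graphs with 3-dim node features, as in PROTEINS
x = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0],   # graph 0
     [0.0, 0.0, 1.0], [0.0, 0.0, 3.0]]   # graph 1
batch = [0, 0, 1, 1]
print(global_mean_pool(x, batch, 2))  # [[0.5, 0.5, 0.0], [0.0, 0.0, 2.0]]
```

Mean pooling discards all information about how features are distributed across the graph, which is one reason more expressive readouts (Set2Set, attention pooling) can buy more accuracy on PROTEINS than swapping the convolution layer.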
Data source
PROTEINS is part of the TUDataset collection. The original data comes from protein structure databases. The graph version is available from the TUDataset benchmark suite. PyG downloads it automatically via the TUDataset loader.
BibTeX citation
@article{borgwardt2005protein,
title={Protein Function Prediction via Graph Kernels},
author={Borgwardt, Karsten M. and Ong, Cheng Soon and Sch{\"o}nauer, Stefan and Vishwanathan, S.V.N. and Smola, Alex J. and Kriegel, Hans-Peter},
journal={Bioinformatics},
volume={21},
number={Suppl 1},
pages={i47--i56},
year={2005}
}
@article{morris2020tudataset,
title={TUDataset: A collection of benchmark datasets for learning with graphs},
author={Morris, Christopher and Kriege, Nils M. and Bause, Franka and Kersting, Kristian and Mutzel, Petra and Neumann, Marion},
journal={arXiv preprint arXiv:2007.08663},
year={2020}
}
Cite Borgwardt et al. for PROTEINS and Morris et al. for the TUDataset benchmark collection.
Example: protein engineering
Biotech companies design novel proteins for therapeutic and industrial applications: enzymes that break down pollutants, antibodies that target cancer cells, or catalysts for chemical manufacturing. Predicting whether a designed protein will function as an enzyme (or some other functional class) from its predicted structure accelerates the design-build-test cycle from months to days. PROTEINS benchmarks the core classification step.
From benchmark to production
Production protein classification uses orders of magnitude more information: amino acid sequences (1D), predicted 3D coordinates (from AlphaFold), evolutionary conservation scores, and databases with millions of characterized proteins. The graph representation captures spatial relationships that sequence alone misses, but combining graph structure with sequence information yields the best production results.