
PROTEINS: Binary Protein Classification with 1,113 Graphs

PROTEINS is the largest of the classic trio of TUDataset graph classification benchmarks (MUTAG, ENZYMES, PROTEINS). With 1,113 protein structure graphs and a binary enzyme/non-enzyme task, it provides more statistically reliable results than MUTAG or ENZYMES while testing the same graph-level prediction capabilities.

PyTorch Geometric

TL;DR

  • PROTEINS has 1,113 protein graphs averaging 39 nodes and 146 edges. Node features are 3-dimensional. The binary task classifies proteins as enzymes or non-enzymes.
  • Larger than MUTAG (188) and ENZYMES (600), PROTEINS provides more reliable evaluation for graph classification methods.
  • GCN achieves ~72-75%, GIN ~73-76%. Moderate accuracy reflects the inherent difficulty of predicting function from secondary structure alone.
  • PROTEINS demonstrates protein function prediction from structure, the same task that drives billion-dollar drug discovery pipelines.

At a glance: 1,113 graphs · ~39 avg nodes · 3 node features · 2 classes

What PROTEINS contains

PROTEINS is a dataset of 1,113 protein tertiary structure graphs. Each graph represents a protein, with nodes as secondary structure elements (helices, sheets, turns) and edges connecting elements that are spatially close or sequentially adjacent. Nodes have 3-dimensional features encoding structural properties. The binary classification task distinguishes enzymes (proteins that catalyze chemical reactions) from non-enzymes.

Graphs average 39 nodes and 146 edges, making them moderately sized. The 1,113 graphs are enough for 10-fold cross-validation with reasonable statistical power, addressing the high-variance problem that plagues smaller datasets like MUTAG (188 graphs).

Why PROTEINS matters

PROTEINS fills the gap between MUTAG (too small for reliable results) and production molecular datasets (too complex for learning purposes). With 1,113 graphs and a binary task, it is the right size for meaningful benchmarking of graph classification architectures without requiring specialized molecular ML knowledge.

The binary enzyme/non-enzyme task also tests a fundamental biological question: can you predict what a protein does from how it folds? This structure-function relationship is central to computational biology. GNNs learn that certain structural motifs (like specific arrangements of helices near a binding pocket) are predictive of enzyme function.

Loading PROTEINS in PyG

load_proteins.py
from torch_geometric.datasets import TUDataset
from torch_geometric.loader import DataLoader

dataset = TUDataset(root='/tmp/PROTEINS', name='PROTEINS')
print(f"Graphs: {len(dataset)}")          # 1113
print(f"Features: {dataset.num_features}")  # 3
print(f"Classes: {dataset.num_classes}")    # 2

loader = DataLoader(dataset, batch_size=64, shuffle=True)

Standard TUDataset API. Larger batch sizes work well due to moderate graph sizes.

Original Paper

Protein Function Prediction via Graph Kernels

Karsten M. Borgwardt, Cheng Soon Ong, Stefan Schönauer, S.V.N. Vishwanathan, Alex J. Smola, Hans-Peter Kriegel (2005). Bioinformatics, 21(Suppl 1), i47-i56

Benchmark comparison (10-fold cross-validation)

| Method | Accuracy | Year | Paper |
|---|---|---|---|
| WL kernel | ~75.0% | 2011 | Shervashidze et al. |
| GCN + mean pool | ~72.7% | 2017 | Kipf & Welling |
| GraphSAGE | ~72.5% | 2017 | Hamilton et al. |
| GIN | 76.2% | 2019 | Xu et al. |
| PNA | ~76.5% | 2020 | Corso et al. |
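The numbers above come from 10-fold cross-validation. A minimal split sketch (assuming scikit-learn is available; the stand-in label array mirrors PROTEINS' size and rough class balance, and in practice you would build `y` from the dataset via `np.array([int(d.y) for d in dataset])`):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Stand-in labels with a ~60/40 split, matching PROTEINS' 1,113 graphs.
y = np.array([0] * 663 + [1] * 450)

# Stratified folds keep the enzyme/non-enzyme ratio stable in every fold.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(np.zeros(len(y)), y):
    # Train on dataset[train_idx], evaluate on dataset[test_idx],
    # then report mean +/- std accuracy across the 10 folds.
    pass
```

Stratification matters here because plain random folds can drift several points from the true class ratio on a dataset this size, inflating fold-to-fold variance.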

Which graph classification dataset should I use?

MUTAG (188 graphs, 2 classes) is tiny and fast -- a sanity check only. ENZYMES (600 graphs, 6 classes) is the hardest, with ~50% accuracy, and best for testing expressiveness. PROTEINS (1,113 graphs, 2 classes) is the largest and most statistically reliable for binary graph classification. Use PROTEINS as your primary benchmark for graph classification, ENZYMES to stress-test expressiveness, and MUTAG only as a quick debug check.

Common tasks and benchmarks

Binary graph classification with 10-fold cross-validation. GCN with global mean pooling: ~72-75%. GIN: ~73-76%. PNA: ~74-77%. GraphSAGE: ~72-75%. The narrow spread between methods suggests that the structural features are the bottleneck, not the GNN architecture. More expressive readout functions (Set2Set, attention-based pooling) sometimes help more than changing the message-passing layer.

Data source

PROTEINS is part of the TUDataset collection. The original data comes from protein structure databases. The graph version is available from the TUDataset benchmark suite. PyG downloads it automatically via the TUDataset loader.

BibTeX citation

proteins.bib
@article{borgwardt2005protein,
  title={Protein Function Prediction via Graph Kernels},
  author={Borgwardt, Karsten M. and Ong, Cheng Soon and Sch{\"o}nauer, Stefan and Vishwanathan, S.V.N. and Smola, Alex J. and Kriegel, Hans-Peter},
  journal={Bioinformatics},
  volume={21},
  number={Suppl 1},
  pages={i47--i56},
  year={2005}
}

@article{morris2020tudataset,
  title={TUDataset: A collection of benchmark datasets for learning with graphs},
  author={Morris, Christopher and Kriege, Nils M. and Bause, Franka and Kersting, Kristian and Mutzel, Petra and Neumann, Marion},
  journal={arXiv preprint arXiv:2007.08663},
  year={2020}
}

Cite Borgwardt et al. for PROTEINS, Morris et al. for the TUDataset benchmark collection.

Example: protein engineering

Biotech companies design novel proteins for therapeutic and industrial applications: enzymes that break down pollutants, antibodies that target cancer cells, or catalysts for chemical manufacturing. Predicting whether a designed protein will function as an enzyme (or some other functional class) from its predicted structure accelerates the design-build-test cycle from months to days. PROTEINS benchmarks the core classification step.

From benchmark to production

Production protein classification uses orders of magnitude more information: amino acid sequences (1D), predicted 3D coordinates (from AlphaFold), evolutionary conservation scores, and databases with millions of characterized proteins. The graph representation captures spatial relationships that sequence alone misses, but combining graph structure with sequence information yields the best production results.

Frequently asked questions

What is the PROTEINS dataset?

PROTEINS contains 1,113 protein graphs where nodes (avg 39) represent secondary structure elements and edges (avg 146) represent spatial or sequential adjacency. Each graph has 3-dimensional node features. The binary task classifies proteins as enzymes or non-enzymes.

How do I load PROTEINS in PyTorch Geometric?

Use `from torch_geometric.datasets import TUDataset; dataset = TUDataset(root='/tmp/PROTEINS', name='PROTEINS')`. Standard TUDataset API with DataLoader for batching.

How does PROTEINS compare to ENZYMES?

PROTEINS has more graphs (1,113 vs 600), slightly larger graphs (avg 39 vs 33 nodes), but only 2 classes (enzyme vs non-enzyme) compared to ENZYMES' 6 classes. The binary task is easier, but the larger dataset provides more reliable evaluation.

What accuracy should I expect on PROTEINS?

GCN achieves ~72-75%, GIN ~73-76%. Results above 78% are competitive. The binary classification is easier than ENZYMES' 6-class task, but accuracy is still moderate because structural features alone are noisy predictors of enzyme function.

Is PROTEINS useful for drug discovery?

PROTEINS demonstrates the concept of predicting protein function from structure, which is core to drug discovery. However, production protein classification uses much richer features (amino acid sequences, 3D coordinates, evolutionary data) and larger datasets. PROTEINS is a teaching tool, not a production benchmark.

Learn more about graph ML

PyTorch Geometric is the open-source foundation for graph neural networks. Explore more layers, concepts, and production patterns.