
PROTEINS: Binary Protein Classification with 1,113 Graphs

PROTEINS is the largest of the classic trio of TUDataset graph classification benchmarks (MUTAG, ENZYMES, PROTEINS). With 1,113 protein structure graphs and a binary enzyme/non-enzyme task, it provides more statistically reliable results than MUTAG or ENZYMES while testing the same graph-level prediction capabilities.

PyTorch Geometric

TL;DR

  • PROTEINS has 1,113 protein graphs averaging 39 nodes and 146 edges. Node features are 3-dimensional. The binary task classifies proteins as enzymes or non-enzymes.
  • Larger than MUTAG (188) and ENZYMES (600), PROTEINS provides more reliable evaluation for graph classification methods.
  • GCN achieves ~72-75%, GIN ~73-76%. Moderate accuracy reflects the inherent difficulty of predicting function from secondary structure alone.
  • PROTEINS demonstrates protein function prediction from structure, the same task that drives billion-dollar drug discovery pipelines.

At a glance: 1,113 graphs · ~39 avg nodes · 3 node features · 2 classes

What PROTEINS contains

PROTEINS is a dataset of 1,113 protein tertiary structure graphs. Each graph represents a protein, with nodes as secondary structure elements (helices, sheets, turns) and edges connecting elements that are spatially close or sequentially adjacent. Nodes have 3-dimensional features encoding structural properties. The binary classification task distinguishes enzymes (proteins that catalyze chemical reactions) from non-enzymes.

Graphs average 39 nodes and 146 edges, making them moderately sized. The 1,113 graphs are enough for 10-fold cross-validation with reasonable statistical power, addressing the high-variance problem that plagues smaller datasets like MUTAG (188 graphs).

Why PROTEINS matters

PROTEINS fills the gap between MUTAG (too small for reliable results) and production molecular datasets (too complex for learning purposes). With 1,113 graphs and a binary task, it is the right size for meaningful benchmarking of graph classification architectures without requiring specialized molecular ML knowledge.

The binary enzyme/non-enzyme task also tests a fundamental biological question: can you predict what a protein does from how it folds? This structure-function relationship is central to computational biology. GNNs learn that certain structural motifs (like specific arrangements of helices near a binding pocket) are predictive of enzyme function.

Loading PROTEINS in PyG

load_proteins.py
from torch_geometric.datasets import TUDataset
from torch_geometric.loader import DataLoader

dataset = TUDataset(root='/tmp/PROTEINS', name='PROTEINS')
print(f"Graphs: {len(dataset)}")          # 1113
print(f"Features: {dataset.num_features}")  # 3
print(f"Classes: {dataset.num_classes}")    # 2

loader = DataLoader(dataset, batch_size=64, shuffle=True)

Standard TUDataset API. Larger batch sizes work well due to moderate graph sizes.

Original Paper

Protein Function Prediction via Graph Kernels

Karsten M. Borgwardt, Cheng Soon Ong, Stefan Schönauer, S.V.N. Vishwanathan, Alex J. Smola, Hans-Peter Kriegel (2005). Bioinformatics, 21(Suppl 1), i47-i56

Benchmark comparison (10-fold cross-validation)

| Method | Accuracy | Year | Paper |
|---|---|---|---|
| WL kernel | ~75.0% | 2011 | Shervashidze et al. |
| GCN + mean pool | ~72.7% | 2017 | Kipf & Welling |
| GraphSAGE | ~72.5% | 2017 | Hamilton et al. |
| GIN | 76.2% | 2019 | Xu et al. |
| PNA | ~76.5% | 2020 | Corso et al. |
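The numbers above come from 10-fold cross-validation. A minimal split sketch (assuming scikit-learn is available; the stand-in label array mirrors PROTEINS' size and rough class balance, and in practice you would build `y` from the dataset via `np.array([int(d.y) for d in dataset])`):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Stand-in labels with a ~60/40 split, matching PROTEINS' 1,113 graphs.
y = np.array([0] * 663 + [1] * 450)

# Stratified folds keep the enzyme/non-enzyme ratio stable in every fold.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(np.zeros(len(y)), y):
    # Train on dataset[train_idx], evaluate on dataset[test_idx],
    # then report mean +/- std accuracy across the 10 folds.
    pass
```

Stratification matters here because plain random folds can drift several points from the true class ratio on a dataset this size, inflating fold-to-fold variance.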

Which graph classification dataset should I use?

MUTAG (188 graphs, 2 classes) is tiny and fast -- a sanity check only. ENZYMES (600 graphs, 6 classes) is the hardest, with ~50% accuracy, and best for testing expressiveness. PROTEINS (1,113 graphs, 2 classes) is the largest and most statistically reliable for binary graph classification. Use PROTEINS as your primary benchmark for graph classification, ENZYMES to stress-test expressiveness, and MUTAG only as a quick debug check.

Common tasks and benchmarks

Binary graph classification with 10-fold cross-validation. GCN with global mean pooling: ~72-75%. GIN: ~73-76%. PNA: ~74-77%. GraphSAGE: ~72-75%. The narrow spread between methods suggests that the structural features are the bottleneck, not the GNN architecture. More expressive readout functions (Set2Set, attention-based pooling) sometimes help more than changing the message-passing layer.

Data source

PROTEINS is part of the TUDataset collection. The original data comes from protein structure databases. The graph version is available from the TUDataset benchmark suite. PyG downloads it automatically via the TUDataset loader.

BibTeX citation

proteins.bib
@article{borgwardt2005protein,
  title={Protein Function Prediction via Graph Kernels},
  author={Borgwardt, Karsten M. and Ong, Cheng Soon and Sch{\"o}nauer, Stefan and Vishwanathan, S.V.N. and Smola, Alex J. and Kriegel, Hans-Peter},
  journal={Bioinformatics},
  volume={21},
  number={Suppl 1},
  pages={i47--i56},
  year={2005}
}

@article{morris2020tudataset,
  title={TUDataset: A collection of benchmark datasets for learning with graphs},
  author={Morris, Christopher and Kriege, Nils M. and Bause, Franka and Kersting, Kristian and Mutzel, Petra and Neumann, Marion},
  journal={arXiv preprint arXiv:2007.08663},
  year={2020}
}

Cite Borgwardt et al. for PROTEINS, Morris et al. for the TUDataset benchmark collection.

Example: protein engineering

Biotech companies design novel proteins for therapeutic and industrial applications: enzymes that break down pollutants, antibodies that target cancer cells, or catalysts for chemical manufacturing. Predicting whether a designed protein will function as an enzyme (or some other functional class) from its predicted structure accelerates the design-build-test cycle from months to days. PROTEINS benchmarks the core classification step.

From benchmark to production

Production protein classification uses orders of magnitude more information: amino acid sequences (1D), predicted 3D coordinates (from AlphaFold), evolutionary conservation scores, and databases with millions of characterized proteins. The graph representation captures spatial relationships that sequence alone misses, but combining graph structure with sequence information yields the best production results.

Frequently asked questions

What is the PROTEINS dataset?

PROTEINS contains 1,113 protein graphs where nodes (avg 39) represent secondary structure elements and edges (avg 146) represent spatial or sequential adjacency. Each graph has 3-dimensional node features. The binary task classifies proteins as enzymes or non-enzymes.

How do I load PROTEINS in PyTorch Geometric?

Use `from torch_geometric.datasets import TUDataset; dataset = TUDataset(root='/tmp/PROTEINS', name='PROTEINS')`. Standard TUDataset API with DataLoader for batching.

How does PROTEINS compare to ENZYMES?

PROTEINS has more graphs (1,113 vs 600), slightly larger graphs (avg 39 vs 33 nodes), but only 2 classes (enzyme vs non-enzyme) compared to ENZYMES' 6 classes. The binary task is easier, but the larger dataset provides more reliable evaluation.

What accuracy should I expect on PROTEINS?

GCN achieves ~72-75%, GIN ~73-76%. Results above 78% are competitive. The binary classification is easier than ENZYMES' 6-class task, but accuracy is still moderate because structural features alone are noisy predictors of enzyme function.

Is PROTEINS useful for drug discovery?

PROTEINS demonstrates the concept of predicting protein function from structure, which is core to drug discovery. However, production protein classification uses much richer features (amino acid sequences, 3D coordinates, evolutionary data) and larger datasets. PROTEINS is a teaching tool, not a production benchmark.

Learn more about graph ML

PyTorch Geometric is the open-source foundation for graph neural networks. Explore more layers, concepts, and production patterns.