ENZYMES: A Hard Multi-Class Benchmark for Graph Classification

ENZYMES is a dataset of 600 protein structure graphs classified into 6 enzyme types. It is harder than MUTAG, with lower accuracy across all GNN architectures, testing whether models can learn subtle structural differences between enzyme classes.


TL;DR

  • ENZYMES has 600 enzyme graphs averaging 33 nodes and 124 edges each. Node features are 3-dimensional. The 6-class task classifies enzymes by their EC top-level number.
  • This is a hard benchmark: GCN achieves only ~40-50%, GIN ~50-60%. The low accuracy reflects subtle structural differences between enzyme classes.
  • Larger than MUTAG (600 vs 188 graphs) with more classes (6 vs 2), ENZYMES is a more meaningful test of graph classification capabilities.
  • Enzyme classification is relevant to drug discovery and protein engineering, where understanding protein function from structure has direct commercial value.

600 graphs · ~33 avg nodes · 3 node features · 6 classes

What ENZYMES contains

ENZYMES contains 600 protein tertiary structure graphs from the BRENDA enzyme database. Each graph represents one enzyme. Nodes are secondary structure elements (helices, sheets, turns) with 3 node attributes by default (6 with use_node_attr=True, adding chemical and physical properties). Edges connect spatially or sequentially adjacent elements, averaging ~62 undirected edges (~124 directed edges in PyG) per graph. The 6 classes correspond to the top-level EC (Enzyme Commission) numbers: Oxidoreductases, Transferases, Hydrolases, Lyases, Isomerases, and Ligases.
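A note on the edge counts above: PyG stores each undirected edge as two directed entries in edge_index, which is why ~62 undirected edges per graph show up as ~124 edges. A toy sketch of that convention (plain Python, not the real dataset):

```python
# Toy graph: a triangle with 3 undirected edges.
undirected_edges = [(0, 1), (1, 2), (2, 0)]

# PyG's edge_index convention: store both directions of every edge.
directed_edges = []
for u, v in undirected_edges:
    directed_edges.append((u, v))
    directed_edges.append((v, u))

print(len(undirected_edges), len(directed_edges))  # 3 6
```

The same doubling explains why a dataset summary may quote 62 or 124 edges per graph depending on which convention it counts.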

Why ENZYMES matters

ENZYMES tests GNN expressiveness. On MUTAG, even simple GCN achieves 85%+. On ENZYMES, the same architecture drops to ~45%. The difficulty comes from the task itself: different enzyme types can have similar overall structures but differ in specific local motifs (active sites, binding pockets). Detecting these fine-grained structural patterns requires more expressive GNN layers.

This makes ENZYMES a valuable diagnostic: if your GNN improvement helps on MUTAG but not on ENZYMES, it may be capturing easy patterns while missing harder structural features. GIN (Graph Isomorphism Network) outperforms GCN on ENZYMES by a wider margin than on MUTAG, confirming that expressiveness matters more on harder tasks.
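The expressiveness gap has a concrete source: GIN aggregates neighbor features by sum, which is injective over multisets, while mean aggregation collapses neighborhoods that differ only in multiplicity. A minimal illustration in plain Python (toy feature values, not ENZYMES data):

```python
# Two neighborhoods with the same feature value but different sizes:
# one neighbor with feature 1.0 vs. two neighbors with feature 1.0.
neigh_a = [1.0]
neigh_b = [1.0, 1.0]

# Mean aggregation: both neighborhoods look identical.
mean_a = sum(neigh_a) / len(neigh_a)  # 1.0
mean_b = sum(neigh_b) / len(neigh_b)  # 1.0 -> multiplicity lost

# Sum aggregation (GIN-style): the neighborhoods stay distinguishable.
sum_a = sum(neigh_a)  # 1.0
sum_b = sum(neigh_b)  # 2.0 -> multiplicity preserved

print(mean_a == mean_b, sum_a == sum_b)  # True False
```

On a task like ENZYMES, where classes differ in local motifs rather than global shape, losing this kind of neighborhood detail costs accuracy.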

Loading ENZYMES in PyG

load_enzymes.py
from torch_geometric.datasets import TUDataset
from torch_geometric.loader import DataLoader

dataset = TUDataset(root='/tmp/ENZYMES', name='ENZYMES')
print(f"Graphs: {len(dataset)}")          # 600
print(f"Features: {dataset.num_features}")  # 3
print(f"Classes: {dataset.num_classes}")    # 6

loader = DataLoader(dataset, batch_size=32, shuffle=True)

Standard TUDataset API. ENZYMES has no official train/test split, so use 10-fold cross-validation for evaluation.
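One way to set up the 10-fold protocol: shuffle the 600 graph indices and slice them into ten folds of 60, holding each fold out once as the test set. A minimal sketch in plain Python (in practice, scikit-learn's StratifiedKFold is the common choice to balance classes per fold):

```python
import random

num_graphs = 600  # size of ENZYMES
num_folds = 10

# Shuffle indices with a fixed seed so folds are reproducible.
indices = list(range(num_graphs))
random.Random(0).shuffle(indices)

fold_size = num_graphs // num_folds  # 60 graphs per fold
folds = [indices[i * fold_size:(i + 1) * fold_size] for i in range(num_folds)]

# Each fold serves once as the test set; the other nine form the training set.
for k in range(num_folds):
    test_idx = folds[k]
    train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
    assert len(test_idx) == 60 and len(train_idx) == 540
```

Report the mean and standard deviation of accuracy across the ten test folds, as the benchmark numbers below do.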

Original Paper

Protein Function Prediction via Graph Kernels

Karsten M. Borgwardt, Cheng Soon Ong, Stefan Schönauer, S.V.N. Vishwanathan, Alex J. Smola, Hans-Peter Kriegel (2005). Bioinformatics, 21(Suppl 1), i47-i56

Benchmark comparison (10-fold cross-validation)

Method           Accuracy  Year  Paper
WL kernel        ~53.2%    2011  Shervashidze et al.
GCN + mean pool  ~44.8%    2017  Kipf & Welling
GIN              ~59.6%    2019  Xu et al.
PNA              ~62.5%    2020  Corso et al.
CIN              ~66.0%    2021  Bodnar et al.

Which graph classification dataset should I use?

MUTAG (188 graphs, 2 classes) is the quickest sanity check -- if your model fails here, the code is broken. ENZYMES (600 graphs, 6 classes) is the hardest of the three TUDataset classics (~50% accuracy) and tests GNN expressiveness. PROTEINS (1,113 graphs, 2 classes) has the most graphs and gives more reliable statistical estimates. Run all three: if your method helps on MUTAG but not ENZYMES, it may only capture easy patterns.

Common tasks and benchmarks

6-class graph classification with 10-fold cross-validation. GCN with global mean pooling: ~40-50%. GIN: ~50-60%. PNA (Principal Neighbourhood Aggregation): ~55-65%. The variance is high even with 10-fold CV due to the small dataset size. Methods that achieve above 65% typically use data augmentation or pretraining strategies.
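Global mean pooling, used by the GCN baseline above, averages node embeddings per graph using a batch assignment vector, the same scheme PyG's global_mean_pool applies to tensors. A pure-Python sketch with toy 2-dimensional features:

```python
def global_mean_pool(node_feats, batch):
    """Average node feature vectors per graph.

    node_feats: one feature vector (list of floats) per node.
    batch: batch[i] is the index of the graph node i belongs to,
           as in PyG's Data.batch produced by DataLoader.
    """
    num_graphs = max(batch) + 1
    dim = len(node_feats[0])
    sums = [[0.0] * dim for _ in range(num_graphs)]
    counts = [0] * num_graphs
    for feats, g in zip(node_feats, batch):
        counts[g] += 1
        for d in range(dim):
            sums[g][d] += feats[d]
    return [[s / counts[g] for s in sums[g]] for g in range(num_graphs)]

# Two graphs in one batch: nodes 0-1 belong to graph 0, node 2 to graph 1.
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
batch = [0, 0, 1]
print(global_mean_pool(x, batch))  # [[2.0, 3.0], [5.0, 6.0]]
```

The pooled vector per graph then feeds a linear classifier over the 6 enzyme classes.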

Data source

ENZYMES is part of the TUDataset collection and originally from the BRENDA enzyme database. The graph version is available from the TUDataset benchmark suite. PyG downloads it automatically.

BibTeX citation

enzymes.bib
@article{borgwardt2005protein,
  title={Protein Function Prediction via Graph Kernels},
  author={Borgwardt, Karsten M. and Ong, Cheng Soon and Sch{\"o}nauer, Stefan and Vishwanathan, S.V.N. and Smola, Alex J. and Kriegel, Hans-Peter},
  journal={Bioinformatics},
  volume={21},
  number={Suppl 1},
  pages={i47--i56},
  year={2005}
}

@article{morris2020tudataset,
  title={TUDataset: A collection of benchmark datasets for learning with graphs},
  author={Morris, Christopher and Kriege, Nils M. and Bause, Franka and Kersting, Kristian and Mutzel, Petra and Neumann, Marion},
  journal={arXiv preprint arXiv:2007.08663},
  year={2020}
}

Cite Borgwardt et al. for the ENZYMES dataset, Morris et al. for the TUDataset collection.

Example: protein function prediction

Biotech companies need to predict enzyme function from structure to design better catalysts for industrial processes. An enzyme that efficiently breaks down plastic waste could be worth billions. GNN-based function prediction accelerates this search by screening candidate protein structures computationally before expensive laboratory synthesis. ENZYMES benchmarks the core classification capability underlying these industrial applications.

From benchmark to production

Production protein function prediction uses 3D coordinates, amino acid sequences, evolutionary information (multiple sequence alignments), and datasets thousands of times larger than ENZYMES. AlphaFold and ESMFold have transformed the field by predicting structure from sequence. The next frontier is predicting function from predicted structure -- where graph neural networks play a central role.

Frequently asked questions

What is the ENZYMES dataset?

ENZYMES is a collection of 600 protein tertiary structure graphs from the BRENDA enzyme database. Each graph represents an enzyme with nodes (avg 33) as secondary structure elements and edges (avg 124) as spatial or sequential connections. The 6-class task is to classify enzymes by their EC top-level number.

How do I load ENZYMES in PyTorch Geometric?

Use `from torch_geometric.datasets import TUDataset; dataset = TUDataset(root='/tmp/ENZYMES', name='ENZYMES')`. Each element is a separate enzyme graph.

How does ENZYMES compare to MUTAG?

ENZYMES has more graphs (600 vs 188), more classes (6 vs 2), slightly larger graphs (avg 33 nodes vs 18), and lower node feature dimension (3 vs 7). The multi-class task and larger dataset make ENZYMES a harder and more meaningful benchmark than MUTAG.

What accuracy should I expect on ENZYMES?

GCN achieves ~40-50% accuracy, GIN ~50-60%. Random guessing would give ~16.7% (1/6 classes). ENZYMES is considered a hard graph classification benchmark. Results above 65% are competitive. Always use 10-fold cross-validation.

Why is ENZYMES accuracy so low compared to MUTAG?

ENZYMES has 6 classes (vs 2), lower-dimensional features (3 vs 7), and the enzyme classes are structurally subtle -- different enzyme types can have similar tertiary structures. The task requires the GNN to learn fine-grained structural patterns that simple message passing may miss.

Learn more about graph ML

PyTorch Geometric is the open-source foundation for graph neural networks. Explore more layers, concepts, and production patterns.