ENZYMES: A Hard Multi-Class Benchmark for Graph Classification

ENZYMES is a dataset of 600 protein structure graphs classified into 6 enzyme types. It is harder than MUTAG, with lower accuracy across all GNN architectures, testing whether models can learn subtle structural differences between enzyme classes.


TL;DR

  • ENZYMES has 600 enzyme graphs averaging 33 nodes and 124 edges each. Node features are 3-dimensional. The 6-class task classifies enzymes by their EC top-level number.
  • This is a hard benchmark: GCN achieves only ~40-50%, GIN ~50-60%. The low accuracy reflects subtle structural differences between enzyme classes.
  • Larger than MUTAG (600 vs 188 graphs) with more classes (6 vs 2), ENZYMES is a more meaningful test of graph classification capabilities.
  • Enzyme classification is relevant to drug discovery and protein engineering, where understanding protein function from structure has direct commercial value.

600 graphs · ~33 avg nodes · 3 node features · 6 classes

What ENZYMES contains

ENZYMES contains 600 protein tertiary structure graphs from the BRENDA enzyme database. Each graph represents one enzyme. Nodes are secondary structure elements (helices, sheets, turns) with 3 node attributes by default (6 with use_node_attr=True, adding chemical and physical properties). Edges connect spatially or sequentially adjacent elements, averaging ~62 undirected edges (~124 directed edges in PyG) per graph. The 6 classes correspond to the top-level EC (Enzyme Commission) numbers: Oxidoreductases, Transferases, Hydrolases, Lyases, Isomerases, and Ligases.
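A note on the edge counts above: PyG stores each undirected edge as two directed entries in edge_index, which is why ~62 undirected edges per graph show up as ~124 edges. A toy sketch of that convention (plain Python, not the real dataset):

```python
# Toy graph: a triangle with 3 undirected edges.
undirected_edges = [(0, 1), (1, 2), (2, 0)]

# PyG's edge_index convention: store both directions of every edge.
directed_edges = []
for u, v in undirected_edges:
    directed_edges.append((u, v))
    directed_edges.append((v, u))

print(len(undirected_edges), len(directed_edges))  # 3 6
```

The same doubling explains why a dataset summary may quote 62 or 124 edges per graph depending on which convention it counts.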

Why ENZYMES matters

ENZYMES tests GNN expressiveness. On MUTAG, even simple GCN achieves 85%+. On ENZYMES, the same architecture drops to ~45%. The difficulty comes from the task itself: different enzyme types can have similar overall structures but differ in specific local motifs (active sites, binding pockets). Detecting these fine-grained structural patterns requires more expressive GNN layers.

This makes ENZYMES a valuable diagnostic: if your GNN improvement helps on MUTAG but not on ENZYMES, it may be capturing easy patterns while missing harder structural features. GIN (Graph Isomorphism Network) outperforms GCN on ENZYMES by a wider margin than on MUTAG, confirming that expressiveness matters more on harder tasks.
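The expressiveness gap has a concrete source: GIN aggregates neighbor features by sum, which is injective over multisets, while mean aggregation collapses neighborhoods that differ only in multiplicity. A minimal illustration in plain Python (toy feature values, not ENZYMES data):

```python
# Two neighborhoods with the same feature value but different sizes:
# one neighbor with feature 1.0 vs. two neighbors with feature 1.0.
neigh_a = [1.0]
neigh_b = [1.0, 1.0]

# Mean aggregation: both neighborhoods look identical.
mean_a = sum(neigh_a) / len(neigh_a)  # 1.0
mean_b = sum(neigh_b) / len(neigh_b)  # 1.0 -> multiplicity lost

# Sum aggregation (GIN-style): the neighborhoods stay distinguishable.
sum_a = sum(neigh_a)  # 1.0
sum_b = sum(neigh_b)  # 2.0 -> multiplicity preserved

print(mean_a == mean_b, sum_a == sum_b)  # True False
```

On a task like ENZYMES, where classes differ in local motifs rather than global shape, losing this kind of neighborhood detail costs accuracy.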

Loading ENZYMES in PyG

load_enzymes.py
from torch_geometric.datasets import TUDataset
from torch_geometric.loader import DataLoader

dataset = TUDataset(root='/tmp/ENZYMES', name='ENZYMES')
print(f"Graphs: {len(dataset)}")          # 600
print(f"Features: {dataset.num_features}")  # 3
print(f"Classes: {dataset.num_classes}")    # 6

loader = DataLoader(dataset, batch_size=32, shuffle=True)

Standard TUDataset API. ENZYMES has no official train/test split, so use 10-fold cross-validation for evaluation.
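One way to set up the 10-fold protocol: shuffle the 600 graph indices and slice them into ten folds of 60, holding each fold out once as the test set. A minimal sketch in plain Python (in practice, scikit-learn's StratifiedKFold is the common choice to balance classes per fold):

```python
import random

num_graphs = 600  # size of ENZYMES
num_folds = 10

# Shuffle indices with a fixed seed so folds are reproducible.
indices = list(range(num_graphs))
random.Random(0).shuffle(indices)

fold_size = num_graphs // num_folds  # 60 graphs per fold
folds = [indices[i * fold_size:(i + 1) * fold_size] for i in range(num_folds)]

# Each fold serves once as the test set; the other nine form the training set.
for k in range(num_folds):
    test_idx = folds[k]
    train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
    assert len(test_idx) == 60 and len(train_idx) == 540
```

Report the mean and standard deviation of accuracy across the ten test folds, as the benchmark numbers below do.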

Original Paper

Protein Function Prediction via Graph Kernels

Karsten M. Borgwardt, Cheng Soon Ong, Stefan Schönauer, S.V.N. Vishwanathan, Alex J. Smola, Hans-Peter Kriegel (2005). Bioinformatics, 21(Suppl 1), i47-i56

Benchmark comparison (10-fold cross-validation)

Method           Accuracy  Year  Paper
WL kernel        ~53.2%    2011  Shervashidze et al.
GCN + mean pool  ~44.8%    2017  Kipf & Welling
GIN              ~59.6%    2019  Xu et al.
PNA              ~62.5%    2020  Corso et al.
CIN              ~66.0%    2021  Bodnar et al.

Which graph classification dataset should I use?

MUTAG (188 graphs, 2 classes) is the quickest sanity check -- if your model fails here, the code is broken. ENZYMES (600 graphs, 6 classes) is the hardest of the three TUDataset classics (~50% accuracy) and tests GNN expressiveness. PROTEINS (1,113 graphs, 2 classes) has the most graphs and gives more reliable statistical estimates. Run all three: if your method helps on MUTAG but not ENZYMES, it may only capture easy patterns.

Common tasks and benchmarks

6-class graph classification with 10-fold cross-validation. GCN with global mean pooling: ~40-50%. GIN: ~50-60%. PNA (Principal Neighbourhood Aggregation): ~55-65%. The variance is high even with 10-fold CV due to the small dataset size. Methods that achieve above 65% typically use data augmentation or pretraining strategies.
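Global mean pooling, used by the GCN baseline above, averages node embeddings per graph using a batch assignment vector, the same scheme PyG's global_mean_pool applies to tensors. A pure-Python sketch with toy 2-dimensional features:

```python
def global_mean_pool(node_feats, batch):
    """Average node feature vectors per graph.

    node_feats: one feature vector (list of floats) per node.
    batch: batch[i] is the index of the graph node i belongs to,
           as in PyG's Data.batch produced by DataLoader.
    """
    num_graphs = max(batch) + 1
    dim = len(node_feats[0])
    sums = [[0.0] * dim for _ in range(num_graphs)]
    counts = [0] * num_graphs
    for feats, g in zip(node_feats, batch):
        counts[g] += 1
        for d in range(dim):
            sums[g][d] += feats[d]
    return [[s / counts[g] for s in sums[g]] for g in range(num_graphs)]

# Two graphs in one batch: nodes 0-1 belong to graph 0, node 2 to graph 1.
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
batch = [0, 0, 1]
print(global_mean_pool(x, batch))  # [[2.0, 3.0], [5.0, 6.0]]
```

The pooled vector per graph then feeds a linear classifier over the 6 enzyme classes.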

Data source

ENZYMES is part of the TUDataset collection and originally from the BRENDA enzyme database. The graph version is available from the TUDataset benchmark suite. PyG downloads it automatically.

BibTeX citation

enzymes.bib
@article{borgwardt2005protein,
  title={Protein Function Prediction via Graph Kernels},
  author={Borgwardt, Karsten M. and Ong, Cheng Soon and Sch{\"o}nauer, Stefan and Vishwanathan, S.V.N. and Smola, Alex J. and Kriegel, Hans-Peter},
  journal={Bioinformatics},
  volume={21},
  number={Suppl 1},
  pages={i47--i56},
  year={2005}
}

@article{morris2020tudataset,
  title={TUDataset: A collection of benchmark datasets for learning with graphs},
  author={Morris, Christopher and Kriege, Nils M. and Bause, Franka and Kersting, Kristian and Mutzel, Petra and Neumann, Marion},
  journal={arXiv preprint arXiv:2007.08663},
  year={2020}
}

Cite Borgwardt et al. for the ENZYMES dataset, Morris et al. for the TUDataset collection.

Example: protein function prediction

Biotech companies need to predict enzyme function from structure to design better catalysts for industrial processes. An enzyme that efficiently breaks down plastic waste could be worth billions. GNN-based function prediction accelerates this search by screening candidate protein structures computationally before expensive laboratory synthesis. ENZYMES benchmarks the core classification capability underlying these industrial applications.

From benchmark to production

Production protein function prediction uses 3D coordinates, amino acid sequences, evolutionary information (multiple sequence alignments), and datasets thousands of times larger than ENZYMES. AlphaFold and ESMFold have transformed the field by predicting structure from sequence. The next frontier is predicting function from predicted structure -- where graph neural networks play a central role.

Frequently asked questions

What is the ENZYMES dataset?

ENZYMES is a collection of 600 protein tertiary structure graphs from the BRENDA enzyme database. Each graph represents an enzyme with nodes (avg 33) as secondary structure elements and edges (avg 124) as spatial or sequential connections. The 6-class task is to classify enzymes by their EC top-level number.

How do I load ENZYMES in PyTorch Geometric?

Use `from torch_geometric.datasets import TUDataset; dataset = TUDataset(root='/tmp/ENZYMES', name='ENZYMES')`. Each element is a separate enzyme graph.

How does ENZYMES compare to MUTAG?

ENZYMES has more graphs (600 vs 188), more classes (6 vs 2), slightly larger graphs (avg 33 nodes vs 18), and lower node feature dimension (3 vs 7). The multi-class task and larger dataset make ENZYMES a harder and more meaningful benchmark than MUTAG.

What accuracy should I expect on ENZYMES?

GCN achieves ~40-50% accuracy, GIN ~50-60%. Random guessing would give ~16.7% (1/6 classes). ENZYMES is considered a hard graph classification benchmark. Results above 65% are competitive. Always use 10-fold cross-validation.

Why is ENZYMES accuracy so low compared to MUTAG?

ENZYMES has 6 classes (vs 2), lower-dimensional features (3 vs 7), and the enzyme classes are structurally subtle -- different enzyme types can have similar tertiary structures. The task requires the GNN to learn fine-grained structural patterns that simple message passing may miss.

Learn more about graph ML

PyTorch Geometric is the open-source foundation for graph neural networks. Explore more layers, concepts, and production patterns.