188
Graphs
~18
Avg Nodes
7
Node Features
2
Classes
What MUTAG contains
MUTAG is a dataset of 188 nitroaromatic compounds tested for mutagenicity on Salmonella typhimurium. Each compound is represented as a molecular graph: atoms are nodes, chemical bonds are edges. Nodes have 7-dimensional one-hot features encoding atom type (C, N, O, F, I, Cl, Br). The binary label indicates whether the compound is mutagenic (causes genetic mutations) or not.
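The 7-dimensional one-hot encoding can be sketched in a few lines. This is an illustrative sketch, not the dataset's actual preprocessing code; the atom-type ordering below follows the list in the text (C, N, O, F, I, Cl, Br), and the index assignment in the released files may differ.

```python
# Sketch: a 7-dim one-hot node feature, one dimension per atom type.
# Ordering is an assumption taken from the text, not verified against the raw data.
ATOM_TYPES = ["C", "N", "O", "F", "I", "Cl", "Br"]

def atom_one_hot(symbol: str) -> list:
    """Return the 7-dimensional one-hot feature vector for an atom symbol."""
    vec = [0] * len(ATOM_TYPES)
    vec[ATOM_TYPES.index(symbol)] = 1
    return vec

# The nitrogen of a nitro group:
print(atom_one_hot("N"))  # [0, 1, 0, 0, 0, 0, 0]
```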
Molecules are small: the average graph has ~18 atoms and ~20 bonds (~40 directed edges as stored by PyG). This makes MUTAG trivially fast to process. The entire dataset fits in memory thousands of times over. Its value is pedagogical and diagnostic, not as a serious benchmark for molecular property prediction.
Why MUTAG matters
MUTAG is the entry point to graph-level prediction. On Cora or Reddit, you predict a label for each node. On MUTAG, you predict a single label for each graph. This requires a fundamentally different pipeline: after GNN message passing, you need a readout function that aggregates all node embeddings into one graph-level representation (global mean pooling, global max pooling, or Set2Set attention).
MUTAG also introduces molecular graph ML, a field with enormous industrial impact. Pharmaceutical companies use graph neural networks to predict drug properties (toxicity, solubility, binding affinity) from molecular structure. MUTAG's mutagenicity task is a simplified version of these production drug discovery tasks.
Loading MUTAG in PyG
from torch_geometric.datasets import TUDataset
from torch_geometric.loader import DataLoader
dataset = TUDataset(root='/tmp/MUTAG', name='MUTAG')
print(f"Graphs: {len(dataset)}") # 188
print(f"Features: {dataset.num_features}") # 7
print(f"Classes: {dataset.num_classes}") # 2
# Each element is a separate graph
graph = dataset[0]
print(f"Nodes: {graph.num_nodes}, Edges: {graph.num_edges}")
# DataLoader batches multiple graphs together
loader = DataLoader(dataset, batch_size=32, shuffle=True)
DataLoader handles batching multiple graphs. Each batch is a disjoint union of graphs.
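To see what "disjoint union" means concretely, here is a hedged pure-Python sketch of the core trick: each graph's node indices are shifted by the total node count of the graphs before it, so the merged edge list describes one big graph with no cross-graph edges. This is the essence of what PyG's batching does internally, not its actual implementation.

```python
def batch_edge_indices(edge_lists, num_nodes_list):
    """Merge per-graph edge lists into one edge list over a disjoint union.
    Nodes of graph g are offset by the node count of graphs 0..g-1.
    Also returns the batch vector: batch[i] = graph index of node i."""
    merged, batch, offset = [], [], 0
    for g, (edges, n) in enumerate(zip(edge_lists, num_nodes_list)):
        merged.extend((src + offset, dst + offset) for src, dst in edges)
        batch.extend([g] * n)
        offset += n
    return merged, batch

# Graph 0: 3 nodes, one bond; graph 1: 2 nodes, one bond.
edges, batch = batch_edge_indices([[(0, 1)], [(0, 1)]], [3, 2])
print(edges)  # [(0, 1), (3, 4)]
print(batch)  # [0, 0, 0, 1, 1]
```

Because no edges cross graph boundaries, message passing on the batched graph is exactly equivalent to running it on each graph separately.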
Original Paper
Structure-Activity Relationship of Mutagenic Aromatic and Heteroaromatic Nitro Compounds
Asim Kumar Debnath, Rosa L. Lopez de Compadre, Gargi Debnath, Alan J. Shusterman, Corwin Hansch (1991). Journal of Medicinal Chemistry, 34(2), 786-797
Benchmark comparison (10-fold cross-validation)
| Method | Accuracy | Year | Paper |
|---|---|---|---|
| WL kernel | ~84.1% | 2011 | Shervashidze et al. |
| GCN + mean pool | ~85.6% | 2017 | Kipf & Welling |
| GraphSAGE | ~85.1% | 2017 | Hamilton et al. |
| GIN | 89.4% | 2019 | Xu et al. |
| PNA | ~90.0% | 2020 | Corso et al. |
Which graph classification dataset should I use?
MUTAG (188 graphs, 2 classes) is the fastest sanity check for graph classification code. Use it first to verify your pipeline. ENZYMES (600 graphs, 6 classes) is harder (typical accuracies around 50%) and is a better test of GNN expressiveness. PROTEINS (1,113 graphs, 2 classes) is the largest of the three, with more reliable evaluation due to the larger graph count. For molecular-specific benchmarking beyond these TUDatasets, use QM9 (130K molecules, regression) or ZINC (250K molecules, regression).
Common tasks and benchmarks
Binary graph classification with 10-fold cross-validation. The small dataset size means results vary significantly across folds and random seeds. GCN with global mean pooling: ~85-88%. GIN (Graph Isomorphism Network): ~89-92%. GraphSAGE: ~85-87%. GIN's higher expressiveness (it can distinguish more graph structures) gives it an edge on molecular tasks.
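The 10-fold protocol can be sketched with the standard library alone. This is a plain shuffled split for illustration; published MUTAG results typically use stratified folds (e.g. `sklearn.model_selection.StratifiedKFold`), which this sketch does not implement.

```python
import random

def k_fold_indices(n: int, k: int = 10, seed: int = 0):
    """Shuffle indices 0..n-1 and yield (train, test) index lists for k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    for f in range(k):
        test = idx[f::k]  # every k-th shuffled index -> near-equal fold sizes
        test_set = set(test)
        train = [i for i in idx if i not in test_set]
        yield train, test

folds = list(k_fold_indices(188, k=10))
print(len(folds))                      # 10
print(sum(len(t) for _, t in folds))   # 188: every graph is tested exactly once
```

Because 188 graphs split into folds of only 18-19 test graphs, a single misclassified molecule moves fold accuracy by over 5 points, which is why reporting mean and standard deviation across folds (and ideally across seeds) matters here.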
Data source
MUTAG is part of the TUDataset collection. The original data is available from the TUDataset benchmark suite. PyG downloads it automatically via the TUDataset loader.
BibTeX citation
@article{debnath1991structure,
title={Structure-Activity Relationship of Mutagenic Aromatic and Heteroaromatic Nitro Compounds. Correlation with Molecular Orbital Energies and Hydrophobicity},
author={Debnath, Asim Kumar and Lopez de Compadre, Rosa L. and Debnath, Gargi and Shusterman, Alan J. and Hansch, Corwin},
journal={Journal of Medicinal Chemistry},
volume={34},
number={2},
pages={786--797},
year={1991}
}
@article{morris2020tudataset,
title={TUDataset: A collection of benchmark datasets for learning with graphs},
author={Morris, Christopher and Kriege, Nils M. and Bause, Franka and Kersting, Kristian and Mutzel, Petra and Neumann, Marion},
journal={arXiv preprint arXiv:2007.08663},
year={2020}
}
Cite Debnath et al. for the original dataset, Morris et al. for the TUDataset benchmark collection.
Example: drug toxicity screening
Pharmaceutical companies screen thousands of candidate molecules for toxicity before clinical trials. Each molecule is a graph. A GNN trained on historical toxicity data predicts whether new candidates are likely toxic, saving months of laboratory testing. MUTAG's mutagenicity prediction is a simplified version of this pipeline. Production drug discovery uses larger datasets (QM9 with 130K molecules, ZINC with 250K) and predicts multiple properties simultaneously.
From benchmark to production
Production molecular property prediction uses 3D coordinates (not just 2D connectivity), edge features (bond type, stereochemistry), and multi-task outputs (predict toxicity, solubility, and binding affinity simultaneously). The scale jumps from 188 molecules to billions of candidates generated by combinatorial chemistry. But the core approach -- GNN message passing on molecular graphs followed by graph-level readout -- remains the same.