188
Graphs
~18
Avg Nodes
7
Node Features
2
Classes
What MUTAG contains
MUTAG is a dataset of 188 nitroaromatic compounds tested for mutagenicity on Salmonella typhimurium. Each compound is represented as a molecular graph: atoms are nodes, chemical bonds are edges. Nodes have 7-dimensional one-hot features encoding atom type (C, N, O, F, I, Cl, Br). The binary label indicates whether the compound is mutagenic (causes genetic mutations) or not.
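The 7-dimensional one-hot encoding can be sketched in a few lines. This is an illustrative sketch, not the dataset's actual preprocessing code; the atom-type ordering below follows the list in the text (C, N, O, F, I, Cl, Br), and the index assignment in the released files may differ.

```python
# Sketch: a 7-dim one-hot node feature, one dimension per atom type.
# Ordering is an assumption taken from the text, not verified against the raw data.
ATOM_TYPES = ["C", "N", "O", "F", "I", "Cl", "Br"]

def atom_one_hot(symbol: str) -> list:
    """Return the 7-dimensional one-hot feature vector for an atom symbol."""
    vec = [0] * len(ATOM_TYPES)
    vec[ATOM_TYPES.index(symbol)] = 1
    return vec

# The nitrogen of a nitro group:
print(atom_one_hot("N"))  # [0, 1, 0, 0, 0, 0, 0]
```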
Molecules are small: the average graph has ~18 atoms and ~20 bonds (~40 directed edges as stored by PyG). This makes MUTAG trivially fast to process. The entire dataset fits in memory thousands of times over. Its value is pedagogical and diagnostic, not as a serious benchmark for molecular property prediction.
Why MUTAG matters
MUTAG is the entry point to graph-level prediction. On Cora or Reddit, you predict a label for each node. On MUTAG, you predict a single label for each graph. This requires a fundamentally different pipeline: after GNN message passing, you need a readout function that aggregates all node embeddings into one graph-level representation (global mean pooling, global max pooling, or Set2Set attention).
MUTAG also introduces molecular graph ML, a field with enormous industrial impact. Pharmaceutical companies use graph neural networks to predict drug properties (toxicity, solubility, binding affinity) from molecular structure. MUTAG's mutagenicity task is a simplified version of these production drug discovery tasks.
Loading MUTAG in PyG
from torch_geometric.datasets import TUDataset
from torch_geometric.loader import DataLoader
dataset = TUDataset(root='/tmp/MUTAG', name='MUTAG')
print(f"Graphs: {len(dataset)}") # 188
print(f"Features: {dataset.num_features}") # 7
print(f"Classes: {dataset.num_classes}") # 2
# Each element is a separate graph
graph = dataset[0]
print(f"Nodes: {graph.num_nodes}, Edges: {graph.num_edges}")
# DataLoader batches multiple graphs together
loader = DataLoader(dataset, batch_size=32, shuffle=True)
DataLoader handles batching multiple graphs. Each batch is a disjoint union of graphs.
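To see what "disjoint union" means concretely, here is a hedged pure-Python sketch of the core trick: each graph's node indices are shifted by the total node count of the graphs before it, so the merged edge list describes one big graph with no cross-graph edges. This is the essence of what PyG's batching does internally, not its actual implementation.

```python
def batch_edge_indices(edge_lists, num_nodes_list):
    """Merge per-graph edge lists into one edge list over a disjoint union.
    Nodes of graph g are offset by the node count of graphs 0..g-1.
    Also returns the batch vector: batch[i] = graph index of node i."""
    merged, batch, offset = [], [], 0
    for g, (edges, n) in enumerate(zip(edge_lists, num_nodes_list)):
        merged.extend((src + offset, dst + offset) for src, dst in edges)
        batch.extend([g] * n)
        offset += n
    return merged, batch

# Graph 0: 3 nodes, one bond; graph 1: 2 nodes, one bond.
edges, batch = batch_edge_indices([[(0, 1)], [(0, 1)]], [3, 2])
print(edges)  # [(0, 1), (3, 4)]
print(batch)  # [0, 0, 0, 1, 1]
```

Because no edges cross graph boundaries, message passing on the batched graph is exactly equivalent to running it on each graph separately.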
Original Paper
Structure-Activity Relationship of Mutagenic Aromatic and Heteroaromatic Nitro Compounds
Asim Kumar Debnath, Rosa L. Lopez de Compadre, Gargi Debnath, Alan J. Shusterman, Corwin Hansch (1991). Journal of Medicinal Chemistry, 34(2), 786-797
Benchmark comparison (10-fold cross-validation)
| Method | Accuracy | Year | Paper |
|---|---|---|---|
| WL kernel | ~84.1% | 2011 | Shervashidze et al. |
| GCN + mean pool | ~85.6% | 2017 | Kipf & Welling |
| GraphSAGE | ~85.1% | 2017 | Hamilton et al. |
| GIN | 89.4% | 2019 | Xu et al. |
| PNA | ~90.0% | 2020 | Corso et al. |
Which graph classification dataset should I use?
MUTAG (188 graphs, 2 classes) is the fastest sanity check for graph classification code. Use it first to verify your pipeline. ENZYMES (600 graphs, 6 classes) is harder (typical accuracies around 50%) and is a better test of GNN expressiveness. PROTEINS (1,113 graphs, 2 classes) is the largest of the three, with more reliable evaluation due to the larger graph count. For molecular-specific benchmarking beyond these TUDatasets, use QM9 (130K molecules, regression) or ZINC (250K molecules, regression).
Common tasks and benchmarks
Binary graph classification with 10-fold cross-validation. The small dataset size means results vary significantly across folds and random seeds. GCN with global mean pooling: ~85-88%. GIN (Graph Isomorphism Network): ~89-92%. GraphSAGE: ~85-87%. GIN's higher expressiveness (it can distinguish more graph structures) gives it an edge on molecular tasks.
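The 10-fold protocol can be sketched with the standard library alone. This is a plain shuffled split for illustration; published MUTAG results typically use stratified folds (e.g. `sklearn.model_selection.StratifiedKFold`), which this sketch does not implement.

```python
import random

def k_fold_indices(n: int, k: int = 10, seed: int = 0):
    """Shuffle indices 0..n-1 and yield (train, test) index lists for k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    for f in range(k):
        test = idx[f::k]  # every k-th shuffled index -> near-equal fold sizes
        test_set = set(test)
        train = [i for i in idx if i not in test_set]
        yield train, test

folds = list(k_fold_indices(188, k=10))
print(len(folds))                      # 10
print(sum(len(t) for _, t in folds))   # 188: every graph is tested exactly once
```

Because 188 graphs split into folds of only 18-19 test graphs, a single misclassified molecule moves fold accuracy by over 5 points, which is why reporting mean and standard deviation across folds (and ideally across seeds) matters here.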
Data source
MUTAG is part of the TUDataset collection. The original data is available from the TUDataset benchmark suite. PyG downloads it automatically via the TUDataset loader.
BibTeX citation
@article{debnath1991structure,
title={Structure-Activity Relationship of Mutagenic Aromatic and Heteroaromatic Nitro Compounds. Correlation with Molecular Orbital Energies and Hydrophobicity},
author={Debnath, Asim Kumar and Lopez de Compadre, Rosa L. and Debnath, Gargi and Shusterman, Alan J. and Hansch, Corwin},
journal={Journal of Medicinal Chemistry},
volume={34},
number={2},
pages={786--797},
year={1991}
}
@article{morris2020tudataset,
title={TUDataset: A collection of benchmark datasets for learning with graphs},
author={Morris, Christopher and Kriege, Nils M. and Bause, Franka and Kersting, Kristian and Mutzel, Petra and Neumann, Marion},
journal={arXiv preprint arXiv:2007.08663},
year={2020}
}
Cite Debnath et al. for the original dataset, Morris et al. for the TUDataset benchmark collection.
Example: drug toxicity screening
Pharmaceutical companies screen thousands of candidate molecules for toxicity before clinical trials. Each molecule is a graph. A GNN trained on historical toxicity data predicts whether new candidates are likely toxic, saving months of laboratory testing. MUTAG's mutagenicity prediction is a simplified version of this pipeline. Production drug discovery uses larger datasets (QM9 with 130K molecules, ZINC with 250K) and predicts multiple properties simultaneously.
From benchmark to production
Production molecular property prediction uses 3D coordinates (not just 2D connectivity), edge features (bond type, stereochemistry), and multi-task outputs (predict toxicity, solubility, and binding affinity simultaneously). The scale jumps from 188 molecules to billions of candidates generated by combinatorial chemistry. But the core approach -- GNN message passing on molecular graphs followed by graph-level readout -- remains the same.