249,456
Molecules
~23
Avg Nodes
1
Node Features
Regression
Task
What ZINC contains
ZINC is a subset of the ZINC database of commercially available chemical compounds. The dataset contains 249,456 molecular graphs (with a standard 12K subset for benchmarking). Each molecule averages ~23 atoms (nodes). Node features are minimal: a single categorical value encoding atom type (C, N, O, etc.). The regression target is penalized logP -- a constrained version of the water-octanol partition coefficient measuring lipophilicity, used as a drug-likeness metric.
The deliberate feature sparsity is what makes ZINC important. On QM9, rich 11-dimensional features and 3D coordinates give models abundant information. ZINC strips this away. The model must learn to predict lipophilicity from molecular topology and atom types alone. This isolates the GNN's structural learning capability.
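In code, ZINC's minimal node features mean each atom is just one integer index; a model's first step is typically an embedding lookup before any message passing. A minimal sketch (the vocabulary size and hidden dimension here are illustrative, not the benchmark's exact configuration):

```python
import torch
import torch.nn as nn

# ZINC gives each node a single categorical feature: an atom-type index.
# The usual first layer maps that index to a learned vector.
num_atom_types = 28   # assumed atom-type vocabulary size
hidden_dim = 64

embed = nn.Embedding(num_atom_types, hidden_dim)

# A toy molecule with 5 atoms, each described only by its type index
atom_types = torch.tensor([0, 1, 0, 2, 0])
h = embed(atom_types)   # (5, 64) initial node states for message passing
print(h.shape)          # torch.Size([5, 64])
```

Everything the model learns beyond this embedding must come from the graph structure itself, which is exactly the point of the benchmark.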
Why ZINC matters
ZINC has become the primary benchmark for GNN expressiveness research. The 500K parameter budget creates a controlled experiment: given the same compute budget, which architecture extracts the most from graph structure? The results are striking: GCN achieves MAE 0.367, GIN drops to 0.163, PNA reaches 0.099, and GPS (graph transformer) hits 0.070. This 5x improvement from GCN to GPS demonstrates that architecture design has a massive impact when features are limited.
For practitioners, ZINC answers a practical question: is it worth upgrading from a simple GCN to a more complex architecture? On feature-rich datasets, the answer is often no. On ZINC, where structure is all you have, the answer is definitively yes.
Loading ZINC in PyG
from torch_geometric.datasets import ZINC
# Standard 12K benchmark subset
train_dataset = ZINC(root='/tmp/ZINC', subset=True, split='train')
val_dataset = ZINC(root='/tmp/ZINC', subset=True, split='val')
test_dataset = ZINC(root='/tmp/ZINC', subset=True, split='test')
print(f"Train: {len(train_dataset)}") # 10000
print(f"Val: {len(val_dataset)}") # 1000
print(f"Test: {len(test_dataset)}") # 1000
mol = train_dataset[0]
print(f"Atoms: {mol.num_nodes}, Target: {mol.y.item():.3f}")
The 12K subset (10K/1K/1K split) is the standard benchmark. Use subset=False for the full 249K dataset.
Common tasks and benchmarks
Penalized logP regression on the 12K subset with a ~500K parameter budget. Standard results (from the Benchmarking GNNs paper): GCN: 0.367 MAE, GIN: 0.163, GAT: 0.384, GraphSAGE: 0.398, PNA: 0.099, GPS: 0.070. The trend is clear: methods that combine multiple aggregation functions (PNA), positional encodings, and global attention (GPS) dramatically outperform simple mean aggregation (GCN) on structure-dependent tasks.
Example: drug-likeness screening
Lipophilicity (measured by logP) is a critical drug property: compounds that are too hydrophobic or too hydrophilic fail as drugs regardless of their target binding. Pharmaceutical companies screen millions of candidates for drug-like properties early in the pipeline. A GNN trained on ZINC-style data predicts penalized logP from molecular structure in milliseconds, filtering out unsuitable candidates before expensive synthesis.
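The screening step itself is a simple filter once predictions exist. A hypothetical sketch (the thresholds, molecule names, and predicted values below are illustrative, not from any real pipeline):

```python
# Keep candidates whose predicted penalized logP falls in a drug-like
# window; the bounds here are illustrative placeholders.
def drug_like(predicted_plogp, lo=-2.0, hi=5.0):
    return lo <= predicted_plogp <= hi

# Hypothetical model outputs: name -> predicted penalized logP
candidates = {"mol_a": 3.1, "mol_b": 8.7, "mol_c": -4.2}

passed = [name for name, p in candidates.items() if drug_like(p)]
print(passed)  # ['mol_a']
```

Because the GNN prediction is cheap, a filter like this can run over millions of candidates before any compound reaches synthesis.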
Published benchmark results
Results on the ZINC 12K subset with ~500K parameter budget. Metric is test MAE on penalized logP regression. Lower is better.
| Method | Test MAE | Year | Paper |
|---|---|---|---|
| GCN | 0.367 | 2020 | Dwivedi et al. |
| GAT | 0.384 | 2020 | Dwivedi et al. |
| GraphSAGE | 0.398 | 2020 | Dwivedi et al. |
| GIN | 0.163 | 2020 | Dwivedi et al. |
| GatedGCN | 0.282 | 2020 | Dwivedi et al. |
| PNA | 0.099 | 2020 | Corso et al. |
| SAN | 0.139 | 2022 | Kreuzer et al. |
| GPS | 0.070 | 2022 | Rampasek et al. |
Original Paper
Benchmarking Graph Neural Networks
V. P. Dwivedi, C. K. Joshi, A. T. Luu, T. Laurent, Y. Bengio, X. Bresson (2023). Journal of Machine Learning Research
Original data source
The ZINC subset used in GNN benchmarking is derived from the ZINC database of commercially available chemical compounds. The benchmark subset was curated by Dwivedi et al. and is available via the benchmarking-gnns repository.
@article{dwivedi2023benchmarking,
title={Benchmarking Graph Neural Networks},
author={Dwivedi, Vijay Prakash and Joshi, Chaitanya K and Luu, Anh Tuan and Laurent, Thomas and Bengio, Yoshua and Bresson, Xavier},
journal={Journal of Machine Learning Research},
volume={24},
number={43},
pages={1--48},
year={2023}
}
BibTeX citation for the ZINC GNN benchmark.
Which dataset should I use?
ZINC vs QM9: ZINC tests architecture expressiveness with minimal features (1 categorical). QM9 has rich 11D features plus 3D coordinates. Use ZINC to evaluate architecture design; use QM9 for molecular property prediction.
ZINC vs PATTERN: Both test GNN expressiveness, but ZINC is a graph-level regression task on molecular graphs, while PATTERN is a node-level binary classification on synthetic graphs. ZINC is more popular in the chemistry ML community.
ZINC 12K vs ZINC full: The 12K subset is the standard benchmark with the 500K parameter budget. The full 249K dataset is used when you want more training data or are not comparing to published baselines.
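Staying within the ~500K parameter budget is easy to verify: count the trainable parameters of your model before comparing to published baselines. A sketch with a stand-in module (not an actual benchmark GNN):

```python
import torch.nn as nn

# Placeholder model used only to demonstrate parameter counting
model = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 1))

# Sum the element counts of all trainable tensors
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(n_params)  # 16897  (64*256 + 256 + 256*1 + 1)
```

Results reported outside the budget are not comparable to the leaderboard numbers above, since extra capacity can mask architectural weaknesses.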
From benchmark to production
Production molecular property prediction uses richer features than ZINC allows (3D coordinates, bond types, charge distributions). But ZINC's lesson carries over: architectural expressiveness matters. The same graph transformer architectures that top the ZINC leaderboard also perform well on richer datasets. Investing in expressive architectures pays off regardless of feature richness.