249,456
Molecules
~23
Avg Nodes
1
Node Features
Regression
Task
What ZINC contains
ZINC is a subset of the ZINC database of commercially available chemical compounds. The dataset contains 249,456 molecular graphs (with a standard 12K subset for benchmarking). Each molecule averages ~23 atoms (nodes). Node features are minimal: a single categorical value encoding atom type (C, N, O, etc.). The regression target is penalized logP -- a constrained version of the water-octanol partition coefficient measuring lipophilicity, used as a drug-likeness metric.
The deliberate feature sparsity is what makes ZINC important. On QM9, rich 11-dimensional features and 3D coordinates give models abundant information. ZINC strips this away. The model must learn to predict lipophilicity from molecular topology and atom types alone. This isolates the GNN's structural learning capability.
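In code, ZINC's minimal node features mean each atom is just one integer index; a model's first step is typically an embedding lookup before any message passing. A minimal sketch (the vocabulary size and hidden dimension here are illustrative, not the benchmark's exact configuration):

```python
import torch
import torch.nn as nn

# ZINC gives each node a single categorical feature: an atom-type index.
# The usual first layer maps that index to a learned vector.
num_atom_types = 28   # assumed atom-type vocabulary size
hidden_dim = 64

embed = nn.Embedding(num_atom_types, hidden_dim)

# A toy molecule with 5 atoms, each described only by its type index
atom_types = torch.tensor([0, 1, 0, 2, 0])
h = embed(atom_types)   # (5, 64) initial node states for message passing
print(h.shape)          # torch.Size([5, 64])
```

Everything the model learns beyond this embedding must come from the graph structure itself, which is exactly the point of the benchmark.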
Why ZINC matters
ZINC has become the primary benchmark for GNN expressiveness research. The 500K parameter budget creates a controlled experiment: given the same compute budget, which architecture extracts the most from graph structure? The results are striking: GCN achieves MAE 0.367, GIN drops to 0.163, PNA reaches 0.099, and GPS (graph transformer) hits 0.070. This 5x improvement from GCN to GPS demonstrates that architecture design has a massive impact when features are limited.
For practitioners, ZINC answers a practical question: is it worth upgrading from a simple GCN to a more complex architecture? On feature-rich datasets, the answer is often no. On ZINC, where structure is all you have, the answer is definitively yes.
Loading ZINC in PyG
from torch_geometric.datasets import ZINC
# Standard 12K benchmark subset
train_dataset = ZINC(root='/tmp/ZINC', subset=True, split='train')
val_dataset = ZINC(root='/tmp/ZINC', subset=True, split='val')
test_dataset = ZINC(root='/tmp/ZINC', subset=True, split='test')
print(f"Train: {len(train_dataset)}") # 10000
print(f"Val: {len(val_dataset)}") # 1000
print(f"Test: {len(test_dataset)}") # 1000
mol = train_dataset[0]
print(f"Atoms: {mol.num_nodes}, Target: {mol.y.item():.3f}")
The 12K subset (10K/1K/1K split) is the standard benchmark. Use subset=False for the full 249K dataset.
Common tasks and benchmarks
Penalized logP regression on the 12K subset with a ~500K parameter budget. Standard results (from the Benchmarking GNNs paper): GCN: 0.367 MAE, GIN: 0.163, GAT: 0.384, GraphSAGE: 0.398, PNA: 0.099, GPS: 0.070. The trend is clear: methods that combine multiple aggregation functions (PNA), positional encodings, and global attention (GPS) dramatically outperform simple mean aggregation (GCN) on structure-dependent tasks.
Example: drug-likeness screening
Lipophilicity (measured by logP) is a critical drug property: compounds that are too hydrophobic or too hydrophilic fail as drugs regardless of their target binding. Pharmaceutical companies screen millions of candidates for drug-like properties early in the pipeline. A GNN trained on ZINC-style data predicts penalized logP from molecular structure in milliseconds, filtering out unsuitable candidates before expensive synthesis.
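The screening step itself is a simple filter once predictions exist. A hypothetical sketch (the thresholds, molecule names, and predicted values below are illustrative, not from any real pipeline):

```python
# Keep candidates whose predicted penalized logP falls in a drug-like
# window; the bounds here are illustrative placeholders.
def drug_like(predicted_plogp, lo=-2.0, hi=5.0):
    return lo <= predicted_plogp <= hi

# Hypothetical model outputs: name -> predicted penalized logP
candidates = {"mol_a": 3.1, "mol_b": 8.7, "mol_c": -4.2}

passed = [name for name, p in candidates.items() if drug_like(p)]
print(passed)  # ['mol_a']
```

Because the GNN prediction is cheap, a filter like this can run over millions of candidates before any compound reaches synthesis.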
Published benchmark results
Results on the ZINC 12K subset with ~500K parameter budget. Metric is test MAE on penalized logP regression. Lower is better.
| Method | Test MAE | Year | Paper |
|---|---|---|---|
| GCN | 0.367 | 2020 | Dwivedi et al. |
| GAT | 0.384 | 2020 | Dwivedi et al. |
| GraphSAGE | 0.398 | 2020 | Dwivedi et al. |
| GIN | 0.163 | 2020 | Dwivedi et al. |
| GatedGCN | 0.282 | 2020 | Dwivedi et al. |
| PNA | 0.099 | 2020 | Corso et al. |
| SAN | 0.139 | 2022 | Kreuzer et al. |
| GPS | 0.070 | 2022 | Rampasek et al. |
Original Paper
Benchmarking Graph Neural Networks
V. P. Dwivedi, C. K. Joshi, A. T. Luu, T. Laurent, Y. Bengio, X. Bresson (2023). Journal of Machine Learning Research
Original data source
The ZINC subset used in GNN benchmarking is derived from the ZINC database of commercially available chemical compounds. The benchmark subset was curated by Dwivedi et al. and is available via the benchmarking-gnns repository.
@article{dwivedi2023benchmarking,
title={Benchmarking Graph Neural Networks},
author={Dwivedi, Vijay Prakash and Joshi, Chaitanya K and Luu, Anh Tuan and Laurent, Thomas and Bengio, Yoshua and Bresson, Xavier},
journal={Journal of Machine Learning Research},
volume={24},
number={43},
pages={1--48},
year={2023}
}
BibTeX citation for the ZINC GNN benchmark.
Which dataset should I use?
ZINC vs QM9: ZINC tests architecture expressiveness with minimal features (1 categorical). QM9 has rich 11D features plus 3D coordinates. Use ZINC to evaluate architecture design; use QM9 for molecular property prediction.
ZINC vs PATTERN: Both test GNN expressiveness, but ZINC is a graph-level regression task on molecular graphs, while PATTERN is a node-level binary classification on synthetic graphs. ZINC is more popular in the chemistry ML community.
ZINC 12K vs ZINC full: The 12K subset is the standard benchmark with the 500K parameter budget. The full 249K dataset is used when you want more training data or are not comparing to published baselines.
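Staying within the ~500K parameter budget is easy to verify: count the trainable parameters of your model before comparing to published baselines. A sketch with a stand-in module (not an actual benchmark GNN):

```python
import torch.nn as nn

# Placeholder model used only to demonstrate parameter counting
model = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 1))

# Sum the element counts of all trainable tensors
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(n_params)  # 16897  (64*256 + 256 + 256*1 + 1)
```

Results reported outside the budget are not comparable to the leaderboard numbers above, since extra capacity can mask architectural weaknesses.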
From benchmark to production
Production molecular property prediction uses richer features than ZINC allows (3D coordinates, bond types, charge distributions). But ZINC's lesson carries over: architectural expressiveness matters. The same graph transformer architectures that top the ZINC leaderboard also perform well on richer datasets. Investing in expressive architectures pays off regardless of feature richness.