Molecular property prediction uses graph neural networks (GNNs) to predict chemical and pharmacological properties from molecular structure. Atoms become nodes, chemical bonds become edges, and the GNN learns to map molecular topology (and optionally 3D geometry) to properties such as toxicity, solubility, binding affinity, and quantum-mechanical energies. This is one of the highest-impact applications of graph ML: computational screening can evaluate candidate molecules in weeks, where purely experimental cycles can take years.
Molecular graph representation
Each molecule is converted to a graph with rich features:
```python
import torch
from rdkit import Chem
from torch_geometric.data import Data


def mol_to_graph(smiles):
    """Convert a SMILES string to a PyG graph."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # RDKit returns None for unparsable SMILES
        raise ValueError(f"Invalid SMILES: {smiles!r}")

    # Node features (per atom)
    node_features = []
    for atom in mol.GetAtoms():
        node_features.append([
            atom.GetAtomicNum(),           # element (C=6, N=7, O=8)
            atom.GetFormalCharge(),        # charge
            atom.GetTotalNumHs(),          # hydrogen count (implicit + explicit)
            int(atom.GetIsAromatic()),     # aromaticity (0 or 1)
            int(atom.GetHybridization()),  # sp, sp2, sp3 (RDKit enum as int)
        ])

    # Edge features (per bond)
    edges, edge_features = [], []
    for bond in mol.GetBonds():
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        edges.extend([[i, j], [j, i]])  # undirected: store both directions
        feat = [bond.GetBondTypeAsDouble(), float(bond.GetIsAromatic())]
        edge_features.extend([feat, feat])

    # reshape(-1, 2) keeps the shapes valid for molecules with no bonds
    return Data(
        x=torch.tensor(node_features, dtype=torch.float),
        edge_index=torch.tensor(edges, dtype=torch.long).reshape(-1, 2).t().contiguous(),
        edge_attr=torch.tensor(edge_features, dtype=torch.float).reshape(-1, 2),
    )
```
SMILES to graph conversion. Each drug-like molecule becomes a compact graph of roughly 10-100 nodes. The GNN learns which structural patterns predict the target property.
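The feature vectors above store categorical quantities such as the element and the hybridization state as raw integers, which a network may read as ordered magnitudes. One common alternative is one-hot encoding. The following is a minimal sketch in plain Python; the `one_hot` helper and the `ELEMENTS` vocabulary are hypothetical names chosen for illustration, not part of RDKit or PyG.

```python
# Hypothetical vocabulary of atomic numbers; any other element
# falls into a final "other" slot.
ELEMENTS = [6, 7, 8, 9, 16, 17]  # C, N, O, F, S, Cl


def one_hot(value, vocabulary):
    """Return a one-hot list with one extra trailing slot for unknown values."""
    encoding = [0] * (len(vocabulary) + 1)
    index = vocabulary.index(value) if value in vocabulary else len(vocabulary)
    encoding[index] = 1
    return encoding


carbon = one_hot(6, ELEMENTS)    # first slot set
bromine = one_hot(35, ELEMENTS)  # not in the vocabulary: "other" slot set
```

The same helper could encode hybridization or bond type; the trade-off is a wider feature vector in exchange for not imposing an artificial ordering on the categories.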
2D vs 3D molecular properties
2D graph properties
Many pharmacological properties depend primarily on molecular topology (which atoms are connected to which) rather than 3D geometry:
- Toxicity: Presence of toxic substructures (functional groups)
- Solubility: Balance of hydrophilic and hydrophobic groups
- Drug-likeness: Molecular weight, ring count, hydrogen bond donors/acceptors
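Standard GNNs (GCN, GAT, GIN) work well for these properties. GIN is particularly popular because its sum aggregation makes it as expressive as the Weisfeiler-Lehman (1-WL) graph isomorphism test, which is the upper bound for standard message-passing GNNs; mean and max aggregation can fail to distinguish neighborhoods that sum aggregation separates.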
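To see why the choice of aggregation matters, the sketch below compares sum, mean, and max on two different neighborhoods. It uses NumPy and made-up one-hot neighbor features, and it illustrates only the aggregation step, not a full GIN layer (which also applies an MLP to the aggregated result).

```python
import numpy as np

# Neighbor feature multisets for two different atoms (one-hot element types).
# neighbors_b contains every neighbor of neighbors_a twice, so the two
# neighborhoods are structurally different.
neighbors_a = np.array([[1.0, 0.0], [0.0, 1.0]])
neighbors_b = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])

for name, aggregate in [("sum", np.sum), ("mean", np.mean), ("max", np.max)]:
    out_a = aggregate(neighbors_a, axis=0)
    out_b = aggregate(neighbors_b, axis=0)
    distinguishes = not np.allclose(out_a, out_b)
    print(f"{name}: distinguishes the two neighborhoods = {distinguishes}")
```

Only the sum produces different outputs here, because mean and max discard how many times each neighbor type occurs.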
3D geometric properties
Quantum-mechanical properties depend on exact 3D atom positions:
- Energy: Total electronic energy of the molecule
- Forces: Forces on each atom (for molecular dynamics)
- Electron density: Spatial distribution of electrons
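These require geometric GNNs that take 3D coordinates as input and respect rotational and translational symmetry. Invariant models such as SchNet use only interatomic distances, while equivariant models such as PaiNN and MACE also propagate directional (vector or tensor) features that rotate with the molecule.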
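The symmetry requirement can be checked numerically. The sketch below, which assumes NumPy and a random five-atom "molecule" invented for illustration, rotates and translates the coordinates and confirms that interatomic distances are unchanged (invariance) while displacement vectors rotate along with the molecule (equivariance). It is not taken from any of the models named above.

```python
import numpy as np

rng = np.random.default_rng(0)
positions = rng.normal(size=(5, 3))  # hypothetical 5-atom molecule


def pairwise_distances(pos):
    """Return the (n_atoms, n_atoms) matrix of Euclidean distances."""
    diff = pos[:, None, :] - pos[None, :, :]
    return np.linalg.norm(diff, axis=-1)


# Random orthogonal matrix via QR; flip one column if needed so that
# det = +1, i.e. a proper rotation rather than a reflection.
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(q) < 0:
    q[:, 0] *= -1
translation = rng.normal(size=3)

# Apply the rigid motion to every atom: x -> q @ x + translation.
moved = positions @ q.T + translation

# Invariance: scalar distances do not change under the rigid motion.
same_distances = np.allclose(pairwise_distances(positions),
                             pairwise_distances(moved))

# Equivariance: a displacement vector between two atoms rotates with q.
bond_vector = positions[1] - positions[0]
moved_bond_vector = moved[1] - moved[0]
vector_rotates = np.allclose(moved_bond_vector, q @ bond_vector)
```

Scalar targets such as energy should behave like the distances, while vector targets such as forces should behave like the displacement vector.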
Graph-level prediction
Molecular property prediction is a graph-level task: one prediction per molecule, not per atom. This requires a readout (pooling) step that aggregates all node representations into a single molecular representation:
- Sum pooling: Preserves molecular size information (larger molecules have larger sums)
- Mean pooling: Size-invariant, suitable for intensive properties
- Set2Set / attention pooling: Learned, weighted aggregation that identifies the most important atoms
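The difference between sum and mean pooling can be made concrete with a small NumPy sketch. The `batch` vector, which assigns each atom to its molecule, mirrors the convention PyG uses for mini-batches; the embeddings are made up, and every atom gets the same vector so the size effect is easy to see. In PyG itself, `global_add_pool` and `global_mean_pool` in `torch_geometric.nn` provide these readouts.

```python
import numpy as np

# Node embeddings for a batch of two molecules: 2 atoms and 4 atoms.
node_embeddings = np.ones((6, 3))
batch = np.array([0, 0, 1, 1, 1, 1])  # molecule index of each atom
num_graphs = int(batch.max()) + 1

# Sum pooling: scatter-add each atom's embedding into its molecule's row.
sum_pooled = np.zeros((num_graphs, node_embeddings.shape[1]))
np.add.at(sum_pooled, batch, node_embeddings)

# Mean pooling: divide each molecule's sum by its atom count.
atom_counts = np.bincount(batch, minlength=num_graphs)[:, None]
mean_pooled = sum_pooled / atom_counts

print(sum_pooled)   # rows differ: the larger molecule has the larger sum
print(mean_pooled)  # rows are identical: molecule size is averaged out
```

Sum pooling suits properties that grow with molecule size, while mean pooling suits properties that should not.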
Key benchmarks
- QM9: 134K molecules, 12 quantum properties. The standard benchmark for 3D molecular GNNs.
- MoleculeNet: Suite of datasets for drug discovery properties (HIV inhibition, BBBP, Tox21). Standard for 2D molecular GNNs.
- PCQM4Mv2: About 3.7M molecules (roughly 3.4M in the training split), HOMO-LUMO gap prediction. Large-scale benchmark from OGB-LSC.
- MD17: Molecular dynamics trajectories for energy and force prediction. Standard for equivariant models.