Molecular Property Prediction: Predicting Chemical Properties from Molecular Graphs

A molecule is a graph: atoms are nodes, bonds are edges. GNNs learn to predict toxicity, solubility, and binding affinity from this structure, which can shorten early-stage screening in drug discovery from years of experiments to weeks of computation.

TL;DR

  • Molecules are natural graphs: atoms as nodes (features: element, charge, hybridization), bonds as edges (features: bond type, aromaticity). GNNs learn structure-property relationships from this representation.
  • Two prediction regimes: 2D graph properties (toxicity and solubility, predicted from molecular topology) and 3D properties (energy and forces, predicted from atom coordinates with geometry-aware GNNs that are invariant or equivariant to rotation).
  • GNNs often outperform molecular fingerprints (Morgan, ECFP) by 5-15% on standard benchmarks because they learn task-specific representations rather than using fixed, hand-designed features.
  • Drug discovery pipeline: virtual screening (filter billions of candidates), lead optimization (predict property changes from structural modifications), and ADMET prediction (absorption, distribution, metabolism, excretion, toxicity).
  • Standard benchmarks: QM9 (quantum properties), MoleculeNet (pharmacological properties), PCQM4Mv2 (HOMO-LUMO gap). These enable rigorous comparison of molecular GNN architectures.

Molecular property prediction uses graph neural networks to predict chemical and pharmacological properties from molecular structure. Atoms become nodes, chemical bonds become edges, and the GNN learns to map molecular topology (and optionally 3D geometry) to properties like toxicity, solubility, binding affinity, and quantum-mechanical energies. This is one of the highest-impact applications of graph ML, accelerating drug discovery from years-long experimental cycles to weeks of computational screening.

Molecular graph representation

Each molecule is converted to a graph with rich features:

molecular_graph.py
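import torch
from rdkit import Chem
from torch_geometric.data import Data

def mol_to_graph(smiles):
    """Convert a SMILES string to a PyG graph."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # RDKit returns None for unparsable SMILES
        raise ValueError(f"Could not parse SMILES: {smiles!r}")

    # Node features (per atom); enums and bools are cast to plain numbers
    node_features = []
    for atom in mol.GetAtoms():
        node_features.append([
            atom.GetAtomicNum(),            # element (C=6, N=7, O=8)
            atom.GetFormalCharge(),         # charge
            atom.GetNumImplicitHs(),        # implicit hydrogen count
            int(atom.GetIsAromatic()),      # aromaticity (0 or 1)
            int(atom.GetHybridization()),   # sp, sp2, sp3 (RDKit enum as int)
        ])

    # Edge features (per bond)
    edges, edge_features = [], []
    for bond in mol.GetBonds():
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        edges.extend([[i, j], [j, i]])  # undirected: store both directions
        feat = [bond.GetBondTypeAsDouble(), float(bond.GetIsAromatic())]
        edge_features.extend([feat, feat])

    x = torch.tensor(node_features, dtype=torch.float)
    # reshape keeps the shapes valid for molecules without bonds
    edge_index = torch.tensor(edges, dtype=torch.long).reshape(-1, 2).t().contiguous()
    edge_attr = torch.tensor(edge_features, dtype=torch.float).reshape(-1, 2)
    return Data(x=x, edge_index=edge_index, edge_attr=edge_attr)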

SMILES to graph conversion. Drug-like molecules typically become compact graphs of roughly 10-100 nodes (heavy atoms). The GNN learns which structural patterns predict the target property.
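To make the edge layout concrete, here is a small dependency-free sketch for ethanol (SMILES CCO), with atom indices and bonds written out by hand rather than parsed by RDKit. The helper name `to_coo` is hypothetical; it only mirrors how each bond is stored in both directions and how the pairs are transposed into the two-row layout that PyG expects for `edge_index`.

```python
# Ethanol (CCO): atoms 0=C, 1=C, 2=O; bonds written by hand for illustration.
bonds = [(0, 1, 1.0), (1, 2, 1.0)]  # (begin atom, end atom, bond order)


def to_coo(bonds):
    """Return (edge_index, edge_attr) with each bond stored in both directions."""
    pairs, attrs = [], []
    for i, j, order in bonds:
        pairs.extend([(i, j), (j, i)])    # undirected bond -> two directed edges
        attrs.extend([[order], [order]])  # same features for both directions
    # Transpose the list of pairs into two rows: sources and targets.
    sources = [s for s, _ in pairs]
    targets = [t for _, t in pairs]
    return [sources, targets], attrs


edge_index, edge_attr = to_coo(bonds)
print(edge_index)      # [[0, 1, 1, 2], [1, 0, 2, 1]]
print(len(edge_attr))  # 4
```

Two bonds yield four directed edges, so a molecule with B bonds has an `edge_index` of shape [2, 2B] and an `edge_attr` with 2B rows.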

2D vs 3D molecular properties

2D graph properties

Many pharmacological properties depend primarily on molecular topology (which atoms are connected to which) rather than 3D geometry:

  • Toxicity: Presence of toxic substructures (functional groups)
  • Solubility: Balance of hydrophilic and hydrophobic groups
  • Drug-likeness: Molecular weight, ring count, hydrogen bond donors/acceptors

Standard GNNs (GCN, GAT, GIN) work well for these properties. GIN is particularly popular because its sum aggregation with an MLP update is as expressive as the Weisfeiler-Lehman (1-WL) graph isomorphism test, the upper bound for standard message-passing GNNs.
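As a rough illustration of GIN's sum aggregation, the sketch below spells out one GIN-style update in plain PyTorch. The class name `TinyGINLayer` and the toy inputs are made up for this example; in practice PyG's `GINConv` provides this layer, and this is a simplified stand-in rather than its implementation.

```python
import torch
from torch import nn


class TinyGINLayer(nn.Module):
    """One GIN-style update: h_v <- MLP((1 + eps) * h_v + sum of neighbor h_u)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.eps = nn.Parameter(torch.zeros(1))  # learnable self-weight
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, out_dim), nn.ReLU(), nn.Linear(out_dim, out_dim)
        )

    def forward(self, x, edge_index):
        src, dst = edge_index  # messages flow src -> dst
        agg = torch.zeros_like(x)
        agg.index_add_(0, dst, x[src])  # sum (not mean) over neighbors
        return self.mlp((1 + self.eps) * x + agg)


# Toy 3-atom chain (0-1-2) with both edge directions, 5 features per atom.
x = torch.randn(3, 5)
edge_index = torch.tensor([[0, 1, 1, 2], [1, 0, 2, 1]])
layer = TinyGINLayer(5, 16)
h = layer(x, edge_index)
print(h.shape)  # torch.Size([3, 16])
```

Summing rather than averaging keeps neighbor counts visible to the model: an atom with two identical neighbors produces a different aggregate than an atom with one.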

3D geometric properties

Quantum-mechanical properties depend on exact 3D atom positions:

  • Energy: Total electronic energy of the molecule
  • Forces: Forces on each atom (for molecular dynamics)
  • Electron density: Spatial distribution of electrons

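These require geometry-aware GNNs that use 3D coordinates and respect rotational symmetry: invariant models such as SchNet operate on interatomic distances, while equivariant models such as PaiNN and MACE also propagate directional (vector or tensor) features.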
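Respecting rotational symmetry means, at minimum, that a predicted energy must not change when the whole molecule is rotated. A minimal sketch in plain Python, using made-up coordinates: interatomic distances, the inputs that distance-based models such as SchNet build on, are unchanged by a rotation, while the raw coordinates are not.

```python
import math
from itertools import combinations

# Made-up 3D coordinates for a three-atom molecule (illustrative values only).
coords = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]


def rotate_z(point, angle):
    """Rotate a point about the z-axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)


def pairwise_distances(points):
    """All interatomic distances, in a fixed pair order."""
    return [math.dist(a, b) for a, b in combinations(points, 2)]


rotated = [rotate_z(p, math.radians(40)) for p in coords]

before = pairwise_distances(coords)
after = pairwise_distances(rotated)

# Distances are rotation-invariant; raw coordinates are not.
print(all(math.isclose(d0, d1) for d0, d1 in zip(before, after)))  # True
print(rotated[1] != coords[1])  # True
```

A model fed raw coordinates would have to learn this symmetry from data; a model built on distances, or on features that rotate together with the molecule, gets it by construction.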

Graph-level prediction

Molecular property prediction is a graph-level task: one prediction per molecule, not per atom. This requires a readout (pooling) step that aggregates all node representations into a single molecular representation:

  • Sum pooling: Preserves molecular size information (larger molecules have larger sums)
  • Mean pooling: Size-invariant, suitable for intensive properties
  • Set2Set / attention pooling: Learned, weighted aggregation that identifies the most important atoms
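The difference between sum and mean pooling is easy to see on a tiny batch. The sketch below uses plain PyTorch with made-up node embeddings and a `batch` vector that assigns each atom to its molecule; PyG offers `global_add_pool` and `global_mean_pool` for the same purpose, so treat this as an illustration of the arithmetic rather than the library code.

```python
import torch

# Node embeddings for two molecules packed into one batch:
# molecule 0 has 2 atoms, molecule 1 has 3 atoms.
h = torch.tensor([[1.0, 2.0], [3.0, 4.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0]])
batch = torch.tensor([0, 0, 1, 1, 1])  # molecule id for every atom
num_graphs = 2

# Sum pooling: add up the atom embeddings of each molecule.
sum_pool = torch.zeros(num_graphs, h.size(1)).index_add_(0, batch, h)

# Mean pooling: divide each sum by the atom count of that molecule.
counts = torch.bincount(batch, minlength=num_graphs).unsqueeze(1)
mean_pool = sum_pool / counts

print(sum_pool.tolist())   # [[4.0, 6.0], [3.0, 3.0]]
print(mean_pool.tolist())  # [[2.0, 3.0], [1.0, 1.0]]
```

Molecule 1 has three identical atoms: its sum grows with every added atom, while its mean stays at the per-atom value. This is why sum pooling carries size information and mean pooling does not.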

Key benchmarks

  • QM9: 134K molecules, 12 quantum properties. The standard benchmark for 3D molecular GNNs.
  • MoleculeNet: Suite of datasets for drug discovery properties (HIV inhibition, BBBP, Tox21). Standard for 2D molecular GNNs.
  • PCQM4Mv2: about 3.7M molecules (roughly 3.4M of them in the training split), HOMO-LUMO gap prediction. Large-scale benchmark from OGB-LSC.
  • MD17: Molecular dynamics trajectories for energy and force prediction. Standard for equivariant models.

Frequently asked questions

How are molecules represented as graphs?

Atoms become nodes with features (atomic number, charge, hybridization). Chemical bonds become edges with features (bond type: single/double/triple, aromaticity). 3D molecular graphs also include atom coordinates for spatial reasoning. This representation lets GNNs learn structure-property relationships.

What molecular properties can GNNs predict?

GNNs predict both quantum-mechanical properties (energy, electron density, orbital energies) and pharmacological properties (toxicity, solubility, binding affinity, bioavailability). Quantum properties generally call for 3D coordinates and geometry-aware (invariant or equivariant) models. Pharmacological properties can often be predicted from 2D molecular graphs.

How does molecular property prediction compare to traditional methods?

GNNs often outperform traditional molecular fingerprints (Morgan, ECFP) by 5-15% on standard benchmarks because they learn task-specific representations rather than using fixed feature extractors. For 3D properties, geometry-aware GNNs approach the accuracy of the quantum-chemical calculations they are trained on at a fraction of the compute cost.
