A protein is a chain of amino acids that folds into a specific three-dimensional structure, and that structure determines its biological function. Graph neural networks represent proteins as residue-level graphs where nodes are amino acids and edges connect residues that are close in 3D space. This representation enables GNNs to predict structure from sequence, infer function from structure, and design new proteins with desired properties.
Protein graph construction
Two complementary graph representations capture different aspects of protein structure:
Sequence graph
Connects each amino acid to its sequential neighbors in the polypeptide chain. Edges follow the backbone: residue i connects to residues i-1 and i+1. This captures local structural motifs (alpha helices, beta strands) that depend on sequential patterns.
Contact graph
Connects residues that are spatially close in the folded 3D structure (typically within 8-10 angstroms between C-alpha atoms). Two residues far apart in sequence (e.g., positions 10 and 200) may be adjacent in 3D space. The contact graph captures the global fold.
Most protein GNNs use both: sequential edges for local context and contact edges for long-range 3D interactions.
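The two edge sets above can be built directly from C-alpha coordinates. The sketch below (with an assumed 8-angstrom cutoff and a hypothetical `build_protein_graph` helper, not any particular library's API) returns sequence edges and contact edges separately, excluding adjacent backbone pairs from the contact set:

```python
import numpy as np

def build_protein_graph(ca_coords, contact_cutoff=8.0):
    """Build sequence and contact edge lists from C-alpha coordinates.

    ca_coords: (N, 3) array of C-alpha positions, one row per residue.
    Returns two lists of (i, j) residue-index pairs with i < j.
    """
    n = len(ca_coords)
    # Sequence edges: each residue to its backbone successor (i, i+1).
    seq_edges = [(i, i + 1) for i in range(n - 1)]

    # Pairwise C-alpha distances via broadcasting: (N, N) matrix.
    diffs = ca_coords[:, None, :] - ca_coords[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)

    # Contact edges: spatially close pairs, skipping backbone neighbors
    # (j > i + 1) so the two edge types stay disjoint.
    contact_edges = [
        (i, j)
        for i in range(n)
        for j in range(i + 2, n)
        if dists[i, j] < contact_cutoff
    ]
    return seq_edges, contact_edges
```

In practice the two edge types are often tagged with a type feature so the GNN can treat backbone and spatial neighbors differently.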
Node and edge features
- Node features: Amino acid type (20 classes), backbone torsion angles (phi, psi, omega), secondary structure (helix, strand, coil), solvent accessibility, and evolutionary conservation from multiple sequence alignments.
- Edge features: Euclidean distance between C-alpha atoms, relative orientation (rotation matrix or quaternion), sequential distance along the chain, and hydrogen bond indicators.
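A minimal version of these feature vectors can be assembled with NumPy. This sketch covers only a subset of the features listed above (one-hot amino acid type plus phi/psi torsion angles per node; C-alpha distance plus sequence separation per edge); the alphabet ordering and function names are illustrative, not from any library:

```python
import numpy as np

# Standard one-letter amino acid codes; the ordering here is arbitrary.
AA_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def node_features(sequence, phi, psi):
    """One-hot amino acid type concatenated with backbone torsion angles.

    Angles (in degrees) are encoded as (sin, cos) pairs so that -180 and
    +180 map to the same point. Returns an (N, 24) array.
    """
    n = len(sequence)
    one_hot = np.zeros((n, 20))
    for i, aa in enumerate(sequence):
        one_hot[i, AA_ALPHABET.index(aa)] = 1.0
    ang = np.deg2rad(np.stack([phi, psi], axis=-1))                 # (N, 2)
    torsion = np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)   # (N, 4)
    return np.concatenate([one_hot, torsion], axis=-1)

def edge_features(ca_coords, edges):
    """Per-edge C-alpha distance and sequence separation |i - j|."""
    feats = []
    for i, j in edges:
        dist = np.linalg.norm(ca_coords[i] - ca_coords[j])
        feats.append([dist, abs(i - j)])
    return np.array(feats)
```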
GNN applications in protein science
Structure prediction
Predicting the 3D coordinates of each amino acid from the sequence. AlphaFold2 and ESMFold both use graph-like attention over residue pairs, with equivariant updates to refine 3D positions. These models have achieved experimental-level accuracy for most single-chain proteins.
Function prediction
Given a protein's structure, predict its biological function (enzyme class, binding specificity, cellular location). GNNs on the contact graph learn structural motifs that are associated with specific functions, even when sequence similarity is low.
Binding site prediction
Identify which residues on the protein surface interact with drug molecules. This is a node classification task on the protein graph: label each residue as binding-site or non-binding-site. GNNs capture the 3D shape complementarity that determines binding.
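The node classification setup can be sketched in a few lines: aggregate each residue's neighborhood over the protein graph, then score every residue independently. The weight matrices and function names below are illustrative, untrained placeholders, not a real binding-site model:

```python
import numpy as np

def mean_aggregate(node_feats, edges):
    """One round of mean-neighbor message passing over an undirected edge list."""
    n, d = node_feats.shape
    agg = np.zeros((n, d))
    deg = np.zeros(n)
    for i, j in edges:
        agg[i] += node_feats[j]; deg[i] += 1
        agg[j] += node_feats[i]; deg[j] += 1
    deg = np.maximum(deg, 1)  # isolated residues keep a zero message
    return agg / deg[:, None]

def binding_site_scores(node_feats, edges, w_self, w_nbr, w_out):
    """Score each residue as binding-site vs. non-binding-site.

    Combines each residue's own features with its aggregated neighborhood,
    then applies a per-node sigmoid classifier. Returns (N,) probabilities.
    """
    h = np.tanh(node_feats @ w_self + mean_aggregate(node_feats, edges) @ w_nbr)
    logits = h @ w_out
    return 1.0 / (1.0 + np.exp(-logits))
```

Real models stack several such layers so that each residue's score reflects a larger structural neighborhood, but the per-residue output format is the same.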
Protein design (inverse folding)
Given a target 3D structure, design an amino acid sequence that folds into that structure. This is the inverse of structure prediction. GNNs process the target structure graph and predict the optimal amino acid at each position. ProteinMPNN, a GNN-based design method, achieves experimental success rates above 50%.
Why equivariance matters for proteins
A protein's function depends on its shape, and shape is unchanged by rotation: the same protein in different orientations has the same function. Equivariant GNNs build this symmetry into the architecture: predicted properties (function, binding) are invariant, while predicted coordinates are equivariant (they rotate consistently with the input).
Non-equivariant models require data augmentation with random rotations, which is wasteful and imperfect. Equivariant architectures (GVP, EGNN, PaiNN) achieve better accuracy with less training data.
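Both symmetry properties can be checked numerically on a toy layer. The sketch below is a heavily simplified EGNN-style update (inspired by, but not equal to, the published EGNN architecture): feature messages depend only on node features and squared distances, so they are rotation-invariant, while coordinates move along difference vectors x_i - x_j, so they rotate with the input. All weights are arbitrary:

```python
import numpy as np

def egnn_layer(h, x, edges, w_msg, w_coord):
    """One simplified EGNN-style layer.

    h: (N, d) invariant node features; x: (N, 3) coordinates.
    w_msg: (2d + 1, d) message weights; w_coord: (d,) scalar gate weights.
    """
    h_new = h.copy()
    x_new = x.copy()
    for i, j in edges:
        for a, b in ((i, j), (j, i)):  # process both directions
            r2 = np.sum((x[a] - x[b]) ** 2)  # rotation-invariant input
            m = np.tanh(np.concatenate([h[a], h[b], [r2]]) @ w_msg)
            h_new[a] = h_new[a] + m
            # A scalar gate times the difference vector keeps the
            # coordinate update equivariant under rotation.
            x_new[a] = x_new[a] + float(m @ w_coord) * (x[a] - x[b])
    return h_new, x_new
```

Rotating the input coordinates leaves the updated features unchanged and rotates the updated coordinates by the same matrix, which is exactly the invariance/equivariance split described above.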