A protein is a chain of amino acids that folds into a specific three-dimensional structure, and that structure determines its biological function. Graph neural networks represent proteins as residue-level graphs where nodes are amino acids and edges connect residues that are close in 3D space. This representation enables GNNs to predict structure from sequence, infer function from structure, and design new proteins with desired properties.
Protein graph construction
Two complementary graph representations capture different aspects of protein structure:
Sequence graph
Connects each amino acid to its sequential neighbors in the polypeptide chain. Edges follow the backbone: residue i connects to residues i-1 and i+1. This captures local structural motifs (alpha helices, beta strands) that depend on sequential patterns.
Contact graph
Connects residues that are spatially close in the folded 3D structure (typically within 8-10 angstroms between C-alpha atoms). Two residues far apart in sequence (e.g., positions 10 and 200) may be adjacent in 3D space. The contact graph captures the global fold.
Most protein GNNs use both: sequential edges for local context and contact edges for long-range 3D interactions.
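The two edge sets above can be built directly from C-alpha coordinates. The sketch below (with an assumed 8-angstrom cutoff and a hypothetical `build_protein_graph` helper, not any particular library's API) returns sequence edges and contact edges separately, excluding adjacent backbone pairs from the contact set:

```python
import numpy as np

def build_protein_graph(ca_coords, contact_cutoff=8.0):
    """Build sequence and contact edge lists from C-alpha coordinates.

    ca_coords: (N, 3) array of C-alpha positions, one row per residue.
    Returns two lists of (i, j) residue-index pairs with i < j.
    """
    n = len(ca_coords)
    # Sequence edges: each residue to its backbone successor (i, i+1).
    seq_edges = [(i, i + 1) for i in range(n - 1)]

    # Pairwise C-alpha distances via broadcasting: (N, N) matrix.
    diffs = ca_coords[:, None, :] - ca_coords[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)

    # Contact edges: spatially close pairs, skipping backbone neighbors
    # (j > i + 1) so the two edge types stay disjoint.
    contact_edges = [
        (i, j)
        for i in range(n)
        for j in range(i + 2, n)
        if dists[i, j] < contact_cutoff
    ]
    return seq_edges, contact_edges
```

In practice the two edge types are often tagged with a type feature so the GNN can treat backbone and spatial neighbors differently.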
Node and edge features
- Node features: Amino acid type (20 classes), backbone torsion angles (phi, psi, omega), secondary structure (helix, strand, coil), solvent accessibility, and evolutionary conservation from multiple sequence alignments.
- Edge features: Euclidean distance between C-alpha atoms, relative orientation (rotation matrix or quaternion), sequential distance along the chain, and hydrogen bond indicators.
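A minimal version of these feature vectors can be assembled with NumPy. This sketch covers only a subset of the features listed above (one-hot amino acid type plus phi/psi torsion angles per node; C-alpha distance plus sequence separation per edge); the alphabet ordering and function names are illustrative, not from any library:

```python
import numpy as np

# Standard one-letter amino acid codes; the ordering here is arbitrary.
AA_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def node_features(sequence, phi, psi):
    """One-hot amino acid type concatenated with backbone torsion angles.

    Angles (in degrees) are encoded as (sin, cos) pairs so that -180 and
    +180 map to the same point. Returns an (N, 24) array.
    """
    n = len(sequence)
    one_hot = np.zeros((n, 20))
    for i, aa in enumerate(sequence):
        one_hot[i, AA_ALPHABET.index(aa)] = 1.0
    ang = np.deg2rad(np.stack([phi, psi], axis=-1))                 # (N, 2)
    torsion = np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)   # (N, 4)
    return np.concatenate([one_hot, torsion], axis=-1)

def edge_features(ca_coords, edges):
    """Per-edge C-alpha distance and sequence separation |i - j|."""
    feats = []
    for i, j in edges:
        dist = np.linalg.norm(ca_coords[i] - ca_coords[j])
        feats.append([dist, abs(i - j)])
    return np.array(feats)
```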
GNN applications in protein science
Structure prediction
Predicting the 3D coordinates of each amino acid from the sequence. AlphaFold2 and ESMFold both use graph-like attention over residue pairs, with equivariant updates to refine 3D positions. These models have achieved experimental-level accuracy for most single-chain proteins.
Function prediction
Given a protein's structure, predict its biological function (enzyme class, binding specificity, cellular location). GNNs on the contact graph learn structural motifs that are associated with specific functions, even when sequence similarity is low.
Binding site prediction
Identify which residues on the protein surface interact with drug molecules. This is a node classification task on the protein graph: label each residue as binding-site or non-binding-site. GNNs capture the 3D shape complementarity that determines binding.
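The node classification setup can be sketched in a few lines: aggregate each residue's neighborhood over the protein graph, then score every residue independently. The weight matrices and function names below are illustrative, untrained placeholders, not a real binding-site model:

```python
import numpy as np

def mean_aggregate(node_feats, edges):
    """One round of mean-neighbor message passing over an undirected edge list."""
    n, d = node_feats.shape
    agg = np.zeros((n, d))
    deg = np.zeros(n)
    for i, j in edges:
        agg[i] += node_feats[j]; deg[i] += 1
        agg[j] += node_feats[i]; deg[j] += 1
    deg = np.maximum(deg, 1)  # isolated residues keep a zero message
    return agg / deg[:, None]

def binding_site_scores(node_feats, edges, w_self, w_nbr, w_out):
    """Score each residue as binding-site vs. non-binding-site.

    Combines each residue's own features with its aggregated neighborhood,
    then applies a per-node sigmoid classifier. Returns (N,) probabilities.
    """
    h = np.tanh(node_feats @ w_self + mean_aggregate(node_feats, edges) @ w_nbr)
    logits = h @ w_out
    return 1.0 / (1.0 + np.exp(-logits))
```

Real models stack several such layers so that each residue's score reflects a larger structural neighborhood, but the per-residue output format is the same.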
Protein design (inverse folding)
Given a target 3D structure, design an amino acid sequence that folds into that structure. This is the inverse of structure prediction. GNNs process the target structure graph and predict the optimal amino acid at each position. ProteinMPNN, a GNN-based design method, achieves experimental success rates above 50%.
Why equivariance matters for proteins
A protein's function depends on its shape, and shape is unchanged by rotation: the same protein in different orientations has the same function. Equivariant GNNs build this symmetry into the architecture: predicted properties (function, binding) are invariant, while predicted coordinates are equivariant (they rotate consistently with the input).
Non-equivariant models require data augmentation with random rotations, which is wasteful and imperfect. Equivariant architectures (GVP, EGNN, PaiNN) achieve better accuracy with less training data.
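Both symmetry properties can be checked numerically on a toy layer. The sketch below is a heavily simplified EGNN-style update (inspired by, but not equal to, the published EGNN architecture): feature messages depend only on node features and squared distances, so they are rotation-invariant, while coordinates move along difference vectors x_i - x_j, so they rotate with the input. All weights are arbitrary:

```python
import numpy as np

def egnn_layer(h, x, edges, w_msg, w_coord):
    """One simplified EGNN-style layer.

    h: (N, d) invariant node features; x: (N, 3) coordinates.
    w_msg: (2d + 1, d) message weights; w_coord: (d,) scalar gate weights.
    """
    h_new = h.copy()
    x_new = x.copy()
    for i, j in edges:
        for a, b in ((i, j), (j, i)):  # process both directions
            r2 = np.sum((x[a] - x[b]) ** 2)  # rotation-invariant input
            m = np.tanh(np.concatenate([h[a], h[b], [r2]]) @ w_msg)
            h_new[a] = h_new[a] + m
            # A scalar gate times the difference vector keeps the
            # coordinate update equivariant under rotation.
            x_new[a] = x_new[a] + float(m @ w_coord) * (x[a] - x[b])
    return h_new, x_new
```

Rotating the input coordinates leaves the updated features unchanged and rotates the updated coordinates by the same matrix, which is exactly the invariance/equivariance split described above.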