
Protein Structure: Representing and Predicting Protein 3D Structure with Graphs

A protein is a chain of amino acids that folds into a specific 3D shape. That shape determines function. Graph neural networks model proteins as residue graphs, enabling structure prediction, function annotation, and protein design.

TL;DR

  1. Proteins as graphs: amino acid residues are nodes; spatial proximity (within 8-10 angstroms) and sequential connections form edges. Features include residue type, torsion angles, and secondary structure.
  2. Two graph types: the sequence graph (edges connect sequential residues, captures local structure) and the contact graph (edges connect spatially close residues, captures the 3D fold). Both are typically used together.
  3. GNN applications: structure prediction (sequence to 3D), function prediction (structure to function), binding site prediction (which residues interact with drugs), and protein design (generating sequences for target structures).
  4. AlphaFold2 uses graph-like attention over residue pairs (Evoformer) plus equivariant structure refinement. ESMFold uses a protein language model with a GNN-style structure module.
  5. Equivariance is essential: protein function depends on 3D shape, and rotating a protein changes neither its shape nor its function. SE(3)-equivariant GNNs ensure predictions respect this physical symmetry.

A protein is a chain of amino acids that folds into a specific three-dimensional structure, and that structure determines its biological function. Graph neural networks represent proteins as residue-level graphs where nodes are amino acids and edges connect residues that are close in 3D space. This representation enables GNNs to predict structure from sequence, infer function from structure, and design new proteins with desired properties.

Protein graph construction

Two complementary graph representations capture different aspects of protein structure:

Sequence graph

Connects each amino acid to its sequential neighbors in the polypeptide chain. Edges follow the backbone: residue i connects to residues i-1 and i+1. This captures local structural motifs (alpha helices, beta strands) that depend on sequential patterns.

Contact graph

Connects residues that are spatially close in the folded 3D structure (typically within 8-10 angstroms between C-alpha atoms). Two residues far apart in sequence (positions 10 and 200) may be adjacent in 3D space. The contact graph captures the global fold.

Most protein GNNs use both: sequential edges for local context and contact edges for long-range 3D interactions.
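The two edge sets can be built directly from C-alpha coordinates. Below is a minimal NumPy sketch (the cutoff value and the `build_protein_graph` helper are illustrative assumptions; in practice PyTorch Geometric provides optimized utilities such as `radius_graph` for the contact edges):

```python
import numpy as np

def build_protein_graph(ca_coords, cutoff=8.0):
    """Build sequence and contact edge lists from C-alpha coordinates.

    ca_coords: (N, 3) array of C-alpha positions in angstroms.
    Returns (sequence_edges, contact_edges) as lists of (i, j) pairs.
    """
    n = len(ca_coords)
    # Sequence edges: backbone neighbors i <-> i+1.
    sequence_edges = [(i, i + 1) for i in range(n - 1)]

    # Contact edges: all residue pairs within the distance cutoff.
    diffs = ca_coords[:, None, :] - ca_coords[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    contact_edges = [(i, j) for i in range(n) for j in range(i + 1, n)
                     if dists[i, j] <= cutoff]
    return sequence_edges, contact_edges

# Toy example: four residues whose chain folds back on itself, so
# residues 0 and 3 are distant in sequence but adjacent in space.
coords = np.array([[0.0, 0.0, 0.0],
                   [3.8, 0.0, 0.0],
                   [3.8, 3.8, 0.0],
                   [0.0, 3.8, 0.0]])
seq_e, con_e = build_protein_graph(coords, cutoff=4.0)
# The contact edge (0, 3) exists even though 0 and 3 are not
# sequential neighbors: this is what the contact graph adds.
```

With a tighter 4-angstrom cutoff the diagonal pairs (0, 2) and (1, 3) are excluded, while the fold-back contact (0, 3) survives.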

Node and edge features

  • Node features: Amino acid type (20 classes), backbone torsion angles (phi, psi, omega), secondary structure (helix, strand, coil), solvent accessibility, and evolutionary conservation from multiple sequence alignments.
  • Edge features: Euclidean distance between C-alpha atoms, relative orientation (rotation matrix or quaternion), sequential distance along the chain, and hydrogen bond indicators.
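A common way to assemble the node features above is a one-hot residue encoding concatenated with sin/cos-encoded torsion angles (the encoding avoids the discontinuity at +/- pi). This sketch covers only residue type and phi/psi angles; the helper name and feature layout are assumptions, not a fixed PyG convention:

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues

def residue_node_features(sequence, phi, psi):
    """One-hot residue type plus sin/cos-encoded backbone torsions.

    sequence: string of one-letter residue codes.
    phi, psi: arrays of torsion angles in radians, one per residue.
    Returns an (N, 24) feature matrix: 20 one-hot + 4 angle features.
    """
    n = len(sequence)
    one_hot = np.zeros((n, 20))
    for i, aa in enumerate(sequence):
        one_hot[i, AMINO_ACIDS.index(aa)] = 1.0
    # Encode each periodic angle as (sin, cos) so that angles near
    # +pi and -pi map to nearby feature values.
    angles = np.stack([np.sin(phi), np.cos(phi),
                       np.sin(psi), np.cos(psi)], axis=1)
    return np.concatenate([one_hot, angles], axis=1)
```

Secondary structure, solvent accessibility, and conservation profiles would extend the feature matrix with additional columns in the same way.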

GNN applications in protein science

Structure prediction

Predicting the 3D coordinates of each amino acid from the sequence. AlphaFold2 and ESMFold both use graph-like attention over residue pairs, with equivariant updates to refine 3D positions. These models have achieved experimental-level accuracy for most single-chain proteins.

Function prediction

Given a protein's structure, predict its biological function (enzyme class, binding specificity, cellular location). GNNs on the contact graph learn structural motifs that are associated with specific functions, even when sequence similarity is low.

Binding site prediction

Identify which residues on the protein surface interact with drug molecules. This is a node classification task on the protein graph: label each residue as binding-site or non-binding-site. GNNs capture the 3D shape complementarity that determines binding.
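As a node classification task, binding-site prediction reduces to message passing followed by a per-residue scoring head. The following is a minimal NumPy sketch of mean-aggregation message passing (a simplified GCN-style layer); real models would use trained PyG layers such as `GCNConv` rather than these hypothetical helpers:

```python
import numpy as np

def gcn_layer(x, edges, w):
    """One mean-aggregation message-passing layer (GCN-style sketch).

    x: (N, F) node features; edges: list of undirected (i, j) pairs;
    w: (F, F_out) weight matrix. Each node averages itself and its
    neighbors, then applies a linear map and ReLU.
    """
    n = len(x)
    agg = x.astype(float).copy()
    counts = np.ones(n)
    for i, j in edges:
        agg[i] += x[j]; counts[i] += 1
        agg[j] += x[i]; counts[j] += 1
    agg /= counts[:, None]
    return np.maximum(agg @ w, 0.0)

def binding_site_logits(x, edges, w, v):
    """Neighborhood-aware embeddings, then one logit per residue."""
    h = gcn_layer(x, edges, w)  # (N, F_out)
    return h @ v                # (N,) per-residue binding score
```

Thresholding the per-residue logits yields the binding-site / non-binding-site labels; in a trained model the threshold would be calibrated on validation data.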

Protein design (inverse folding)

Given a target 3D structure, design an amino acid sequence that folds into that structure. This is the inverse of structure prediction. GNNs process the target structure graph and predict the optimal amino acid at each position. ProteinMPNN, a GNN-based design method, achieves experimental success rates above 50%.
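At its simplest, the design step is a 20-way classification at each node of the structure graph. The sketch below scores positions independently for clarity; actual methods like ProteinMPNN decode the sequence autoregressively, conditioning each residue on previously chosen ones, and the embeddings and weights here are hypothetical stand-ins:

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def design_sequence(node_embeddings, w):
    """Pick the highest-scoring amino acid at each position.

    node_embeddings: (N, F) structure-derived embeddings (e.g. the
    output of a GNN over the target backbone graph).
    w: (F, 20) scoring weights mapping embeddings to residue logits.
    """
    logits = node_embeddings @ w             # (N, 20)
    idx = logits.argmax(axis=1)              # best residue per node
    return "".join(AMINO_ACIDS[i] for i in idx)
```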

Why equivariance matters for proteins

A protein's function depends on its shape, and rotating the protein changes neither. The same protein in any orientation has the same function. Equivariant GNNs guarantee this: predicted properties (function, binding) are invariant under rotation, while predicted coordinates are equivariant (they rotate consistently with the input).

Non-equivariant models require data augmentation with random rotations, which is wasteful and imperfect. Equivariant architectures (GVP, EGNN, PaiNN) achieve better accuracy with less training data.
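The distinction can be checked numerically: scalar features built from pairwise distances are unchanged by rotation (invariant), while the coordinates themselves transform with it (equivariant). A small sketch, using an arbitrary rotation about the z-axis:

```python
import numpy as np

def pairwise_distances(coords):
    """Pairwise distances: a rotation-invariant scalar feature."""
    d = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(d, axis=-1)

def rotation_z(theta):
    """Rotation matrix about the z-axis by angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

coords = np.array([[0.0, 0.0, 0.0],
                   [3.8, 0.0, 0.0],
                   [3.8, 3.8, 0.0]])
rotated = coords @ rotation_z(0.7).T

# Coordinates change under rotation, distances do not.
assert not np.allclose(coords, rotated)
assert np.allclose(pairwise_distances(coords), pairwise_distances(rotated))
```

Equivariant architectures such as GVP or EGNN build their updates from exactly these kinds of invariant scalars (and properly transforming vectors), so the symmetry holds by construction rather than by augmentation.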

Frequently asked questions

How are proteins represented as graphs?

Proteins are represented as graphs where nodes are amino acid residues (with features like residue type, secondary structure, and torsion angles). Edges connect residues that are close in 3D space (within 8-10 angstroms) or sequential in the chain. Edge features include distance, relative orientation, and backbone geometry.

What role do GNNs play in protein structure prediction?

GNNs are used in structure prediction (predicting 3D coordinates from sequence), structure refinement (improving predicted structures), function prediction (predicting protein function from structure), and design (generating new protein sequences for desired structures). AlphaFold2 uses a form of graph attention (Evoformer) over residue pairs.

How does AlphaFold relate to GNNs?

AlphaFold2's Evoformer module performs attention over pairs of residues, which is equivalent to message passing on a fully-connected residue graph. The Structure Module then refines 3D coordinates using equivariant updates. While not a standard GNN architecture, it applies the same principles of neighborhood aggregation and geometric awareness.

Learn more about graph ML

PyTorch Geometric is the open-source foundation for graph neural networks. Explore more layers, concepts, and production patterns.