
PPI: The Benchmark That Proved Attention Matters for GNNs

PPI is a collection of 24 protein-protein interaction graphs (20 train, 2 val, 2 test) with 121 multi-label targets. It is an inductive benchmark: the model must generalize to entirely unseen graphs. GAT's 97.3% F1 on PPI (vs GraphSAGE's 61.2%) was the result that established attention as essential for GNNs.


TL;DR

  • PPI has 24 protein-protein interaction graphs (20 train, 2 val, 2 test) with 50-dimensional node features and 121 binary labels per node. The task is inductive multi-label node classification.
  • Inductive setting: train on 20 graphs, validate on 2, test on 2 separate graphs. The model must generalize to proteins and interactions it has never seen.
  • GAT achieved 97.3% micro-F1 vs GraphSAGE's 61.2%. This 36-point gap proved that attention mechanisms are critical when different neighbors carry different label signals.
  • PPI's multi-label, multi-graph structure mirrors production biological networks where proteins have multiple functions and interaction networks vary across tissues.

At a glance: 24 graphs · 50 features · 121 labels (multi-label) · inductive task

What PPI contains

PPI is a collection of 24 protein-protein interaction (PPI) graphs representing different human tissues. Each graph captures which proteins interact in a specific tissue context. Nodes are proteins with 50-dimensional features derived from gene sets, immunological signatures, and positional gene information. Each node has 121 binary labels from gene ontology functional sets. The task is to predict all 121 functions for each protein.

The inductive setup is critical: the 24 graphs are split into 20 for training, 2 for validation, and 2 for testing. The test graphs are entirely separate from training. The model cannot memorize node identities or graph structure -- it must learn generalizable patterns about how protein interactions predict function.

Why PPI matters

PPI was the benchmark that established attention mechanisms as essential for GNNs. When Velickovic et al. introduced GAT (ICLR 2018), they reported 97.3% micro-F1 on PPI versus GraphSAGE's 61.2%. The 36-point gap was striking and had a clear explanation: PPI's 121 labels mean that different neighbors are informative for different labels. A protein shares DNA-binding function with one neighbor and catalytic function with another. GAT's learned attention weights can selectively aggregate the right neighbors for each prediction; GraphSAGE's fixed aggregation cannot.
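A toy numeric sketch of this intuition (hypothetical numbers, not the actual GAT computation, where attention is per edge and per head rather than literally per label): when two neighbors each carry the signal for a different label, fixed mean aggregation dilutes both, while learned weights can preserve the relevant one.

```python
# Toy example: a node has two neighbors, each informative for a different label.
neighbor_feats = [
    [1.0, 0.0],  # neighbor A: strong signal for label 0 (e.g. DNA binding)
    [0.0, 1.0],  # neighbor B: strong signal for label 1 (e.g. catalysis)
]

# Fixed mean aggregation (GraphSAGE-style) blends both signals equally.
mean_agg = [sum(col) / len(neighbor_feats) for col in zip(*neighbor_feats)]

# Learned attention weights can emphasize the neighbor that matters
# for the prediction at hand (weights are illustrative, not learned here).
attn = [0.9, 0.1]  # attend mostly to neighbor A
weighted = [
    sum(w * f for w, f in zip(attn, col))
    for col in zip(*neighbor_feats)
]

print(mean_agg)   # [0.5, 0.5] -- label-0 signal diluted
print(weighted)   # [0.9, 0.1] -- label-0 signal preserved
```

The mean collapses both neighbors to the same blended vector regardless of which label is being predicted; a weighted sum with per-edge coefficients does not.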

PPI also popularized inductive evaluation. Before PPI, most GNN benchmarks were transductive (train and test on the same graph). PPI showed that GNNs can generalize across graphs, which is essential for production deployment where new data arrives continuously.

Loading PPI in PyG

load_ppi.py
from torch_geometric.datasets import PPI
from torch_geometric.loader import DataLoader

train_dataset = PPI(root='/tmp/PPI', split='train')  # 20 graphs (24 total)
val_dataset = PPI(root='/tmp/PPI', split='val')      # 2 graphs
test_dataset = PPI(root='/tmp/PPI', split='test')    # 2 graphs

print(f"Train graphs: {len(train_dataset)}")  # 20
print(f"Labels per node: {train_dataset[0].y.shape[1]}")  # 121

train_loader = DataLoader(train_dataset, batch_size=2)

Separate train/val/test graphs. Use BCEWithLogitsLoss for multi-label classification.
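For intuition on why BCEWithLogitsLoss fits this task: it treats each of the 121 labels as an independent binary problem and averages the binary cross-entropy over them, computed from raw logits. A minimal pure-Python sketch of that per-node loss (illustrative only; in practice use torch.nn.BCEWithLogitsLoss, which is numerically stabler via a log-sum-exp formulation):

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over independent labels, from raw logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-z))  # sigmoid turns a logit into a probability
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(logits)

# One node's raw scores for 4 (of PPI's 121) labels, and its binary targets.
logits = [2.0, -1.5, 0.3, -3.0]
targets = [1, 0, 1, 0]
loss = bce_with_logits(logits, targets)
print(round(loss, 4))
```

Each label contributes its own sigmoid + cross-entropy term, which is exactly why a softmax-based loss (which forces labels to compete) is wrong for multi-label prediction.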

Common tasks and benchmarks

Inductive multi-label node classification with micro-averaged F1. GCN: ~50% (poor without attention). GraphSAGE: ~61.2%. GAT: ~97.3%. GATv2: ~97.8%. The attention-based methods dominate because selective neighbor aggregation is essential for multi-label tasks where different neighbors carry different label signals.
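Micro-averaged F1 pools true positives, false positives, and false negatives across every (node, label) pair before computing a single F1 value. A minimal pure-Python sketch with toy data (in practice use sklearn.metrics.f1_score with average='micro'):

```python
def micro_f1(y_true, y_pred):
    """Micro-F1: aggregate TP/FP/FN over every (node, label) pair."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            tp += t and p            # predicted 1, actually 1
            fp += (not t) and p      # predicted 1, actually 0
            fn += t and (not p)      # predicted 0, actually 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# 2 nodes x 3 labels (toy stand-in for PPI's nodes x 121 labels)
y_true = [[1, 0, 1], [0, 1, 1]]
y_pred = [[1, 0, 0], [0, 1, 1]]
print(micro_f1(y_true, y_pred))  # 0.857... (tp=3, fp=0, fn=1)
```

Micro-averaging weights every node-label decision equally, so frequent labels dominate; this is the standard choice when, as in PPI, all 121 labels matter jointly per node.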

Example: drug target identification

Pharmaceutical companies use PPI networks to identify drug targets. If a disease-associated protein interacts with a druggable protein, targeting that interaction could treat the disease. GNNs on PPI graphs predict which proteins have which functions (the 121 labels), helping prioritize which interactions to investigate. This graph-based target identification has contributed to real drug discovery programs at companies like AstraZeneca and Novartis.

Published benchmark results

Micro-averaged F1 score on the PPI test split (2 held-out graphs). Higher is better.

Method          Micro-F1 (%)   Year   Paper
GAT             97.3           2018   Velickovic et al.
GATv2           ~97.8          2022   Brody et al.
JK-Net          ~97.6          2018   Xu et al.
GraphSAGE       61.2           2017   Hamilton et al.
GCN (3-layer)   ~50            2017   Kipf & Welling

Original Paper

Predicting Multicellular Function through Multi-layer Tissue Networks

Marinka Zitnik, Jure Leskovec (2017). Bioinformatics, 33(14), i190-i198


Original data source

The PPI dataset was introduced by Hamilton et al. (2017) using data from the Molecular Signatures Database and the BioGRID protein interaction repository. The processed version used by PyG is available from the GraphSAGE project page.

cite_ppi.bib
@inproceedings{hamilton2017inductive,
  title={Inductive Representation Learning on Large Graphs},
  author={Hamilton, William L and Ying, Rex and Leskovec, Jure},
  booktitle={NeurIPS},
  pages={1024--1034},
  year={2017}
}

BibTeX citation for the PPI benchmark as used in the GraphSAGE paper.

Which dataset should I use?

PPI vs Cora: PPI is inductive (separate train/test graphs), multi-label (121 targets), and multi-graph (24 graphs). Cora is transductive, single-label, single-graph. Use PPI to test inductive generalization; use Cora for transductive baselines.

PPI vs Yelp: Both are multi-label node classification. PPI has 24 small graphs (~2,000 nodes each); Yelp is a single 716K-node graph. PPI tests inductive learning; Yelp tests transductive scaling.

PPI vs PROTEINS: PPI is node classification on interaction networks. PROTEINS is graph classification on protein structures. Different tasks on different biological data.

From benchmark to production

Production PPI analysis uses tissue-specific interaction networks, disease-context interactions, drug-target edges, and temporal expression data. The graphs are larger (human PPI networks have ~20K proteins), the labels are richer (disease associations, druggability scores), and the task includes link prediction (which proteins will interact?) alongside node classification.

Frequently asked questions

What is the PPI dataset?

PPI is a collection of 24 protein-protein interaction graphs from different human tissues (20 train, 2 val, 2 test). Nodes are proteins with 50-dimensional features (gene sets, immunological signatures, positional information). Each node has 121 binary labels (gene ontology sets). The task is inductive multi-label node classification.

What makes PPI an inductive benchmark?

PPI has 24 total graphs split into separate sets for training (20 graphs), validation (2 graphs), and testing (2 graphs). The model must generalize to entirely unseen graphs, not just unseen nodes within a known graph. This inductive setting is harder and more realistic than transductive benchmarks like Cora.

How do I load PPI in PyTorch Geometric?

Use `from torch_geometric.datasets import PPI; train_dataset = PPI(root='/tmp/PPI', split='train')`. Load train/val/test splits separately. Use DataLoader to batch multiple graphs.

What metric is used for PPI?

Micro-averaged F1 score, standard for multi-label classification. GraphSAGE achieves ~61.2% F1, GAT ~97.3% F1 (a landmark result from the GAT paper). The huge gap shows attention mechanisms are critical for PPI's multi-label protein function prediction.

Why did GAT perform so much better than GraphSAGE on PPI?

PPI's multi-label task requires the model to attend to different neighbors for different labels. A protein might share one function with neighbor A and another with neighbor B. GAT's per-edge attention weights enable this selective aggregation; GraphSAGE's fixed aggregation cannot.

Learn more about graph ML

PyTorch Geometric is the open-source foundation for graph neural networks. Explore more layers, concepts, and production patterns.