Graphs: 24 · Features: 50 · Labels: 121 (multi-label) · Task: Inductive
What PPI contains
PPI is a collection of 24 protein-protein interaction (PPI) graphs representing different human tissues. Each graph captures which proteins interact in a specific tissue context. Nodes are proteins with 50-dimensional features derived from gene sets, immunological signatures, and positional gene information. Each node has 121 binary labels from gene ontology functional sets. The task is to predict all 121 functions for each protein.
The inductive setup is critical: the 24 graphs are split into 20 for training, 2 for validation, and 2 for testing. The test graphs are entirely disjoint from the training graphs, so the model cannot memorize node identities or graph structure; it must learn generalizable patterns about how protein interactions predict function.
Why PPI matters
PPI was the benchmark that established attention mechanisms as essential for GNNs. When Velickovic et al. introduced GAT in 2017, they showed 97.3% micro-F1 on PPI versus GraphSAGE's 61.2%. The 36-point gap was striking and had a clear explanation: PPI's 121 labels mean that different neighbors are informative for different labels. A protein shares DNA-binding function with one neighbor and catalytic function with another. GAT's learned attention weights can selectively aggregate the right neighbors for each prediction. GraphSAGE's fixed aggregation cannot.
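The intuition can be sketched numerically. With a fixed mean aggregator, every neighbor contributes equally and a label-specific signal gets diluted; attention weights let the aggregation emphasize the neighbor that carries the relevant signal. A toy numpy sketch with made-up feature vectors and hand-picked (not learned) attention weights:

```python
import numpy as np

# Toy features for three neighbors of one protein (hypothetical values).
# Suppose neighbor 0 carries a "DNA-binding" signal (dim 0) and
# neighbor 2 carries a "catalytic" signal (dim 1).
neighbors = np.array([
    [1.0, 0.0],   # strong DNA-binding signal
    [0.1, 0.1],   # uninformative neighbor
    [0.0, 1.0],   # strong catalytic signal
])

# Fixed mean aggregation (GraphSAGE-style): one blended message.
mean_agg = neighbors.mean(axis=0)        # [0.367, 0.367] -- signals diluted

# Attention aggregation (GAT-style): weights hand-picked here to mimic
# what learned attention could do for the DNA-binding label.
alpha_dna = np.array([0.8, 0.1, 0.1])    # attends mostly to neighbor 0
att_agg_dna = alpha_dna @ neighbors      # [0.81, 0.11] -- signal preserved

print(mean_agg)
print(att_agg_dna)
```

A different label (e.g. catalytic function) would get a different attention vector over the same neighbors, which is exactly the flexibility a fixed aggregator lacks.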
PPI also normalized inductive evaluation. Before PPI, most GNN benchmarks were transductive (train and test on the same graph). PPI showed that GNNs can generalize across graphs, which is essential for production deployment where new data arrives continuously.
Loading PPI in PyG
```python
from torch_geometric.datasets import PPI
from torch_geometric.loader import DataLoader

train_dataset = PPI(root='/tmp/PPI', split='train')  # 20 of the 24 graphs
val_dataset = PPI(root='/tmp/PPI', split='val')      # 2 graphs
test_dataset = PPI(root='/tmp/PPI', split='test')    # 2 graphs

print(f"Train graphs: {len(train_dataset)}")              # 20
print(f"Labels per node: {train_dataset[0].y.shape[1]}")  # 121

train_loader = DataLoader(train_dataset, batch_size=2)
```

Note the separate train/val/test graphs. Use BCEWithLogitsLoss for multi-label classification.
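Multi-label here means 121 independent binary decisions per node, which is why the objective is a sigmoid per label plus binary cross-entropy rather than a softmax over classes. A numpy sketch of what BCEWithLogitsLoss computes, on made-up logits and a 4-label stand-in for PPI's 121 labels:

```python
import numpy as np

def bce_with_logits(logits, targets):
    """Numerically stable binary cross-entropy on raw logits, averaged
    over all (node, label) cells -- the default reduction in PyTorch's
    BCEWithLogitsLoss."""
    # max(x, 0) - x*z + log(1 + exp(-|x|)) avoids overflow for large |x|
    return np.mean(
        np.maximum(logits, 0) - logits * targets
        + np.log1p(np.exp(-np.abs(logits)))
    )

# 2 nodes x 4 labels (made-up values)
logits = np.array([[ 2.0, -1.0, 0.5, -3.0],
                   [-0.5,  1.5, 2.0,  0.0]])
targets = np.array([[1., 0., 1., 0.],
                    [0., 1., 1., 0.]])

print(bce_with_logits(logits, targets))
```

Each label contributes its own loss term, so a protein can be penalized for missing one function while being rewarded for correctly predicting the other 120.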
Common tasks and benchmarks
Inductive multi-label node classification with micro-averaged F1. GCN: ~50% (poor without attention). GraphSAGE: ~61.2%. GAT: ~97.3%. GATv2: ~97.8%. The attention-based methods dominate because selective neighbor aggregation is essential for multi-label tasks where different neighbors carry different label signals.
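Micro-averaged F1 pools true positives, false positives, and false negatives across every (node, label) cell before computing precision and recall, so frequent labels dominate the score. A minimal sketch on hypothetical predictions:

```python
import numpy as np

def micro_f1(pred, target):
    """Micro-averaged F1 for binary multi-label arrays: pool TP/FP/FN
    over all (node, label) cells, then compute a single F1."""
    tp = np.sum((pred == 1) & (target == 1))
    fp = np.sum((pred == 1) & (target == 0))
    fn = np.sum((pred == 0) & (target == 1))
    return 2 * tp / (2 * tp + fp + fn)

# Toy example: 3 nodes x 4 labels (PPI uses 121 labels)
pred   = np.array([[1, 0, 1, 0],
                   [0, 1, 1, 0],
                   [1, 1, 0, 0]])
target = np.array([[1, 0, 1, 1],
                   [0, 1, 0, 0],
                   [1, 0, 0, 0]])

print(micro_f1(pred, target))  # TP=4, FP=2, FN=1 -> 8/11 ~= 0.727
```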
Example: drug target identification
Pharmaceutical companies use PPI networks to identify drug targets. If a disease-associated protein interacts with a druggable protein, targeting that interaction could treat the disease. GNNs on PPI graphs predict which proteins have which functions (the 121 labels), helping prioritize which interactions to investigate. This graph-based target identification has contributed to real drug discovery programs at companies like AstraZeneca and Novartis.
Published benchmark results
Micro-averaged F1 score on the PPI test split (2 held-out graphs). Higher is better.
| Method | Micro-F1 (%) | Year | Paper |
|---|---|---|---|
| GraphSAGE | 61.2 | 2017 | Hamilton et al. |
| GAT | 97.3 | 2018 | Velickovic et al. |
| GATv2 | ~97.8 | 2022 | Brody et al. |
| GCN (3-layer) | ~50 | 2017 | Kipf & Welling |
| JK-Net | ~97.6 | 2018 | Xu et al. |
Original Paper
Predicting Multicellular Function through Multi-layer Tissue Networks
Marinka Zitnik, Jure Leskovec (2017). Bioinformatics, 33(14), i190-i198
Original data source
The PPI dataset was introduced by Hamilton et al. (2017) using data from the Molecular Signatures Database and the BioGRID protein interaction repository. The processed version used by PyG is available from the GraphSAGE project page.
```bibtex
@inproceedings{hamilton2017inductive,
  title={Inductive Representation Learning on Large Graphs},
  author={Hamilton, William L and Ying, Rex and Leskovec, Jure},
  booktitle={NeurIPS},
  pages={1024--1034},
  year={2017}
}
```

BibTeX citation for the PPI benchmark as used in the GraphSAGE paper.
Which dataset should I use?
PPI vs Cora: PPI is inductive (separate train/test graphs), multi-label (121 targets), and multi-graph (24 graphs). Cora is transductive, single-label, single-graph. Use PPI to test inductive generalization; use Cora for transductive baselines.
PPI vs Yelp: Both are multi-label node classification. PPI has 24 small graphs (~2,000 nodes each); Yelp is a single 716K-node graph. PPI tests inductive learning; Yelp tests transductive scaling.
PPI vs PROTEINS: PPI is node classification on interaction networks. PROTEINS is graph classification on protein structures. Different tasks on different biological data.
From benchmark to production
Production PPI analysis uses tissue-specific interaction networks, disease-context interactions, drug-target edges, and temporal expression data. The graphs are larger (human PPI networks have ~20K proteins), the labels are richer (disease associations, druggability scores), and the task includes link prediction (which proteins will interact?) alongside node classification.
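The link-prediction side of that task is commonly reduced to scoring protein pairs by the similarity of their learned node embeddings, e.g. a dot product passed through a sigmoid. A hedged numpy sketch using random stand-in embeddings (in production these would come from a trained GNN):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in embeddings for 5 proteins (hypothetical; a GNN would produce these)
emb = rng.normal(size=(5, 16))

def interaction_score(u, v):
    """Dot-product decoder: sigmoid of embedding similarity, read as the
    predicted probability that proteins u and v interact."""
    return 1.0 / (1.0 + np.exp(-emb[u] @ emb[v]))

# Rank all candidate pairs by predicted interaction probability
pairs = [(u, v) for u in range(5) for v in range(u + 1, 5)]
ranked = sorted(pairs, key=lambda p: interaction_score(*p), reverse=True)
print(ranked[:3])  # top candidate interactions to investigate
```

Real pipelines replace the dot product with learned decoders and evaluate ranking quality (e.g. AUC or hits@k) against held-out interactions, but the shape of the computation is the same.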