Transfer learning applies GNN knowledge learned on one graph to improve performance on a different graph. A GNN pre-trained on 2 million molecules from ChEMBL learns general chemical patterns: what ring structures mean, how functional groups affect properties, how molecular size relates to solubility. When fine-tuned on your proprietary dataset of 500 drug candidates, this pre-trained knowledge gives a massive head start compared to training from scratch.
This is the same principle that makes ImageNet pre-training valuable for medical imaging and BERT pre-training valuable for legal text. The key question is: what transfers between graphs?
What transfers
- Aggregation patterns: how to combine neighbor information effectively (transfers across all graph types)
- Structural motifs: triangles, rings, cliques, stars, and their significance (transfers within domains)
- Feature interactions: how node features combine with neighborhood structure (transfers between similar feature spaces)
- Scale patterns: how graph size, density, and degree distributions correlate with predictions (transfers broadly)
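The first point can be made concrete with a minimal sketch in plain PyTorch (hypothetical toy graphs, no PyG dependency): the learned weights of a mean-aggregation message-passing layer are graph-agnostic, so the same parameters apply unchanged to any graph whose node features share the input dimension.

```python
import torch

torch.manual_seed(0)
W = torch.nn.Linear(8, 16)  # learned once, e.g. on a large source graph

def mean_aggregate_layer(x, edge_index):
    """One mean-aggregation step: h_i = relu(W * mean over neighbors j of x_j)."""
    n = x.size(0)
    src, dst = edge_index
    agg = torch.zeros(n, x.size(1))
    agg.index_add_(0, dst, x[src])                               # sum neighbor features
    deg = torch.zeros(n).index_add_(0, dst, torch.ones(src.size(0)))
    agg = agg / deg.clamp(min=1).unsqueeze(1)                    # mean over neighbors
    return W(agg).relu()

# Two structurally different graphs, same feature dimension
triangle = torch.tensor([[0, 1, 2], [1, 2, 0]])                  # 3-cycle
star = torch.tensor([[1, 2, 3, 4], [0, 0, 0, 0]])                # hub with 4 leaves

h_tri = mean_aggregate_layer(torch.randn(3, 8), triangle)        # shape (3, 16)
h_star = mean_aggregate_layer(torch.randn(5, 8), star)           # shape (5, 16)
```

Nothing in `W` depends on the edge structure, which is why aggregation patterns transfer across graph types while feature-dependent knowledge does not.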
Three transfer approaches
```python
import torch
from torch_geometric.nn import GCNConv

# Pre-trained encoder (from a large source dataset);
# load_pretrained_gnn and FineTuneModel are illustrative helpers
pretrained_encoder = load_pretrained_gnn('molecular_gnn.pt')
hidden_dim, num_target_classes = 64, 2

# Approach 1: Fine-tuning (update everything)
model = FineTuneModel(pretrained_encoder, num_target_classes=2)
# Train with a small learning rate (1e-4) on target data;
# all encoder weights update gradually

# Approach 2: Feature extraction (freeze encoder)
for param in pretrained_encoder.parameters():
    param.requires_grad = False
classifier = torch.nn.Linear(hidden_dim, num_target_classes)
# Only train the classifier on target data

# Approach 3: Adapter (freeze most, add small trainable modules)
class Adapter(torch.nn.Module):
    def __init__(self, hidden_dim, bottleneck=16):
        super().__init__()
        self.down = torch.nn.Linear(hidden_dim, bottleneck)
        self.up = torch.nn.Linear(bottleneck, hidden_dim)

    def forward(self, x):
        # Residual bottleneck: only the down/up projections are trained
        return x + self.up(self.down(x).relu())

# Insert an adapter after each frozen layer
```

Three transfer strategies: fine-tuning is the default; feature extraction suits very small target sets; adapters enable efficient multi-task transfer.
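The fine-tuning loop for approach 1 can be sketched as follows, assuming the model splits into a pre-trained encoder and a fresh task head (stand-in `Linear` modules here, since the real encoder depends on your checkpoint). Optimizer parameter groups give the encoder a smaller learning rate than the new head, so the pre-trained weights drift slowly:

```python
import torch

encoder = torch.nn.Linear(8, 16)   # stand-in for the pre-trained GNN encoder
head = torch.nn.Linear(16, 2)      # new task head, randomly initialized

optimizer = torch.optim.Adam([
    {"params": encoder.parameters(), "lr": 1e-4},   # gentle updates to pre-trained weights
    {"params": head.parameters(), "lr": 1e-3},      # the new head learns faster
])

x, y = torch.randn(32, 8), torch.randint(0, 2, (32,))
for _ in range(5):                 # a few fine-tuning steps on target data
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(head(encoder(x).relu()), y)
    loss.backward()
    optimizer.step()
```

Setting the encoder's learning rate to zero in the first group recovers approach 2 (feature extraction) without touching `requires_grad`.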
Enterprise example: cross-company fraud patterns
A fintech startup has 100,000 transactions and 200 confirmed fraud cases. Training a GNN from scratch on this small dataset overfits quickly. Transfer learning:
- Pre-train a GNN on a large public transaction dataset (or a synthetic dataset with known fraud patterns)
- The model learns general fraud patterns: unusual degree distributions, temporal velocity anomalies, fan-out/fan-in structures
- Fine-tune on the startup's 100,000 transactions with 200 labels
- The transferred model achieves 85% AUROC vs 65% for training from scratch
The improvement comes from the pre-trained model already understanding what “suspicious graph structure” looks like. Fine-tuning just adapts this understanding to the specific transaction patterns of the startup.
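One practical detail hidden in the fine-tuning step: with 200 fraud cases in 100,000 transactions, an unweighted loss lets the model score well by predicting "legitimate" everywhere. A sketch of one common fix, weighting the rare class via `pos_weight` in `BCEWithLogitsLoss` (batch contents here are made up for illustration):

```python
import torch

n_legit, n_fraud = 99_800, 200
pos_weight = torch.tensor([n_legit / n_fraud])   # ~499x weight on the fraud class

criterion = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)
logits = torch.randn(16)                  # scores from the fine-tuned GNN head
labels = torch.zeros(16)
labels[0] = 1.0                           # one fraud case in this batch
loss = criterion(logits, labels)          # fraud errors now dominate the gradient
```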
When transfer fails
- Domain mismatch: transferring molecular patterns to social networks. The structural patterns are too different.
- Feature mismatch: source and target features have completely different semantics. Transfer can still work if you re-learn the input mapping while keeping the structural layers (discard the feature knowledge, keep the structure knowledge).
- Scale mismatch: pre-training on small graphs, deploying on enormous ones. The model may not have learned patterns relevant to large-scale structure.
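For the feature-mismatch case, a common workaround is to keep the frozen message-passing layers (the structural knowledge) and train only a new input projection that maps the target's feature space into the dimension the encoder expects. A minimal sketch, with all names and dimensions hypothetical:

```python
import torch

encoder_in_dim = 32                 # what the pre-trained encoder expects
target_feat_dim = 7                 # the target graph's unrelated feature space

# New, trainable projection; the pre-trained encoder stays frozen, so only
# this layer (and the task head) learn on target data, and the transferred
# structural patterns survive the change in feature semantics.
new_input_proj = torch.nn.Linear(target_feat_dim, encoder_in_dim)

x_target = torch.randn(10, target_feat_dim)
x_projected = new_input_proj(x_target)   # now compatible with the frozen encoder
```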