A knowledge graph is a graph of entities connected by typed relationships. Facts are stored as (subject, predicate, object) triples: (Paris, capital_of, France), (Aspirin, treats, Headache), (Alice, works_at, Acme_Corp). Each entity is a node. Each relation type is a distinct kind of directed edge.
Knowledge graphs are the backbone of structured reasoning systems. Google's Knowledge Graph powers search results. Wikidata stores 100+ million entities. Enterprise knowledge graphs connect products to components to suppliers, drugs to targets to diseases, and customers to events to outcomes.
Structure of a knowledge graph
A knowledge graph has three elements:
- Entities: real-world objects (people, drugs, companies, genes)
- Relations: typed connections between entities (treats, works_at, inhibits)
- Triples: (subject, relation, object) facts that form the graph edges
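Concretely, a triple list maps directly onto the integer tensors that graph libraries such as PyTorch Geometric consume: an `edge_index` of source/target entity ids and an `edge_type` of relation ids. A minimal sketch (the triple list and the dict names are illustrative):

```python
import torch

triples = [
    ("Paris", "capital_of", "France"),
    ("Aspirin", "treats", "Headache"),
    ("Alice", "works_at", "Acme_Corp"),
]

# Assign integer ids to every distinct entity and relation
entities = sorted({e for s, _, o in triples for e in (s, o)})
relations = sorted({r for _, r, _ in triples})
ent2id = {e: i for i, e in enumerate(entities)}
rel2id = {r: i for i, r in enumerate(relations)}

# edge_index: [2, num_edges] subject/object ids; edge_type: one relation id per edge
edge_index = torch.tensor([[ent2id[s] for s, _, _ in triples],
                           [ent2id[o] for _, _, o in triples]])
edge_type = torch.tensor([rel2id[r] for _, r, _ in triples])
```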
Unlike a general heterogeneous graph, knowledge graphs emphasize the semantics of relations. The relation type carries meaning: “treats” is fundamentally different from “causes,” even when connecting the same entity types.
GNNs on knowledge graphs
There are two main approaches to applying neural networks to knowledge graphs:
Approach 1: Relation-aware GNNs (RGCNConv)
R-GCN applies different learned weight matrices for each relation type during message passing. A drug node receives messages from its “targets” through one set of weights and from its “side_effects” through another:
```python
import torch
from torch_geometric.nn import RGCNConv

class RGCN(torch.nn.Module):
    def __init__(self, num_entities, num_relations, hidden_dim=64):
        super().__init__()
        self.emb = torch.nn.Embedding(num_entities, hidden_dim)
        self.conv1 = RGCNConv(hidden_dim, hidden_dim, num_relations)
        self.conv2 = RGCNConv(hidden_dim, hidden_dim, num_relations)

    def forward(self, edge_index, edge_type):
        x = self.emb.weight
        x = self.conv1(x, edge_index, edge_type).relu()
        x = self.conv2(x, edge_index, edge_type)
        return x  # entity embeddings

# Score a triple (h, r, t), e.g. with a bilinear decoder:
# score = entity_emb[h] @ relation_matrix[r] @ entity_emb[t]
```
R-GCN learns relation-specific transformations: each relation type has its own weight matrix (or a shared decomposed basis, which keeps the parameter count manageable when there are many relations).
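Scoring is usually handled by a separate decoder on top of the entity embeddings. A common pairing with R-GCN is DistMult, which replaces the full relation matrix with a learned per-relation vector (a diagonal matrix). A sketch with random stand-in embeddings, trained jointly in practice:

```python
import torch

hidden_dim, num_relations = 16, 3
# Learned relation embeddings for the DistMult decoder; randomly
# initialized here for illustration only
relation_emb = torch.nn.Embedding(num_relations, hidden_dim)

def score(entity_emb, h, r, t):
    """DistMult score for triple (h, r, t): sum(e_h * w_r * e_t)."""
    return (entity_emb[h] * relation_emb.weight[r] * entity_emb[t]).sum()

entity_emb = torch.randn(5, hidden_dim)  # stand-in for R-GCN output
s = score(entity_emb, 0, 1, 2)  # scalar; higher means more plausible
```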
Approach 2: Knowledge graph embeddings
Embedding methods learn geometric relationships between entity vectors:
- TransE: models relations as translations. h + r should be close to t.
- DistMult: models relations as diagonal matrices. Score = h * r * t (element-wise).
- RotatE: models relations as rotations in complex space. Captures symmetry, inversion, and composition patterns.
- ComplEx: extends DistMult to complex space for asymmetric relations.
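The geometric intuitions above are only a few lines of tensor code each. A sketch of the TransE and RotatE scoring functions on random vectors (dimensions and initialization are illustrative; real models train these embeddings):

```python
import torch

dim = 8

# TransE: relation as a translation; smaller ||h + r - t|| => more plausible
h, r, t = torch.randn(dim), torch.randn(dim), torch.randn(dim)
transe_score = -torch.norm(h + r - t, p=1)

# RotatE: relation as an element-wise rotation in complex space;
# the relation embedding has unit modulus (a pure rotation)
hc = torch.randn(dim, dtype=torch.cfloat)
tc = torch.randn(dim, dtype=torch.cfloat)
phase = torch.randn(dim)                  # rotation angle per dimension
rc = torch.polar(torch.ones(dim), phase)  # unit-modulus complex vector
rotate_score = -torch.norm(hc * rc - tc, p=1)
```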
Enterprise example: drug discovery
A pharmaceutical company builds a biomedical knowledge graph:
- Entities: 50,000 drugs, 20,000 proteins, 10,000 diseases, 5,000 side effects
- Relations: treats, targets, inhibits, causes_side_effect, associated_with
- Triples: 2 million known facts from clinical databases
Link prediction on this graph answers: “Which existing drugs might treat Disease_X?” (drug repurposing), “Which proteins does Drug_Y likely target?” (mechanism discovery), and “What side effects might Drug_Z cause?” (safety prediction).
After training R-GCN on the known triples, the model ranks all possible (drug, treats, Disease_X) triples by score. Drugs that are structurally similar to known treatments and that target proteins associated with Disease_X will rank highest, even if no direct evidence exists yet.
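Training is typically framed as binary classification over triples, with negatives generated by corrupting the tail entity of known facts. A minimal sketch with a DistMult decoder over stand-in embedding tables (all names and sizes here are illustrative, not the pipeline's actual variables):

```python
import torch

num_entities, num_relations, dim, batch = 100, 5, 32, 64
entity_table = torch.nn.Embedding(num_entities, dim)    # stand-in for R-GCN output
relation_emb = torch.nn.Embedding(num_relations, dim)   # DistMult relation vectors

# Toy positive triples (h, r, t)
h = torch.randint(0, num_entities, (batch,))
r = torch.randint(0, num_relations, (batch,))
t = torch.randint(0, num_entities, (batch,))

def distmult(h, r, t):
    e = entity_table.weight
    return (e[h] * relation_emb.weight[r] * e[t]).sum(dim=-1)

opt = torch.optim.Adam(
    list(entity_table.parameters()) + list(relation_emb.parameters()), lr=0.01)
for step in range(10):
    neg_t = torch.randint(0, num_entities, t.shape)  # corrupted tails as negatives
    logits = torch.cat([distmult(h, r, t), distmult(h, r, neg_t)])
    labels = torch.cat([torch.ones(batch), torch.zeros(batch)])
    loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
```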
```python
# Score all drugs for treating a target disease.
# Assumes: a trained `model` (the R-GCN above), lookup dicts `entity_to_id`
# and `relation_to_id`, learned DistMult relation embeddings `relation_emb`,
# and `all_drugs`, the list of drug entity names.
disease_id = entity_to_id['Alzheimers']
relation_id = relation_to_id['treats']

# Get all entity embeddings from the trained R-GCN
entity_emb = model(edge_index, edge_type)

# Score each drug against the disease with the DistMult decoder
drug_ids = [entity_to_id[d] for d in all_drugs]
scores = []
for drug_id in drug_ids:
    h = entity_emb[drug_id]
    t = entity_emb[disease_id]
    score = (h * relation_emb[relation_id] * t).sum()
    scores.append(score.item())

# Rank drugs by score -> top candidates for repurposing
top_drugs = sorted(zip(all_drugs, scores), key=lambda x: -x[1])
```
Drug repurposing as link prediction: the model ranks drugs by their predicted likelihood of treating the target disease.
Benchmark datasets in PyG
PyG includes standard knowledge graph benchmarks:
- FB15k-237: 14,541 entities, 237 relations, 310,116 triples (Freebase subset)
- WN18RR: 40,943 entities, 11 relations, 93,003 triples (WordNet subset)
- YAGO3-10: 123,182 entities, 37 relations, 1,089,040 triples (YAGO subset)
These benchmarks evaluate link prediction: given a triple with one entity masked, rank all entities by predicted score. Metrics include Mean Reciprocal Rank (MRR) and Hits@10.
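Both metrics follow directly from the rank of the true entity among all scored candidates. A sketch on a tiny hand-made score matrix (the numbers are illustrative):

```python
import torch

# Scores of all candidate entities for each masked test triple (higher = better);
# true_ids holds the index of the correct entity for each row
scores = torch.tensor([[0.1, 0.9, 0.3],
                       [0.8, 0.2, 0.5]])
true_ids = torch.tensor([1, 2])

# rank = 1 + number of candidates scored strictly higher than the true entity
true_scores = scores.gather(1, true_ids.unsqueeze(1))
ranks = 1 + (scores > true_scores).sum(dim=1)   # -> ranks [1, 2]

mrr = (1.0 / ranks.float()).mean()          # Mean Reciprocal Rank: (1/1 + 1/2)/2
hits_at_10 = (ranks <= 10).float().mean()   # fraction of true entities in top 10
```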