
Knowledge Graph Completion: Predicting Missing Facts in Knowledge Bases

Every knowledge graph is incomplete. Wikipedia, Freebase, enterprise product catalogs: they all have missing facts. Knowledge graph completion predicts these missing links by learning patterns in existing entity-relation-entity triples.

TL;DR

  1. Knowledge graphs store facts as (head, relation, tail) triples: (Einstein, bornIn, Ulm). They are always incomplete because no curation process captures every fact. Completion predicts the missing triples.
  2. Embedding methods (TransE, RotatE, ComplEx) learn vectors for entities and relations. TransE models relations as translations: head + relation should equal tail. RotatE models relations as rotations in complex space.
  3. GNN approaches (R-GCN, CompGCN) learn entity embeddings through message passing over the knowledge graph. Each entity's embedding incorporates its multi-hop neighborhood, enabling better generalization to new entities.
  4. The scoring function determines how triples are ranked. Distance-based (TransE), rotation-based (RotatE), and bilinear (ComplEx) scoring functions capture different relational patterns: symmetry, composition, inversion.
  5. Enterprise applications: completing product knowledge graphs (missing attributes), customer knowledge graphs (inferred preferences), and drug knowledge graphs (predicted drug-gene interactions).

Knowledge graph completion predicts missing facts in knowledge bases by learning patterns from existing entity-relation-entity triples. A knowledge graph stores structured facts as triples: (Albert Einstein, bornIn, Ulm), (Ulm, locatedIn, Germany), (Einstein, field, Physics). No knowledge graph is complete. Freebase has millions of entities but most have only a few facts. Enterprise product catalogs have thousands of missing attributes. Completion models predict these gaps.

The link prediction formulation

Given a knowledge graph with entities E and relations R, knowledge graph completion is a link prediction task: given (head, relation, ?), predict the missing tail entity. Or given (?, relation, tail), predict the missing head.

The model learns to score every possible triple. For the query (Einstein, nationality, ?), it should rank “German” higher than “French” or “Brazilian”. Training uses existing triples as positive examples and corrupted triples (replacing head or tail with random entities) as negatives.
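The corruption step above is simple to sketch. Here is a minimal, framework-free version of negative sampling by head/tail replacement; the `corrupt` helper and the example entity/relation ids are illustrative, not part of any library API:

```python
import random

def corrupt(triple, num_entities, corrupt_tail=True):
    """Replace the head or tail of a (head, rel, tail) triple
    with a random entity id to create a negative example."""
    h, r, t = triple
    e = random.randrange(num_entities)
    return (h, r, e) if corrupt_tail else (e, r, t)

random.seed(0)
pos = (0, 5, 100)                       # a known-true triple
neg = corrupt(pos, num_entities=14541)  # tail replaced at random
```

Note that a randomly sampled entity can occasionally reproduce a true triple; practical implementations often filter such "false negatives" against the training set.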

Embedding-based methods

The first generation of knowledge graph completion methods learns a static embedding vector for every entity and relation:

TransE: relations as translations

TransE models each relation as a translation vector. For a valid triple (h, r, t), the model enforces: h + r is approximately equal to t in embedding space. If Einstein's embedding plus the “bornIn” vector lands near Ulm's embedding, the triple is scored highly.

RotatE: relations as rotations

RotatE uses complex-valued embeddings and models relations as rotations. This handles patterns that TransE cannot: symmetric relations (married-to), inverse relations (bornIn/birthplaceOf), and composed relations (bornIn + locatedIn = nationality).

knowledge_graph_embedding.py
import torch
from torch_geometric.nn import TransE, RotatE

# TransE: h + r ≈ t
model = TransE(
    num_nodes=14541,      # entities in FB15k-237
    num_relations=237,     # relation types
    hidden_channels=256,
)

# Score a batch of triples
head_index = torch.tensor([0, 1, 2])
rel_type = torch.tensor([5, 12, 5])
tail_index = torch.tensor([100, 200, 300])
score = model(head_index, rel_type, tail_index)
# PyG returns the negative translation distance,
# so a higher (less negative) score = more plausible triple

PyG provides implementations of TransE, RotatE, and other knowledge graph embedding methods with training utilities.

GNN-based methods

Embedding methods assign a fixed vector to each entity regardless of context. GNN-based methods improve on this by computing entity embeddings through message passing over the knowledge graph structure:

  • R-GCN (Relational Graph Convolutional Network): uses relation-specific weight matrices in message passing. Each relation type has its own transformation, allowing the model to learn different propagation patterns for different relation types.
  • CompGCN: jointly embeds entities and relations during message passing. The relation embedding modifies how messages are computed, enabling composition of relations across multiple hops.

Relational patterns and scoring functions

Different scoring functions capture different relational patterns:

  • Symmetry (married-to): if (A, r, B) then (B, r, A). RotatE handles this with 180-degree rotation; TransE cannot.
  • Antisymmetry (parent-of): if (A, r, B) then NOT (B, r, A). TransE handles this naturally.
  • Inversion (bornIn/birthplaceOf): r1 is the inverse of r2. RotatE models this as opposite rotations.
  • Composition (bornIn + locatedIn = nationality): RotatE composes rotations; ComplEx handles this through bilinear scoring.
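The symmetry argument can be checked with a toy one-dimensional version of RotatE's scoring (real RotatE applies an element-wise rotation per embedding dimension; this sketch uses a single complex number). A 180-degree rotation is its own inverse, so the same relation vector maps h to t and t back to h:

```python
import cmath

def rotate_score(h, r, t):
    """Toy RotatE score: how close does rotating h by r land to t?
    0 is a perfect match; more negative is less plausible."""
    return -abs(h * r - t)

r = cmath.exp(1j * cmath.pi)   # 180-degree rotation, so r * r = 1
h = 1 + 2j
t = h * r                      # construct a valid triple (h, r, t)

forward = rotate_score(h, r, t)    # scores (A, r, B)
backward = rotate_score(t, r, h)   # scores (B, r, A)
```

Both directions score identically (and perfectly here), which is exactly the symmetric pattern. TransE would need r = -r, i.e. r = 0, to achieve the same, collapsing the relation to the identity.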

Enterprise applications

Knowledge graph completion has immediate business value:

  • Product knowledge graphs: an e-commerce catalog with 10 million products where 60% have missing attributes. Completion predicts (ProductX, hasBrand, ?), (ProductX, compatibleWith, ?), and (ProductX, inCategory, ?).
  • Customer knowledge graphs: inferring customer preferences from partial interaction data. If a customer bought running shoes and a fitness tracker, predict (Customer, interestedIn, Marathon Training).
  • Drug discovery: predicting drug-gene-disease interactions. If DrugA targets GeneB and GeneB is implicated in DiseaseC, predict (DrugA, treats, DiseaseC).
  • Internal knowledge management: connecting employees to skills, projects, and documents. Predict (Employee, expertIn, ?) to route questions to the right expert.

Frequently asked questions

What is knowledge graph completion?

Knowledge graph completion is the task of predicting missing triples (head, relation, tail) in a knowledge graph. For example, given that (Einstein, bornIn, Ulm) and (Ulm, locatedIn, Germany) exist, the model should predict (Einstein, nationality, German). The knowledge graph is always incomplete because no manual curation process can capture every fact.

What is the difference between knowledge graph embedding and GNN approaches?

Embedding methods (TransE, RotatE, ComplEx) learn static vectors for entities and relations, scoring triples by geometric operations (translation, rotation). GNN approaches (CompGCN, R-GCN) learn entity embeddings through message passing over the knowledge graph structure, incorporating multi-hop neighborhood information. GNN approaches generally perform better on inductive settings (new entities) while embedding methods are faster to train.

How is knowledge graph completion used in enterprises?

Enterprises use knowledge graph completion for product knowledge graphs (predicting missing product attributes, categories, and relationships), customer knowledge graphs (inferring customer preferences from partial interaction data), drug knowledge graphs (predicting drug-gene-disease interactions), and internal knowledge management (connecting documents, experts, projects, and skills).

What are the standard benchmarks for knowledge graph completion?

The standard benchmarks are FB15k-237 (derived from Freebase, 14,541 entities, 237 relations, 310K triples), WN18RR (derived from WordNet, 40,943 entities, 11 relations, 93K triples), and YAGO3-10 (derived from YAGO, 123,182 entities, 37 relations, 1.1M triples). Models are evaluated by Mean Reciprocal Rank (MRR) and Hits@K (fraction of correct entities ranked in top K).
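Both metrics are straightforward to compute from the rank of the correct entity per query. A minimal sketch (the `mrr_and_hits` helper and the example ranks are illustrative, not from any benchmark toolkit):

```python
def mrr_and_hits(ranks, k=10):
    """ranks[i] = 1-based rank of the correct entity for query i."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = sum(r <= k for r in ranks) / len(ranks)
    return mrr, hits

mrr, hits10 = mrr_and_hits([1, 2, 10, 50], k=10)
# mrr = (1 + 0.5 + 0.1 + 0.02) / 4 = 0.405; hits10 = 3/4 = 0.75
```

In practice, benchmark evaluation uses the "filtered" setting: when ranking candidates for a query, all other known-true triples are removed so they cannot push the correct answer down the ranking.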

Learn more about graph ML

PyTorch Geometric is the open-source foundation for graph neural networks. Explore more layers, concepts, and production patterns.