
Knowledge Graph Completion: Predicting Missing Facts in Knowledge Bases

Every knowledge graph is incomplete. Wikipedia, Freebase, enterprise product catalogs: they all have missing facts. Knowledge graph completion predicts these missing links by learning patterns in existing entity-relation-entity triples.

TL;DR

  1. Knowledge graphs store facts as (head, relation, tail) triples: (Einstein, bornIn, Ulm). They are always incomplete because no curation process captures every fact. Completion predicts the missing triples.
  2. Embedding methods (TransE, RotatE, ComplEx) learn vectors for entities and relations. TransE models relations as translations: head + relation should equal tail. RotatE models relations as rotations in complex space.
  3. GNN approaches (R-GCN, CompGCN) learn entity embeddings through message passing over the knowledge graph. Each entity's embedding incorporates its multi-hop neighborhood, enabling better generalization to new entities.
  4. The scoring function determines how triples are ranked. Distance-based (TransE), rotation-based (RotatE), and bilinear (ComplEx) scoring functions capture different relational patterns: symmetry, composition, inversion.
  5. Enterprise applications: completing product knowledge graphs (missing attributes), customer knowledge graphs (inferred preferences), and drug knowledge graphs (predicted drug-gene interactions).

Knowledge graph completion predicts missing facts in knowledge bases by learning patterns from existing entity-relation-entity triples. A knowledge graph stores structured facts as triples: (Albert Einstein, bornIn, Ulm), (Ulm, locatedIn, Germany), (Einstein, field, Physics). No knowledge graph is complete. Freebase has millions of entities but most have only a few facts. Enterprise product catalogs have thousands of missing attributes. Completion models predict these gaps.

The link prediction formulation

Given a knowledge graph with entities E and relations R, knowledge graph completion is a link prediction task: given (head, relation, ?), predict the missing tail entity. Or given (?, relation, tail), predict the missing head.

The model learns to score every possible triple. For the query (Einstein, nationality, ?), it should rank “German” higher than “French” or “Brazilian”. Training uses existing triples as positive examples and corrupted triples (replacing head or tail with random entities) as negatives.
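The corruption step above is simple to sketch. Here is a minimal, framework-free version of negative sampling by head/tail replacement; the `corrupt` helper and the example entity/relation ids are illustrative, not part of any library API:

```python
import random

def corrupt(triple, num_entities, corrupt_tail=True):
    """Replace the head or tail of a (head, rel, tail) triple
    with a random entity id to create a negative example."""
    h, r, t = triple
    e = random.randrange(num_entities)
    return (h, r, e) if corrupt_tail else (e, r, t)

random.seed(0)
pos = (0, 5, 100)                       # a known-true triple
neg = corrupt(pos, num_entities=14541)  # tail replaced at random
```

Note that a randomly sampled entity can occasionally reproduce a true triple; practical implementations often filter such "false negatives" against the training set.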

Embedding-based methods

The first generation of knowledge graph completion methods learns a static embedding vector for every entity and relation:

TransE: relations as translations

TransE models each relation as a translation vector. For a valid triple (h, r, t), the model enforces: h + r is approximately equal to t in embedding space. If Einstein's embedding plus the “bornIn” vector lands near Ulm's embedding, the triple is scored highly.

RotatE: relations as rotations

RotatE uses complex-valued embeddings and models relations as rotations. This handles patterns that TransE cannot: symmetric relations (married-to), inverse relations (bornIn/birthplaceOf), and composed relations (bornIn + locatedIn = nationality).

knowledge_graph_embedding.py
import torch
from torch_geometric.nn import TransE, RotatE

# TransE: h + r ≈ t
model = TransE(
    num_nodes=14541,      # entities in FB15k-237
    num_relations=237,     # relation types
    hidden_channels=256,
)

# Score a batch of triples
head_index = torch.tensor([0, 1, 2])
rel_type = torch.tensor([5, 12, 5])
tail_index = torch.tensor([100, 200, 300])
score = model(head_index, rel_type, tail_index)
# PyG returns the negative translation distance,
# so a higher (less negative) score = more plausible triple

PyG provides implementations of TransE, RotatE, and other knowledge graph embedding methods with training utilities.

GNN-based methods

Embedding methods assign a fixed vector to each entity regardless of context. GNN-based methods improve on this by computing entity embeddings through message passing over the knowledge graph structure:

  • R-GCN (Relational Graph Convolutional Network): uses relation-specific weight matrices in message passing. Each relation type has its own transformation, allowing the model to learn different propagation patterns for different relation types.
  • CompGCN: jointly embeds entities and relations during message passing. The relation embedding modifies how messages are computed, enabling composition of relations across multiple hops.

Relational patterns and scoring functions

Different scoring functions capture different relational patterns:

  • Symmetry (married-to): if (A, r, B) then (B, r, A). RotatE handles this with 180-degree rotation; TransE cannot.
  • Antisymmetry (parent-of): if (A, r, B) then NOT (B, r, A). TransE handles this naturally.
  • Inversion (bornIn/birthplaceOf): r1 is the inverse of r2. RotatE models this as opposite rotations.
  • Composition (bornIn + locatedIn = nationality): RotatE composes rotations; ComplEx handles this through bilinear scoring.
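The symmetry argument can be checked with a toy one-dimensional version of RotatE's scoring (real RotatE applies an element-wise rotation per embedding dimension; this sketch uses a single complex number). A 180-degree rotation is its own inverse, so the same relation vector maps h to t and t back to h:

```python
import cmath

def rotate_score(h, r, t):
    """Toy RotatE score: how close does rotating h by r land to t?
    0 is a perfect match; more negative is less plausible."""
    return -abs(h * r - t)

r = cmath.exp(1j * cmath.pi)   # 180-degree rotation, so r * r = 1
h = 1 + 2j
t = h * r                      # construct a valid triple (h, r, t)

forward = rotate_score(h, r, t)    # scores (A, r, B)
backward = rotate_score(t, r, h)   # scores (B, r, A)
```

Both directions score identically (and perfectly here), which is exactly the symmetric pattern. TransE would need r = -r, i.e. r = 0, to achieve the same, collapsing the relation to the identity.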

Enterprise applications

Knowledge graph completion has immediate business value:

  • Product knowledge graphs: an e-commerce catalog with 10 million products where 60% have missing attributes. Completion predicts (ProductX, hasBrand, ?), (ProductX, compatibleWith, ?), and (ProductX, inCategory, ?).
  • Customer knowledge graphs: inferring customer preferences from partial interaction data. If a customer bought running shoes and a fitness tracker, predict (Customer, interestedIn, Marathon Training).
  • Drug discovery: predicting drug-gene-disease interactions. If DrugA targets GeneB and GeneB is implicated in DiseaseC, predict (DrugA, treats, DiseaseC).
  • Internal knowledge management: connecting employees to skills, projects, and documents. Predict (Employee, expertIn, ?) to route questions to the right expert.

Frequently asked questions

What is knowledge graph completion?

Knowledge graph completion is the task of predicting missing triples (head, relation, tail) in a knowledge graph. For example, given that (Einstein, bornIn, Ulm) and (Ulm, locatedIn, Germany) exist, the model should predict (Einstein, nationality, German). The knowledge graph is always incomplete because no manual curation process can capture every fact.

What is the difference between knowledge graph embedding and GNN approaches?

Embedding methods (TransE, RotatE, ComplEx) learn static vectors for entities and relations, scoring triples by geometric operations (translation, rotation). GNN approaches (CompGCN, R-GCN) learn entity embeddings through message passing over the knowledge graph structure, incorporating multi-hop neighborhood information. GNN approaches generally perform better on inductive settings (new entities) while embedding methods are faster to train.

How is knowledge graph completion used in enterprises?

Enterprises use knowledge graph completion for product knowledge graphs (predicting missing product attributes, categories, and relationships), customer knowledge graphs (inferring customer preferences from partial interaction data), drug knowledge graphs (predicting drug-gene-disease interactions), and internal knowledge management (connecting documents, experts, projects, and skills).

What are the standard benchmarks for knowledge graph completion?

The standard benchmarks are FB15k-237 (derived from Freebase, 14,541 entities, 237 relations, 310K triples), WN18RR (derived from WordNet, 40,943 entities, 11 relations, 93K triples), and YAGO3-10 (derived from YAGO, 123,182 entities, 37 relations, 1.1M triples). Models are evaluated by Mean Reciprocal Rank (MRR) and Hits@K (fraction of correct entities ranked in top K).
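Both metrics are straightforward to compute from the rank of the correct entity per query. A minimal sketch (the `mrr_and_hits` helper and the example ranks are illustrative, not from any benchmark toolkit):

```python
def mrr_and_hits(ranks, k=10):
    """ranks[i] = 1-based rank of the correct entity for query i."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = sum(r <= k for r in ranks) / len(ranks)
    return mrr, hits

mrr, hits10 = mrr_and_hits([1, 2, 10, 50], k=10)
# mrr = (1 + 0.5 + 0.1 + 0.02) / 4 = 0.405; hits10 = 3/4 = 0.75
```

In practice, benchmark evaluation uses the "filtered" setting: when ranking candidates for a query, all other known-true triples are removed so they cannot push the correct answer down the ranking.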

Learn more about graph ML

PyTorch Geometric is the open-source foundation for graph neural networks. Explore more layers, concepts, and production patterns.