A knowledge graph is a graph of entities connected by typed relationships. Facts are stored as (subject, predicate, object) triples: (Paris, capital_of, France), (Aspirin, treats, Headache), (Alice, works_at, Acme_Corp). Each entity is a node. Each relation type is a distinct kind of directed edge.
Knowledge graphs are the backbone of structured reasoning systems. Google's Knowledge Graph powers search results. Wikidata stores 100+ million entities. Enterprise knowledge graphs connect products to components to suppliers, drugs to targets to diseases, and customers to events to outcomes.
Structure of a knowledge graph
A knowledge graph has three elements:
- Entities: real-world objects (people, drugs, companies, genes)
- Relations: typed connections between entities (treats, works_at, inhibits)
- Triples: (subject, relation, object) facts that form the graph edges
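Concretely, a triple list maps directly onto the integer tensors that graph libraries such as PyTorch Geometric consume: an `edge_index` of source/target entity ids and an `edge_type` of relation ids. A minimal sketch (the triple list and the dict names are illustrative):

```python
import torch

triples = [
    ("Paris", "capital_of", "France"),
    ("Aspirin", "treats", "Headache"),
    ("Alice", "works_at", "Acme_Corp"),
]

# Assign integer ids to every distinct entity and relation
entities = sorted({e for s, _, o in triples for e in (s, o)})
relations = sorted({r for _, r, _ in triples})
ent2id = {e: i for i, e in enumerate(entities)}
rel2id = {r: i for i, r in enumerate(relations)}

# edge_index: [2, num_edges] subject/object ids; edge_type: one relation id per edge
edge_index = torch.tensor([[ent2id[s] for s, _, _ in triples],
                           [ent2id[o] for _, _, o in triples]])
edge_type = torch.tensor([rel2id[r] for _, r, _ in triples])
```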
Unlike a general heterogeneous graph, knowledge graphs emphasize the semantics of relations. The relation type carries meaning: “treats” is fundamentally different from “causes,” even when connecting the same entity types.
GNNs on knowledge graphs
There are two main approaches to applying neural networks to knowledge graphs:
Approach 1: Relation-aware GNNs (RGCNConv)
R-GCN applies different learned weight matrices for each relation type during message passing. A drug node receives messages from its “targets” through one set of weights and from its “side_effects” through another:
```python
import torch
from torch_geometric.nn import RGCNConv

class RGCN(torch.nn.Module):
    def __init__(self, num_entities, num_relations, hidden_dim=64):
        super().__init__()
        self.emb = torch.nn.Embedding(num_entities, hidden_dim)
        self.conv1 = RGCNConv(hidden_dim, hidden_dim, num_relations)
        self.conv2 = RGCNConv(hidden_dim, hidden_dim, num_relations)

    def forward(self, edge_index, edge_type):
        x = self.emb.weight
        x = self.conv1(x, edge_index, edge_type).relu()
        x = self.conv2(x, edge_index, edge_type)
        return x  # entity embeddings

# Score a triple (h, r, t), e.g. with a bilinear decoder:
# score = entity_emb[h] @ relation_matrix[r] @ entity_emb[t]
```
R-GCN learns relation-specific transformations: each relation type has its own weight matrix (or a shared decomposed basis, which keeps the parameter count manageable when there are many relations).
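Scoring is usually handled by a separate decoder on top of the entity embeddings. A common pairing with R-GCN is DistMult, which replaces the full relation matrix with a learned per-relation vector (a diagonal matrix). A sketch with random stand-in embeddings, trained jointly in practice:

```python
import torch

hidden_dim, num_relations = 16, 3
# Learned relation embeddings for the DistMult decoder; randomly
# initialized here for illustration only
relation_emb = torch.nn.Embedding(num_relations, hidden_dim)

def score(entity_emb, h, r, t):
    """DistMult score for triple (h, r, t): sum(e_h * w_r * e_t)."""
    return (entity_emb[h] * relation_emb.weight[r] * entity_emb[t]).sum()

entity_emb = torch.randn(5, hidden_dim)  # stand-in for R-GCN output
s = score(entity_emb, 0, 1, 2)  # scalar; higher means more plausible
```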
Approach 2: Knowledge graph embeddings
Embedding methods learn geometric relationships between entity vectors:
- TransE: models relations as translations. h + r should be close to t.
- DistMult: models relations as diagonal matrices. Score = h * r * t (element-wise).
- RotatE: models relations as rotations in complex space. Captures symmetry, inversion, and composition patterns.
- ComplEx: extends DistMult to complex space for asymmetric relations.
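The geometric intuitions above are only a few lines of tensor code each. A sketch of the TransE and RotatE scoring functions on random vectors (dimensions and initialization are illustrative; real models train these embeddings):

```python
import torch

dim = 8

# TransE: relation as a translation; smaller ||h + r - t|| => more plausible
h, r, t = torch.randn(dim), torch.randn(dim), torch.randn(dim)
transe_score = -torch.norm(h + r - t, p=1)

# RotatE: relation as an element-wise rotation in complex space;
# the relation embedding has unit modulus (a pure rotation)
hc = torch.randn(dim, dtype=torch.cfloat)
tc = torch.randn(dim, dtype=torch.cfloat)
phase = torch.randn(dim)                  # rotation angle per dimension
rc = torch.polar(torch.ones(dim), phase)  # unit-modulus complex vector
rotate_score = -torch.norm(hc * rc - tc, p=1)
```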
Enterprise example: drug discovery
A pharmaceutical company builds a biomedical knowledge graph:
- Entities: 50,000 drugs, 20,000 proteins, 10,000 diseases, 5,000 side effects
- Relations: treats, targets, inhibits, causes_side_effect, associated_with
- Triples: 2 million known facts from clinical databases
Link prediction on this graph answers: “Which existing drugs might treat Disease_X?” (drug repurposing), “Which proteins does Drug_Y likely target?” (mechanism discovery), and “What side effects might Drug_Z cause?” (safety prediction).
After training R-GCN on the known triples, the model ranks all possible (drug, treats, Disease_X) triples by score. Drugs that are structurally similar to known treatments and that target proteins associated with Disease_X will rank highest, even if no direct evidence exists yet.
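Training is typically framed as binary classification over triples, with negatives generated by corrupting the tail entity of known facts. A minimal sketch with a DistMult decoder over stand-in embedding tables (all names and sizes here are illustrative, not the pipeline's actual variables):

```python
import torch

num_entities, num_relations, dim, batch = 100, 5, 32, 64
entity_table = torch.nn.Embedding(num_entities, dim)    # stand-in for R-GCN output
relation_emb = torch.nn.Embedding(num_relations, dim)   # DistMult relation vectors

# Toy positive triples (h, r, t)
h = torch.randint(0, num_entities, (batch,))
r = torch.randint(0, num_relations, (batch,))
t = torch.randint(0, num_entities, (batch,))

def distmult(h, r, t):
    e = entity_table.weight
    return (e[h] * relation_emb.weight[r] * e[t]).sum(dim=-1)

opt = torch.optim.Adam(
    list(entity_table.parameters()) + list(relation_emb.parameters()), lr=0.01)
for step in range(10):
    neg_t = torch.randint(0, num_entities, t.shape)  # corrupted tails as negatives
    logits = torch.cat([distmult(h, r, t), distmult(h, r, neg_t)])
    labels = torch.cat([torch.ones(batch), torch.zeros(batch)])
    loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
```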
```python
# Score all drugs for treating a target disease.
# Assumes: a trained `model` (the R-GCN above), lookup dicts `entity_to_id`
# and `relation_to_id`, learned DistMult relation embeddings `relation_emb`,
# and `all_drugs`, the list of drug entity names.
disease_id = entity_to_id['Alzheimers']
relation_id = relation_to_id['treats']

# Get all entity embeddings from the trained R-GCN
entity_emb = model(edge_index, edge_type)

# Score each drug against the disease with the DistMult decoder
drug_ids = [entity_to_id[d] for d in all_drugs]
scores = []
for drug_id in drug_ids:
    h = entity_emb[drug_id]
    t = entity_emb[disease_id]
    score = (h * relation_emb[relation_id] * t).sum()
    scores.append(score.item())

# Rank drugs by score -> top candidates for repurposing
top_drugs = sorted(zip(all_drugs, scores), key=lambda x: -x[1])
```
Drug repurposing as link prediction: the model ranks drugs by their predicted likelihood of treating the target disease.
Benchmark datasets in PyG
PyG includes standard knowledge graph benchmarks:
- FB15k-237: 14,541 entities, 237 relations, 310,116 triples (Freebase subset)
- WN18RR: 40,943 entities, 11 relations, 93,003 triples (WordNet subset)
- YAGO3-10: 123,182 entities, 37 relations, 1,089,040 triples (YAGO subset)
These benchmarks evaluate link prediction: given a triple with one entity masked, rank all entities by predicted score. Metrics include Mean Reciprocal Rank (MRR) and Hits@10.
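Both metrics follow directly from the rank of the true entity among all scored candidates. A sketch on a tiny hand-made score matrix (the numbers are illustrative):

```python
import torch

# Scores of all candidate entities for each masked test triple (higher = better);
# true_ids holds the index of the correct entity for each row
scores = torch.tensor([[0.1, 0.9, 0.3],
                       [0.8, 0.2, 0.5]])
true_ids = torch.tensor([1, 2])

# rank = 1 + number of candidates scored strictly higher than the true entity
true_scores = scores.gather(1, true_ids.unsqueeze(1))
ranks = 1 + (scores > true_scores).sum(dim=1)   # -> ranks [1, 2]

mrr = (1.0 / ranks.float()).mean()          # Mean Reciprocal Rank: (1/1 + 1/2)/2
hits_at_10 = (ranks <= 10).float().mean()   # fraction of true entities in top 10
```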