FB15k-237: The Standard Benchmark for Knowledge Graph Link Prediction

FB15k-237 is a knowledge graph derived from Freebase, with 14,541 entities connected by 310,116 triples across 237 relation types. It tests whether models can predict missing facts in a knowledge graph -- the foundation of question answering, recommendation, and data completion systems.

TL;DR

  • FB15k-237 has 14,541 entities, 310,116 edges, and 237 relation types. The task is link prediction: given (entity, relation, ?), predict the missing entity.
  • It fixes FB15k's data leakage from inverse relations. FB15k-237 is harder and more realistic, making it the standard knowledge graph benchmark.
  • Evaluation uses MRR (Mean Reciprocal Rank) and Hits@K. The model must rank the correct answer among all 14,541 possible entities.
  • Knowledge graph link prediction powers search, recommendations, and data completion across industries.

At a glance: 14,541 entities · 310,116 edges · 237 relation types · Task: link prediction

What FB15k-237 contains

FB15k-237 is a subset of Freebase, a large collaborative knowledge graph that Google acquired in 2010. The dataset contains 14,541 entities (people, places, organizations, concepts) connected by 310,116 triples of the form (head, relation, tail). The 237 relation types include “born_in,” “directed_by,” “nationality,” “genre,” and hundreds more.

The “237” in the name refers to the number of relation types after removing inverse relations from the original FB15k. This removal was crucial: in FB15k, models could achieve high accuracy by simply memorizing that (A, born_in, B) implies (B, birthplace_of, A). FB15k-237 eliminates this shortcut, requiring models to learn genuine relational patterns.
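The leakage this removal prevents can be made concrete with a small sketch. The triples and relation names below are invented for illustration: a test triple is "leaked" if its inverse already appears in training, so a model can answer by memorization rather than by learning relational patterns.

```python
# Hypothetical illustration of inverse-relation leakage (toy triples).
train = [("A", "born_in", "B"), ("C", "directed_by", "D")]
test = [("B", "birthplace_of", "A")]

# Suppose we know (or detect) that these relation pairs are inverses.
inverse_of = {"birthplace_of": "born_in"}

def leaked(test_triple, train_set):
    """A test triple is leaked if its inverse appears in training."""
    h, r, t = test_triple
    inv = inverse_of.get(r)
    return inv is not None and (t, inv, h) in train_set

train_set = set(train)
print([leaked(tr, train_set) for tr in test])  # [True]
```

FB15k-237's construction drops one relation from every such inverse pair, so no test triple can be answered this way.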

Why FB15k-237 matters

Link prediction on knowledge graphs is one of the most commercially important tasks in graph ML. Google uses it to answer questions (“Who directed Inception?” requires predicting (Inception, directed_by, ?)). Amazon uses it to infer product attributes. Enterprise knowledge graphs use it to complete missing data (if a company has a CEO and headquarters, predict its industry).

FB15k-237 provides a controlled benchmark for this task. The 237 relation types create a rich multi-relational graph where the model must learn type-specific patterns: the relation “born_in” connects people to locations, “directed_by” connects films to people. Models that capture these type constraints perform best.

Loading FB15k-237 in PyG

load_fb15k237.py
from torch_geometric.datasets import FB15k_237

# Each split is loaded separately via the `split` argument
train_data = FB15k_237(root='/tmp/FB15k237', split='train')[0]
val_data = FB15k_237(root='/tmp/FB15k237', split='val')[0]
test_data = FB15k_237(root='/tmp/FB15k237', split='test')[0]

print(f"Entities: {train_data.num_nodes}")                  # 14541
print(f"Train triples: {train_data.edge_index.size(1)}")    # 272115
print(f"Relations: {int(train_data.edge_type.max()) + 1}")  # 237

# Evaluation: for each test triple (h, r, ?), rank the correct
# tail among all 14,541 entities

Each triple has head, relation, and tail indices: heads and tails are stored in `edge_index`, relation types in `edge_type`. Use a knowledge graph embedding framework (for example, PyG's `torch_geometric.nn.kge` module) for training.
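To make the embedding idea concrete, here is a toy TransE scorer in NumPy. This is a sketch with randomly initialized embeddings standing in for trained ones; the `transe_score` and `predict_tail` names are mine, not a library API. TransE models a triple as a translation: head + relation ≈ tail.

```python
import numpy as np

rng = np.random.default_rng(0)
num_entities, num_relations, dim = 14541, 237, 50

# Random embeddings stand in for trained ones in this sketch.
ent = rng.normal(size=(num_entities, dim))
rel = rng.normal(size=(num_relations, dim))

def transe_score(h, r, t):
    """Higher score (less negative) means a more plausible triple."""
    return -np.linalg.norm(ent[h] + rel[r] - ent[t])

def predict_tail(h, r, k=10):
    """Score every entity as a candidate tail; return the top-k."""
    scores = -np.linalg.norm(ent[h] + rel[r] - ent, axis=1)
    return np.argsort(-scores)[:k]

top10 = predict_tail(h=0, r=5)
print(top10.shape)  # (10,)
```

At evaluation time, this all-entities scoring is exactly what produces the rank of the correct tail among all 14,541 candidates.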

Common tasks and benchmarks

Link prediction with filtered MRR and Hits@K. For each test triple (h, r, ?), score all possible tails and report the rank of the correct one. TransE: ~0.294 MRR. ComplEx: ~0.247. RotatE: ~0.338. R-GCN: ~0.249. CompGCN: ~0.355. The multi-relational structure benefits methods (RotatE, CompGCN) that model relation-specific transformations in embedding space.
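The "filtered" setting deserves a concrete sketch: other known-true tails for (h, r) are removed from the candidate list before computing the rank, so a model is not penalized for ranking a different correct answer highly. The scores and entity IDs below are invented for illustration.

```python
# Sketch of filtered ranking and the metrics built on it (toy numbers).
def filtered_rank(scores, correct, known_tails):
    """Rank `correct` among candidates, ignoring other true tails."""
    target = scores[correct]
    better = sum(
        1 for e, s in enumerate(scores)
        if s > target and e != correct and e not in known_tails
    )
    return better + 1

def mrr_and_hits(ranks, k=10):
    """MRR is the mean of 1/rank; Hits@k the fraction with rank <= k."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = sum(1 for r in ranks if r <= k) / len(ranks)
    return mrr, hits

# Entity 2 is correct; entity 0 scores higher but is another true tail,
# so filtering removes it and entity 2 ranks first.
scores = [0.9, 0.1, 0.8, 0.3]
rank = filtered_rank(scores, correct=2, known_tails={0})
print(rank)                     # 1
print(mrr_and_hits([rank, 4]))  # (0.625, 1.0)
```

On FB15k-237 the candidate list is all 14,541 entities, and the reported numbers average over every test triple in both (h, r, ?) and (?, r, t) directions.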

Example: enterprise data completion

A company's CRM contains partial data: some contacts have company names but no industry, some have titles but no department. The CRM forms a knowledge graph. Link prediction fills gaps: given (Contact, works_at, Acme Corp) and (Acme Corp, industry, ?), predict the industry. This automated data completion improves lead scoring, segmentation, and reporting quality without manual data entry.

Published benchmark results

Link prediction on FB15k-237. Filtered MRR and Hits@10. Higher is better for both metrics.

| Method   | MRR   | Hits@10 | Year | Paper                 |
|----------|-------|---------|------|-----------------------|
| TransE   | 0.294 | 0.465   | 2013 | Bordes et al.         |
| DistMult | 0.241 | 0.419   | 2015 | Yang et al.           |
| ComplEx  | 0.247 | 0.428   | 2016 | Trouillon et al.      |
| RotatE   | 0.338 | 0.533   | 2019 | Sun et al.            |
| R-GCN    | 0.249 | 0.417   | 2018 | Schlichtkrull et al.  |
| CompGCN  | 0.355 | 0.535   | 2020 | Vashishth et al.      |

Original Paper

Observed versus Latent Features for Knowledge Base and Text Inference

Kristina Toutanova, Danqi Chen (2015). 3rd Workshop on Continuous Vector Space Models and their Compositionality

Original data source

FB15k-237 was created by Toutanova and Chen (2015) by removing inverse relations from FB15k. The dataset is available from Microsoft Research. The original Freebase data is from Google's Freebase project.

cite_fb15k237.bib
@inproceedings{toutanova2015observed,
  title={Observed versus Latent Features for Knowledge Base and Text Inference},
  author={Toutanova, Kristina and Chen, Danqi},
  booktitle={3rd Workshop on Continuous Vector Space Models and their Compositionality},
  pages={57--66},
  year={2015}
}

BibTeX citation for the FB15k-237 dataset.

Which dataset should I use?

FB15k-237 vs FB15k: Always use FB15k-237. FB15k has inverse relation leakage that inflates scores artificially. FB15k-237 fixes this and is the accepted standard.

FB15k-237 vs WN18RR: Both are standard KG benchmarks. FB15k-237 (Freebase) has more relation types (237 vs 11) and is more diverse. WN18RR (WordNet) tests hierarchical reasoning. Most papers report on both.

FB15k-237 vs NELL: FB15k-237 is for link prediction. NELL is for entity typing (node classification). Different tasks on knowledge graphs.

From benchmark to production

Production knowledge graphs have millions of entities, thousands of relation types, and temporal dynamics (facts change: CEOs leave, companies merge). They also require reasoning chains: combining multiple triples to infer new facts. FB15k-237 tests single-hop prediction; production systems need multi-hop reasoning.
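A two-hop chain can be sketched explicitly. The CRM-style entities and relations below are made up; production systems learn such compositions (e.g. via multi-hop query embeddings), whereas this sketch chains the triples by hand.

```python
# Toy CRM knowledge graph; entities and relations are illustrative only.
triples = {
    ("Contact", "works_at", "Acme Corp"),
    ("Acme Corp", "industry", "Manufacturing"),
}

def infer_contact_industry(contact, triples):
    """Chain two hops: contact -works_at-> company -industry-> ?"""
    companies = {t for h, r, t in triples if h == contact and r == "works_at"}
    return {t for h, r, t in triples if r == "industry" and h in companies}

print(infer_contact_industry("Contact", triples))  # {'Manufacturing'}
```

FB15k-237 only scores the single-hop step; benchmarks like its multi-hop query extensions are needed to evaluate the chained inference itself.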

Frequently asked questions

What is the FB15k-237 dataset?

FB15k-237 is a subset of the Freebase knowledge graph with 14,541 entities (nodes), 310,116 triples (edges), and 237 relation types. It was created by removing inverse relations from FB15k to prevent data leakage. The task is link prediction: given (head, relation, ?), predict the missing tail entity.

How does FB15k-237 differ from FB15k?

FB15k-237 removes inverse relations that made FB15k artificially easy. In FB15k, if (A, born_in, B) exists in training, (B, place_of_birth, A) might appear in the test set, making prediction trivial. FB15k-237 removes these inverse pairs, resulting in a harder, more realistic benchmark.

How do I load FB15k-237 in PyTorch Geometric?

Use `from torch_geometric.datasets import FB15k_237; dataset = FB15k_237(root='/tmp/FB15k237', split='train')`. Load the 'val' and 'test' splits the same way. Each triple has (head, relation, tail) indices, stored in `edge_index` and `edge_type`.

What metrics are used for FB15k-237?

Mean Reciprocal Rank (MRR) and Hits@K (K=1,3,10). For each test triple (h, r, ?), rank all possible tail entities and report where the correct tail appears. MRR is the average of 1/rank. Hits@10 is the fraction where the correct answer is in the top 10.

What models work best on FB15k-237?

TransE, ComplEx, and RotatE are classic embedding approaches. R-GCN and CompGCN are GNN-based methods that learn entity embeddings via message passing over the knowledge graph. Among published results, RotatE and CompGCN lead; what matters most on FB15k-237 is modeling relation-specific transformations, not the embedding-versus-GNN distinction.

Learn more about graph ML

PyTorch Geometric is the open-source foundation for graph neural networks. Explore more layers, concepts, and production patterns.