
Entity Resolution: Link Prediction for Record Matching

Duplicate and fragmented records cost enterprises an estimated $3.1 trillion annually in bad decisions. Pairwise string matching misses transitive matches. Here is how to build a GNN that resolves entities across databases using graph-based link prediction.


TL;DR

  • Entity resolution is a link prediction problem on a record similarity graph. Nodes are records, edges connect candidates that share blocking keys, and the GNN predicts which edges are true matches.
  • GNNs capture transitive matches: if A matches B and B matches C via shared attributes, the GNN infers A matches C even without direct similarity. Pairwise methods miss this.
  • On RelBench benchmarks, GNNs achieve 75.83 AUROC vs 62.44 for flat-table methods. The transitive signal provides 13+ points of lift over pairwise comparison.
  • The PyG model uses SAGEConv for message passing and a link prediction head. ~35 lines of model code, but blocking, candidate generation, and merge logic add significant complexity.
  • KumoRFM resolves entities with one PQL query (76.71 AUROC zero-shot), automatically handling blocking, similarity graph construction, and match prediction.

The business problem

IBM estimates that bad data costs the US economy $3.1 trillion annually. A major driver: duplicate and fragmented records across systems. The same customer appears as “John Smith” in the CRM, “J. Smith” in the billing system, and “Jonathan Smith” in the support database. Without resolving these entities, analytics are wrong, campaigns are wasted, and customer experience suffers.

Traditional entity resolution uses pairwise string similarity: compare name, address, and phone number between records, score each pair, and threshold. This works for exact and near-exact matches but fails on partial information, schema differences, and transitive relationships.

Why flat ML fails

  • No transitivity: If record A shares an email with B, and B shares an address with C, then A and C are likely the same entity. Pairwise models compare A-C directly and see no match.
  • Schema mismatch: Different sources encode attributes differently. One has “full_name”, another has “first_name” + “last_name”. Feature engineering for every schema pair does not scale.
  • Blocking limitations: Traditional blocking keys miss fuzzy matches. A graph approach lets evidence propagate: similar records become neighbors and reinforce each other.
  • Scale: N records create O(N^2) candidate pairs. Flat models evaluate each pair independently. GNNs evaluate them jointly, using neighbor context to disambiguate.
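The transitivity gap is easy to see in a few lines of plain Python. This is a toy sketch with invented records and scores: pairwise thresholding drops the A-C pair, while a transitive closure (here, a small union-find) over the surviving matches recovers it.

```python
# Toy pairwise similarity scores (invented for illustration).
pairwise_scores = {
    ("A", "B"): 0.92,  # A and B share an email -> high similarity
    ("B", "C"): 0.88,  # B and C share an address -> high similarity
    ("A", "C"): 0.31,  # A and C share nothing directly -> low similarity
}
THRESHOLD = 0.8

# Pairwise decision: only the directly similar pairs survive.
pairwise_matches = {p for p, s in pairwise_scores.items() if s >= THRESHOLD}
assert ("A", "C") not in pairwise_matches  # the transitive match is lost

# Transitive closure over surviving matches via union-find recovers A-C.
def resolve(matches):
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    for a, b in matches:
        parent[find(a)] = find(b)
    return find

find = resolve(pairwise_matches)
assert find("A") == find("C")  # A and C resolve to the same entity
```

A GNN achieves the same effect in feature space: A and C end up with similar embeddings because both aggregate messages from B, so the link scorer ranks the A-C pair highly even without a direct string match.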

The relational schema

schema.txt
Node types:
  Record  (id, source, name_emb, addr_emb, phone_hash, email_hash)

Edge types:
  Record --[same_phone]-->    Record
  Record --[same_email]-->    Record
  Record --[similar_name]-->  Record  (jaro_winkler_score)
  Record --[same_zip]-->      Record
  Record --[same_company]-->  Record

Records from multiple sources become nodes. Blocking keys create candidate edges. The GNN predicts which edges are true entity matches.
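Building the candidate edge set from blocking keys can be sketched in plain Python before converting to a PyG `edge_index`. The records and key names below are hypothetical, mirroring the schema above; each blocking key induces blocks, and pairs within a block become candidate edges.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records; field names mirror the schema above.
records = [
    {"id": 0, "phone_hash": "p1", "email_hash": "e1", "zip": "10115"},
    {"id": 1, "phone_hash": "p1", "email_hash": "e2", "zip": "10115"},
    {"id": 2, "phone_hash": "p2", "email_hash": "e2", "zip": "80331"},
]

def candidate_edges(records, keys=("phone_hash", "email_hash", "zip")):
    """Union the blocks induced by each blocking key into one edge set."""
    edges = set()
    for key in keys:
        block = defaultdict(list)
        for r in records:
            block[r[key]].append(r["id"])
        for ids in block.values():  # all pairs within a block are candidates
            edges.update(combinations(sorted(ids), 2))
    return sorted(edges)

print(candidate_edges(records))  # → [(0, 1), (1, 2)]
```

Note that records 0 and 2 never form a candidate pair directly; only message passing over the 0-1 and 1-2 edges can connect them.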

PyG architecture: SAGEConv + link prediction

entity_resolution_model.py
import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv, Linear

class EntityResolutionGNN(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim=128):
        super().__init__()
        self.lin = Linear(in_dim, hidden_dim)
        self.conv1 = SAGEConv(hidden_dim, hidden_dim)
        self.conv2 = SAGEConv(hidden_dim, hidden_dim)

    def encode(self, x, edge_index):
        x = F.relu(self.lin(x))
        x = F.relu(self.conv1(x, edge_index))
        x = self.conv2(x, edge_index)
        return F.normalize(x, dim=-1)

    def predict_link(self, z, edge_label_index):
        src, dst = edge_label_index
        return (z[src] * z[dst]).sum(dim=-1)

    def forward(self, x, edge_index, pos_edges, neg_edges):
        z = self.encode(x, edge_index)
        pos_score = self.predict_link(z, pos_edges)
        neg_score = self.predict_link(z, neg_edges)

        pos_loss = -F.logsigmoid(pos_score).mean()
        neg_loss = -F.logsigmoid(-neg_score).mean()
        return pos_loss + neg_loss

# Inference: score all candidate pairs
model.eval()
with torch.no_grad():  # no gradients needed at inference time
    z = model.encode(data.x, data.edge_index)
    scores = model.predict_link(z, candidate_edges)
matches = candidate_edges[:, scores > threshold]

SAGEConv encodes records using neighbor context. Link prediction scores candidate pairs. Transitive signals flow through the graph: shared neighbors increase match probability.

Training considerations

  • Blocking: Generate candidate pairs using multiple blocking keys (phone, email, name prefix + zip). Union the blocks to increase recall. The graph is built on these candidates.
  • Negative sampling: Sample hard negatives from the same block (records that share a blocking key but are different entities). Easy negatives (random pairs) do not teach the model much.
  • Feature encoding: Use pre-trained sentence transformers to encode name and address fields. Hash phone and email for exact-match features. Concatenate all as the node feature vector.
  • Connected components: After scoring, extract connected components from the match graph. Each component is a resolved entity. Handle conflicts (A matches B, A matches C, but B and C do not match) with cluster-level optimization.
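The merge step in the last bullet can be sketched with a plain BFS over the predicted match edges. This is a minimal sketch: it groups records into entities but does not do the cluster-level conflict resolution mentioned above, which a production pipeline would add.

```python
from collections import defaultdict, deque

def connected_components(n_records, match_edges):
    """Group records into resolved entities via BFS over predicted matches."""
    adj = defaultdict(set)
    for a, b in match_edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, components = set(), []
    for start in range(n_records):
        if start in seen:
            continue
        comp, queue = [], deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            comp.append(node)
            for nxt in adj[node] - seen:  # unvisited neighbors
                seen.add(nxt)
                queue.append(nxt)
        components.append(sorted(comp))
    return components

# Edges (0,1) and (1,2) collapse records 0-2 into one entity; 3 stays alone.
print(connected_components(4, [(0, 1), (1, 2)]))  # → [[0, 1, 2], [3]]
```

Because components merge through chains of edges, a single false-positive edge can fuse two large entities; that is why hard-negative training and conservative thresholds matter.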

Expected performance

  • String similarity (Jaro-Winkler): ~55 F1 (reported as F1, so not directly comparable to the AUROC figures below)
  • LightGBM (pairwise features): 62.44 AUROC
  • GNN (SAGEConv link prediction): 75.83 AUROC
  • KumoRFM (zero-shot): 76.71 AUROC

Or use KumoRFM in one line

KumoRFM PQL
PREDICT is_same_entity FOR record_a, record_b
USING records, attributes, interactions

One PQL query. KumoRFM handles blocking, graph construction, and match prediction automatically.

KumoRFM replaces blocking strategy design, graph construction, model training, and threshold tuning with a single query. It achieves 76.71 AUROC zero-shot, capturing transitive match signals automatically through its pre-trained graph transformer.

Frequently asked questions

How does GNN-based entity resolution differ from traditional record linkage?

Traditional record linkage compares pairs of records using string similarity (Jaro-Winkler, edit distance). GNNs go further by propagating match signals through the graph: if record A matches B, and B matches C via shared attributes, then A likely matches C even without direct similarity. This transitive reasoning catches matches that pairwise comparison misses.

What graph structure is used for entity resolution?

Records become nodes with attribute features (name embedding, address embedding, etc.). Edges connect records that share blocking keys (same zip code, similar name prefix, shared phone number). The GNN then predicts which edges represent true matches vs coincidental similarities.

How do you scale entity resolution to millions of records?

Use blocking to reduce the candidate pair space from O(n^2) to manageable size. Common blocking keys: first 3 characters of last name + zip code, phone number, email domain. Build the graph only on candidate pairs, then run GNN link prediction. PyG's NeighborLoader handles mini-batch training on the resulting graph.
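The blocking keys mentioned above can be sketched as a small function. The record fields and key names here are hypothetical; the point is that each record lands in several cheap-to-compute blocks, and only within-block pairs are ever compared.

```python
def blocking_keys(record):
    """Cheap blocking keys; a record can land in several blocks at once."""
    keys = []
    if record.get("last_name") and record.get("zip"):
        # First 3 characters of last name + zip code
        keys.append(("name_zip", record["last_name"][:3].lower() + record["zip"]))
    if record.get("phone"):
        keys.append(("phone", record["phone"]))
    if record.get("email"):
        keys.append(("email_domain", record["email"].split("@")[-1].lower()))
    return keys

r = {"last_name": "Smith", "zip": "10115", "email": "j.smith@example.com"}
print(blocking_keys(r))
# → [('name_zip', 'smi10115'), ('email_domain', 'example.com')]
```

With blocks of roughly constant size b, the candidate space drops from O(n^2) pairs to roughly O(n * b), which is what makes GNN link prediction feasible at millions of records.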

Can GNNs resolve entities across different schemas?

Yes. Encode attributes from different sources into a shared embedding space (using pre-trained language models for text fields). The GNN then operates on these normalized embeddings regardless of the source schema. This handles cases where one source has 'first_name' and another has 'full_name'.
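A minimal sketch of the normalization step, using character-trigram count vectors as a stand-in for a pre-trained text encoder (the encoder choice is an assumption; a production system would use a sentence transformer, but the schema-mapping logic is the same):

```python
import math
from collections import Counter

def name_text(record):
    """Map either source schema to one canonical name string."""
    if "full_name" in record:
        return record["full_name"].lower()
    return f'{record["first_name"]} {record["last_name"]}'.lower()

def trigram_vector(text):
    """Character-trigram counts as a toy stand-in for a text embedding."""
    padded = f"  {text}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) \
         * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

crm = {"full_name": "John Smith"}                       # schema A
billing = {"first_name": "John", "last_name": "Smith"}  # schema B
sim = cosine(trigram_vector(name_text(crm)), trigram_vector(name_text(billing)))
assert sim > 0.99  # identical after normalization, despite different schemas
```

Once every source maps into the same embedding space, the GNN never sees the original schemas at all; it only sees node feature vectors.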

How does KumoRFM handle entity resolution?

KumoRFM can predict link likelihood between records from different tables using a single PQL query. It automatically constructs the similarity graph, learns which attribute combinations indicate matches, and outputs match probabilities for each candidate pair.

Learn more about graph ML

PyTorch Geometric is the open-source foundation for graph neural networks. Explore more layers, concepts, and production patterns.