The business problem
IBM estimates that bad data costs the US economy $3.1 trillion annually. A major driver: duplicate and fragmented records across systems. The same customer appears as “John Smith” in the CRM, “J. Smith” in the billing system, and “Jonathan Smith” in the support database. Without resolving these entities, analytics are wrong, campaigns are wasted, and customer experience suffers.
Traditional entity resolution uses pairwise string similarity: compare name, address, and phone number between records, score each pair, and threshold. This works for exact and near-exact matches but fails on partial information, schema differences, and transitive relationships.
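To make the baseline concrete, here is a minimal sketch of pairwise scoring with a threshold. difflib's SequenceMatcher stands in for Jaro-Winkler so the snippet needs only the standard library, and the record values are made up:

```python
# Baseline sketch: pairwise string similarity plus a threshold.
# difflib's SequenceMatcher stands in for Jaro-Winkler so this needs only
# the standard library; the names below are made up.
from difflib import SequenceMatcher

def sim(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two strings, case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

pairs = [
    ("John Smith", "Jonathan Smith"),  # near-exact: caught
    ("John Smith", "J. Smith"),        # abbreviation: missed
    ("John Smith", "Mary Jones"),      # different person
]
scores = [sim(a, b) for a, b in pairs]
matches = [s > 0.8 for s in scores]    # -> [True, False, False]
```

With this stand-in metric and a 0.8 cutoff, only the near-exact pair clears the bar; the abbreviated "J. Smith" falls below it, which is exactly the partial-information failure mode described above.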
Why flat ML fails
- No transitivity: If record A shares an email with B, and B shares an address with C, then A and C are likely the same entity. Pairwise models compare A-C directly and see no match.
- Schema mismatch: Different sources encode attributes differently. One has “full_name”, another has “first_name” + “last_name”. Feature engineering for every schema pair does not scale.
- Blocking limitations: Traditional blocking keys miss fuzzy matches. A graph approach lets evidence propagate: similar records become neighbors and reinforce each other.
- Scale: N records create O(N^2) candidate pairs. Flat models evaluate each pair independently. GNNs evaluate them jointly, using neighbor context to disambiguate.
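The transitivity point can be illustrated with a toy example, assuming three hypothetical records and exact-match evidence only:

```python
# Toy illustration with three hypothetical records: pairwise comparison
# finds no direct A-C evidence, but transitive closure links all three.
records = {
    "A": {"email": "jsmith@x.com", "address": "12 Oak St"},
    "B": {"email": "jsmith@x.com", "address": "12 Oak Street"},
    "C": {"email": "j.smith@y.com", "address": "12 Oak Street"},
}

# Edges from exact shared attributes
keys = list(records)
edges = []
for i, a in enumerate(keys):
    for b in keys[i + 1:]:
        ra, rb = records[a], records[b]
        if ra["email"] == rb["email"] or ra["address"] == rb["address"]:
            edges.append((a, b))
# edges == [("A", "B"), ("B", "C")]: no direct A-C edge exists

# Union-find gives the transitive closure: one resolved entity
parent = {k: k for k in records}
def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x
for a, b in edges:
    parent[find(a)] = find(b)
clusters = {k: find(k) for k in records}  # all three share one root
```

A pairwise model scoring A against C directly sees no shared attribute; the graph view recovers the match through B.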
The relational schema
Node types:
Record (id, source, name_emb, addr_emb, phone_hash, email_hash)
Edge types:
Record --[same_phone]--> Record
Record --[same_email]--> Record
Record --[similar_name]--> Record (jaro_winkler_score)
Record --[same_zip]--> Record
Record --[same_company]--> Record

Records from multiple sources become nodes. Blocking keys create candidate edges. The GNN predicts which edges are true entity matches.
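As a rough sketch of how blocking keys become the candidate edge lists above (field names and values here are invented):

```python
# Sketch: turning blocking keys into candidate edge lists, one list per
# edge type. Field names and values are invented for illustration.
from collections import defaultdict
from itertools import combinations

records = [
    {"id": 0, "phone": "555-0101", "email": "js@x.com",  "zip": "10001"},
    {"id": 1, "phone": "555-0101", "email": "j.s@y.com", "zip": "10001"},
    {"id": 2, "phone": "555-0199", "email": "js@x.com",  "zip": "94105"},
]

def block_edges(records, key):
    """All record-id pairs that share the same value for `key`."""
    buckets = defaultdict(list)
    for r in records:
        buckets[r[key]].append(r["id"])
    edges = []
    for ids in buckets.values():
        edges.extend(combinations(sorted(ids), 2))
    return edges

# One candidate edge list per relation in the schema above
edge_types = {k: block_edges(records, k) for k in ("phone", "email", "zip")}
# edge_types["phone"] == [(0, 1)], edge_types["email"] == [(0, 2)]
```

Bucketing by key keeps the pair count far below the full O(N^2) comparison, at the cost of recall that the union of multiple keys then restores.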
PyG architecture: SAGEConv + link prediction
import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv, Linear

class EntityResolutionGNN(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim=128):
        super().__init__()
        self.lin = Linear(in_dim, hidden_dim)
        self.conv1 = SAGEConv(hidden_dim, hidden_dim)
        self.conv2 = SAGEConv(hidden_dim, hidden_dim)

    def encode(self, x, edge_index):
        x = F.relu(self.lin(x))
        x = F.relu(self.conv1(x, edge_index))
        x = self.conv2(x, edge_index)
        return F.normalize(x, dim=-1)

    def predict_link(self, z, edge_label_index):
        src, dst = edge_label_index
        return (z[src] * z[dst]).sum(dim=-1)

    def forward(self, x, edge_index, pos_edges, neg_edges):
        z = self.encode(x, edge_index)
        pos_score = self.predict_link(z, pos_edges)
        neg_score = self.predict_link(z, neg_edges)
        pos_loss = -F.logsigmoid(pos_score).mean()
        neg_loss = -F.logsigmoid(-neg_score).mean()
        return pos_loss + neg_loss

# Inference: score all candidate pairs
model.eval()
with torch.no_grad():
    z = model.encode(data.x, data.edge_index)
    scores = model.predict_link(z, candidate_edges)
matches = candidate_edges[:, scores > threshold]

SAGEConv encodes records using neighbor context. Link prediction scores candidate pairs. Transitive signals flow through the graph: shared neighbors increase match probability.
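A minimal training loop for this objective might look as follows. A plain two-layer MLP stands in for the SAGEConv encoder so the snippet runs without torch_geometric, and all tensors are tiny synthetic placeholders; with PyG installed, swap in EntityResolutionGNN and pass edge_index through encode:

```python
# Minimal training-loop sketch for the link-prediction objective above.
# A two-layer MLP stands in for the SAGEConv encoder so this runs without
# torch_geometric; all tensors below are tiny synthetic placeholders.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(6, 16)                      # 6 candidate records, 16-dim features
pos_edges = torch.tensor([[0, 2], [1, 3]])  # labeled matches: (0,1) and (2,3)
neg_edges = torch.tensor([[0, 1], [5, 4]])  # hard negatives: (0,5) and (1,4)

encoder = torch.nn.Sequential(
    torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 32)
)
opt = torch.optim.Adam(encoder.parameters(), lr=1e-2)

def link_loss(z, pos, neg):
    """Same objective as the model's forward: logsigmoid on dot products."""
    pos_score = (z[pos[0]] * z[pos[1]]).sum(dim=-1)
    neg_score = (z[neg[0]] * z[neg[1]]).sum(dim=-1)
    return -F.logsigmoid(pos_score).mean() - F.logsigmoid(-neg_score).mean()

losses = []
for epoch in range(100):
    opt.zero_grad()
    z = F.normalize(encoder(x), dim=-1)     # mirrors encode()'s normalization
    loss = link_loss(z, pos_edges, neg_edges)
    loss.backward()
    opt.step()
    losses.append(loss.item())
# losses[-1] < losses[0]: the embeddings separate matches from non-matches
```

The loss pulls matched pairs together and pushes negatives apart on the unit sphere, which is why the dot-product scores are directly thresholdable at inference.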
Training considerations
- Blocking: Generate candidate pairs using multiple blocking keys (phone, email, name prefix + zip). Union the blocks to increase recall. The graph is built on these candidates.
- Negative sampling: Sample hard negatives from the same block (records that share a blocking key but are different entities). Easy negatives (random pairs) do not teach the model much.
- Feature encoding: Use pre-trained sentence transformers to encode name and address fields. Hash phone and email for exact-match features. Concatenate all as the node feature vector.
- Connected components: After scoring, extract connected components from the match graph. Each component is a resolved entity. Handle conflicts (A matches B, A matches C, but B and C do not match) with cluster-level optimization.
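The component-extraction and conflict-check steps in the last bullet can be sketched in plain Python (pair scores and the 0.5 threshold are illustrative):

```python
# Sketch: resolved entities as connected components of the thresholded
# match graph, with a simple check for non-transitive conflicts
# (A~B and A~C predicted, but B~C not). Pair scores are illustrative.
from collections import defaultdict

scored = {("A", "B"): 0.93, ("A", "C"): 0.88, ("B", "C"): 0.41, ("D", "E"): 0.95}
threshold = 0.5

adj = defaultdict(set)
for (a, b), s in scored.items():
    if s > threshold:
        adj[a].add(b)
        adj[b].add(a)

# Depth-first traversal to extract components
seen, components = set(), []
for node in list(adj):
    if node in seen:
        continue
    comp, stack = set(), [node]
    while stack:
        n = stack.pop()
        if n not in comp:
            comp.add(n)
            stack.extend(adj[n] - comp)
    seen |= comp
    components.append(comp)  # two components: {A, B, C} and {D, E}

def has_conflict(comp):
    """True if some internal pair was not itself predicted as a match."""
    members = sorted(comp)
    return any(
        scored.get((a, b), 0.0) <= threshold
        for i, a in enumerate(members)
        for b in members[i + 1:]
    )

flags = [has_conflict(c) for c in components]  # {A,B,C} has the B-C conflict
```

Flagged components are the ones to hand to cluster-level optimization: the A-B-C component is held together only by A, so a splitter may decide whether B and C truly belong to one entity.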
Expected performance
- String similarity (Jaro-Winkler): ~55 F1
- LightGBM (pairwise features): 62.44 AUROC
- GNN (SAGEConv link prediction): 75.83 AUROC
- KumoRFM (zero-shot): 76.71 AUROC
Or use KumoRFM in one line
PREDICT is_same_entity FOR record_a, record_b
USING records, attributes, interactions

One PQL query. KumoRFM handles blocking, graph construction, and match prediction automatically.
KumoRFM replaces blocking strategy design, graph construction, model training, and threshold tuning with a single query. It achieves 76.71 AUROC zero-shot, capturing transitive match signals automatically through its pre-trained graph transformer.