The business problem
IBM estimates that bad data costs the US economy $3.1 trillion annually. A major driver: duplicate and fragmented records across systems. The same customer appears as “John Smith” in the CRM, “J. Smith” in the billing system, and “Jonathan Smith” in the support database. Without resolving these entities, analytics are wrong, campaigns are wasted, and customer experience suffers.
Traditional entity resolution uses pairwise string similarity: compare name, address, and phone number between records, score each pair, and threshold. This works for exact and near-exact matches but fails on partial information, schema differences, and transitive relationships.
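To make the baseline concrete, here is a minimal sketch of pairwise scoring with a threshold. difflib's SequenceMatcher stands in for Jaro-Winkler so the snippet needs only the standard library, and the record values are made up:

```python
# Baseline sketch: pairwise string similarity plus a threshold.
# difflib's SequenceMatcher stands in for Jaro-Winkler so this needs only
# the standard library; the names below are made up.
from difflib import SequenceMatcher

def sim(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two strings, case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

pairs = [
    ("John Smith", "Jonathan Smith"),  # near-exact: caught
    ("John Smith", "J. Smith"),        # abbreviation: missed
    ("John Smith", "Mary Jones"),      # different person
]
scores = [sim(a, b) for a, b in pairs]
matches = [s > 0.8 for s in scores]    # -> [True, False, False]
```

With this stand-in metric and a 0.8 cutoff, only the near-exact pair clears the bar; the abbreviated "J. Smith" falls below it, which is exactly the partial-information failure mode described above.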
Why flat ML fails
- No transitivity: If record A shares an email with B, and B shares an address with C, then A and C are likely the same entity. Pairwise models compare A-C directly and see no match.
- Schema mismatch: Different sources encode attributes differently. One has “full_name”, another has “first_name” + “last_name”. Feature engineering for every schema pair does not scale.
- Blocking limitations: Traditional blocking keys miss fuzzy matches. A graph approach lets evidence propagate: similar records become neighbors and reinforce each other.
- Scale: N records create O(N^2) candidate pairs. Flat models evaluate each pair independently. GNNs evaluate them jointly, using neighbor context to disambiguate.
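The transitivity point can be illustrated with a toy example, assuming three hypothetical records and exact-match evidence only:

```python
# Toy illustration with three hypothetical records: pairwise comparison
# finds no direct A-C evidence, but transitive closure links all three.
records = {
    "A": {"email": "jsmith@x.com", "address": "12 Oak St"},
    "B": {"email": "jsmith@x.com", "address": "12 Oak Street"},
    "C": {"email": "j.smith@y.com", "address": "12 Oak Street"},
}

# Edges from exact shared attributes
keys = list(records)
edges = []
for i, a in enumerate(keys):
    for b in keys[i + 1:]:
        ra, rb = records[a], records[b]
        if ra["email"] == rb["email"] or ra["address"] == rb["address"]:
            edges.append((a, b))
# edges == [("A", "B"), ("B", "C")]: no direct A-C edge exists

# Union-find gives the transitive closure: one resolved entity
parent = {k: k for k in records}
def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x
for a, b in edges:
    parent[find(a)] = find(b)
clusters = {k: find(k) for k in records}  # all three share one root
```

A pairwise model scoring A against C directly sees no shared attribute; the graph view recovers the match through B.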
The relational schema
Node types:
Record (id, source, name_emb, addr_emb, phone_hash, email_hash)
Edge types:
Record --[same_phone]--> Record
Record --[same_email]--> Record
Record --[similar_name]--> Record (jaro_winkler_score)
Record --[same_zip]--> Record
Record --[same_company]--> Record

Records from multiple sources become nodes. Blocking keys create candidate edges. The GNN predicts which edges are true entity matches.
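As a rough sketch of how blocking keys become the candidate edge lists above (field names and values here are invented):

```python
# Sketch: turning blocking keys into candidate edge lists, one list per
# edge type. Field names and values are invented for illustration.
from collections import defaultdict
from itertools import combinations

records = [
    {"id": 0, "phone": "555-0101", "email": "js@x.com",  "zip": "10001"},
    {"id": 1, "phone": "555-0101", "email": "j.s@y.com", "zip": "10001"},
    {"id": 2, "phone": "555-0199", "email": "js@x.com",  "zip": "94105"},
]

def block_edges(records, key):
    """All record-id pairs that share the same value for `key`."""
    buckets = defaultdict(list)
    for r in records:
        buckets[r[key]].append(r["id"])
    edges = []
    for ids in buckets.values():
        edges.extend(combinations(sorted(ids), 2))
    return edges

# One candidate edge list per relation in the schema above
edge_types = {k: block_edges(records, k) for k in ("phone", "email", "zip")}
# edge_types["phone"] == [(0, 1)], edge_types["email"] == [(0, 2)]
```

Bucketing by key keeps the pair count far below the full O(N^2) comparison, at the cost of recall that the union of multiple keys then restores.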
PyG architecture: SAGEConv + link prediction
import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv, Linear

class EntityResolutionGNN(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim=128):
        super().__init__()
        self.lin = Linear(in_dim, hidden_dim)
        self.conv1 = SAGEConv(hidden_dim, hidden_dim)
        self.conv2 = SAGEConv(hidden_dim, hidden_dim)

    def encode(self, x, edge_index):
        x = F.relu(self.lin(x))
        x = F.relu(self.conv1(x, edge_index))
        x = self.conv2(x, edge_index)
        return F.normalize(x, dim=-1)

    def predict_link(self, z, edge_label_index):
        src, dst = edge_label_index
        return (z[src] * z[dst]).sum(dim=-1)

    def forward(self, x, edge_index, pos_edges, neg_edges):
        z = self.encode(x, edge_index)
        pos_score = self.predict_link(z, pos_edges)
        neg_score = self.predict_link(z, neg_edges)
        pos_loss = -F.logsigmoid(pos_score).mean()
        neg_loss = -F.logsigmoid(-neg_score).mean()
        return pos_loss + neg_loss

# Inference: score all candidate pairs
model.eval()
with torch.no_grad():
    z = model.encode(data.x, data.edge_index)
    scores = model.predict_link(z, candidate_edges)
matches = candidate_edges[:, scores > threshold]

SAGEConv encodes records using neighbor context. Link prediction scores candidate pairs. Transitive signals flow through the graph: shared neighbors increase match probability.
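A minimal training loop for this objective might look as follows. A plain two-layer MLP stands in for the SAGEConv encoder so the snippet runs without torch_geometric, and all tensors are tiny synthetic placeholders; with PyG installed, swap in EntityResolutionGNN and pass edge_index through encode:

```python
# Minimal training-loop sketch for the link-prediction objective above.
# A two-layer MLP stands in for the SAGEConv encoder so this runs without
# torch_geometric; all tensors below are tiny synthetic placeholders.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(6, 16)                      # 6 candidate records, 16-dim features
pos_edges = torch.tensor([[0, 2], [1, 3]])  # labeled matches: (0,1) and (2,3)
neg_edges = torch.tensor([[0, 1], [5, 4]])  # hard negatives: (0,5) and (1,4)

encoder = torch.nn.Sequential(
    torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 32)
)
opt = torch.optim.Adam(encoder.parameters(), lr=1e-2)

def link_loss(z, pos, neg):
    """Same objective as the model's forward: logsigmoid on dot products."""
    pos_score = (z[pos[0]] * z[pos[1]]).sum(dim=-1)
    neg_score = (z[neg[0]] * z[neg[1]]).sum(dim=-1)
    return -F.logsigmoid(pos_score).mean() - F.logsigmoid(-neg_score).mean()

losses = []
for epoch in range(100):
    opt.zero_grad()
    z = F.normalize(encoder(x), dim=-1)     # mirrors encode()'s normalization
    loss = link_loss(z, pos_edges, neg_edges)
    loss.backward()
    opt.step()
    losses.append(loss.item())
# losses[-1] < losses[0]: the embeddings separate matches from non-matches
```

The loss pulls matched pairs together and pushes negatives apart on the unit sphere, which is why the dot-product scores are directly thresholdable at inference.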
Training considerations
- Blocking: Generate candidate pairs using multiple blocking keys (phone, email, name prefix + zip). Union the blocks to increase recall. The graph is built on these candidates.
- Negative sampling: Sample hard negatives from the same block (records that share a blocking key but are different entities). Easy negatives (random pairs) do not teach the model much.
- Feature encoding: Use pre-trained sentence transformers to encode name and address fields. Hash phone and email for exact-match features. Concatenate all as the node feature vector.
- Connected components: After scoring, extract connected components from the match graph. Each component is a resolved entity. Handle conflicts (A matches B, A matches C, but B and C do not match) with cluster-level optimization.
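The component-extraction and conflict-check steps in the last bullet can be sketched in plain Python (pair scores and the 0.5 threshold are illustrative):

```python
# Sketch: resolved entities as connected components of the thresholded
# match graph, with a simple check for non-transitive conflicts
# (A~B and A~C predicted, but B~C not). Pair scores are illustrative.
from collections import defaultdict

scored = {("A", "B"): 0.93, ("A", "C"): 0.88, ("B", "C"): 0.41, ("D", "E"): 0.95}
threshold = 0.5

adj = defaultdict(set)
for (a, b), s in scored.items():
    if s > threshold:
        adj[a].add(b)
        adj[b].add(a)

# Depth-first traversal to extract components
seen, components = set(), []
for node in list(adj):
    if node in seen:
        continue
    comp, stack = set(), [node]
    while stack:
        n = stack.pop()
        if n not in comp:
            comp.add(n)
            stack.extend(adj[n] - comp)
    seen |= comp
    components.append(comp)  # two components: {A, B, C} and {D, E}

def has_conflict(comp):
    """True if some internal pair was not itself predicted as a match."""
    members = sorted(comp)
    return any(
        scored.get((a, b), 0.0) <= threshold
        for i, a in enumerate(members)
        for b in members[i + 1:]
    )

flags = [has_conflict(c) for c in components]  # {A,B,C} has the B-C conflict
```

Flagged components are the ones to hand to cluster-level optimization: the A-B-C component is held together only by A, so a splitter may decide whether B and C truly belong to one entity.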
Expected performance
- String similarity (Jaro-Winkler): ~55 F1
- LightGBM (pairwise features): 62.44 AUROC
- GNN (SAGEConv link prediction): 75.83 AUROC
- KumoRFM (zero-shot): 76.71 AUROC
Or use KumoRFM in one line
PREDICT is_same_entity FOR record_a, record_b
USING records, attributes, interactions

One PQL query. KumoRFM handles blocking, graph construction, and match prediction automatically.
KumoRFM replaces blocking strategy design, graph construction, model training, and threshold tuning with a single query. It achieves 76.71 AUROC zero-shot, capturing transitive match signals automatically through its pre-trained graph transformer.