What is entity resolution?

Entity resolution is the task of determining whether two records in a database refer to the same real-world entity. For example, 'John Smith, 123 Main St' and 'J. Smith, 123 Main Street' may be the same person. Graph-based approaches use shared connections (same phone number, same address, same transactions) to resolve ambiguity that text matching alone cannot.

How do graphs help with entity resolution?

Graphs add structural context beyond string similarity. Two customer records that share a phone number, a mailing address, and three common merchants are very likely the same person, even if their names are spelled differently. GNNs learn to combine textual similarity with structural overlap to make more accurate matching decisions.

What is the difference between entity resolution and link prediction?

Link prediction predicts whether an edge will form between two existing nodes. Entity resolution predicts whether two nodes are actually the same entity and should be merged. Entity resolution changes the graph topology (merging nodes), while link prediction adds edges between distinct entities.

Entity Resolution with Graphs: Finding Duplicate Records Using Structure | Kumo.ai

Entity resolution is the task of determining whether two records in a database refer to the same real-world entity. Graph-based approaches solve this by representing records as nodes and their shared attributes (phone numbers, addresses, transaction partners) as edges. A GNN then learns that records with overlapping neighborhoods are likely duplicates, even when their text fields differ significantly.

Why text matching is not enough

Traditional entity resolution relies on string similarity between record fields: fuzzy matching on names, exact matching on phone numbers, Jaccard similarity on addresses. This fails when:

Names are spelled differently: “Muhammad” vs “Mohammed” vs “Mohamed”
Addresses change: a person moves but keeps the same phone number
Records span languages: “Deutsche Bank” vs “German Bank”
Intentional obfuscation: fraudsters use slight name variations across accounts

Graph structure provides context that text matching lacks. Two accounts that share 80% of their transaction merchants are very likely controlled by the same person, regardless of how the name is spelled.
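To make the contrast concrete, here is a minimal sketch using only the Python standard library. The two records, their merchant sets, and the helper names `name_similarity` and `shared_fraction` are made up for illustration and do not come from any real dataset or library.

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (difflib's ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def shared_fraction(a: set, b: set) -> float:
    """Share of the smaller set that also appears in the other set."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

# Hypothetical records: same person, different transliteration of the name
rec_a = {"name": "Muhammad Al-Farsi", "merchants": {"m1", "m2", "m3", "m4", "m5"}}
rec_b = {"name": "Mohamed Alfarsy", "merchants": {"m1", "m2", "m3", "m4", "m6"}}

text_score = name_similarity(rec_a["name"], rec_b["name"])
merchant_score = shared_fraction(rec_a["merchants"], rec_b["merchants"])

print(f"name similarity:  {text_score:.2f}")
print(f"merchant overlap: {merchant_score:.2f}")
```

The merchant overlap here is 4 of 5 merchants, so the structural signal stays strong no matter how the name score comes out.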

Graph construction for entity resolution

The graph connects records to their attributes and to each other through shared attributes:

er_graph_construction.py

# Build a graph for entity resolution
# (schematic: `graph` stands in for any heterogeneous graph API)
# Records and attributes are both nodes; edges link each record to its attributes

# Node types
graph.add_nodes('record', features=record_embeddings)
graph.add_nodes('phone', features=None)    # attribute nodes
graph.add_nodes('address', features=addr_embeddings)
graph.add_nodes('merchant', features=None)

# Edges: records connect to their attributes
graph.add_edges('record', 'has_phone', 'phone')
graph.add_edges('record', 'has_address', 'address')
graph.add_edges('record', 'transacted_at', 'merchant')

# Two records sharing a phone node are likely the same entity
# The GNN learns this from the structural overlap

Shared attribute nodes create implicit connections between potentially duplicate records. The GNN learns which shared attributes are most indicative of duplicates.
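The implicit connections can be made explicit with a small sketch in plain Python. The edge triples below are hypothetical, and `shared_attribute_counts` is an illustrative helper, not part of any graph library. It indexes records by attribute node and counts how many attribute nodes each pair of records has in common.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (record, edge type, attribute) triples
edges = [
    ("r1", "has_phone", "555-0100"),
    ("r2", "has_phone", "555-0100"),
    ("r1", "has_address", "123 main st"),
    ("r2", "has_address", "123 main st"),
    ("r3", "has_address", "123 main st"),
    ("r1", "transacted_at", "merchant_a"),
    ("r3", "transacted_at", "merchant_b"),
]

def shared_attribute_counts(edges):
    """Count, for every pair of records, how many attribute nodes they share."""
    records_by_attr = defaultdict(set)
    for record, edge_type, attr in edges:
        records_by_attr[(edge_type, attr)].add(record)

    counts = defaultdict(int)
    for records in records_by_attr.values():
        # Every pair of records attached to the same attribute node is linked
        for a, b in combinations(sorted(records), 2):
            counts[(a, b)] += 1
    return dict(counts)

pair_counts = shared_attribute_counts(edges)
for pair, n in sorted(pair_counts.items()):
    print(pair, n)
```

Pairs with a non-zero count are natural candidates to score. A raw count treats every attribute type equally, whereas the GNN learns how much weight each kind of shared attribute deserves.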

GNN approach

The GNN-based entity resolution pipeline:

Encode records: Each record node aggregates information from its attribute neighbors (phone, address, merchant). After two layers of message passing, each record's representation encodes its own attributes and, through them, the other records that share those attributes.
Compute similarity: For each candidate pair, compute the similarity (dot product, cosine, or learned metric) between their GNN-encoded representations.
Classify: A threshold or learned classifier determines whether the pair is a match (same entity) or not.
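The three steps above can be sketched in plain Python. This is a deliberately simplified, untrained stand-in for a GNN: each layer is a mean over a node and its neighbors with no learned weights, attribute nodes start as one-hot vectors, and record nodes start at zero (in practice they would carry text embeddings). The records, the helper names, and the 0.5 threshold are all assumptions made for illustration.

```python
import math

# Hypothetical bipartite graph: each record lists its attribute nodes
record_attrs = {
    "r1": ["phone:555-0100", "addr:123 main st", "merchant:a", "merchant:b"],
    "r2": ["phone:555-0100", "addr:123 main st", "merchant:a", "merchant:c"],
    "r3": ["phone:555-0199", "addr:9 oak ave", "merchant:d"],
}

def build_neighbors(record_attrs):
    """Undirected adjacency between record nodes and attribute nodes."""
    nbrs = {}
    for rec, attrs in record_attrs.items():
        nbrs.setdefault(rec, set())
        for attr in attrs:
            nbrs[rec].add(attr)
            nbrs.setdefault(attr, set()).add(rec)
    return nbrs

def message_passing(nbrs, features, num_layers=2):
    """Each layer replaces a node's vector with the mean over itself and its neighbors."""
    h = dict(features)
    for _ in range(num_layers):
        new_h = {}
        for node, neighbors in nbrs.items():
            group = [h[node]] + [h[n] for n in neighbors]
            new_h[node] = [sum(col) / len(group) for col in zip(*group)]
        h = new_h
    return h

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Step 1: encode records with two rounds of message passing
nbrs = build_neighbors(record_attrs)
attrs = sorted(n for n in nbrs if n not in record_attrs)
dim = len(attrs)
features = {a: [1.0 if i == j else 0.0 for j in range(dim)] for i, a in enumerate(attrs)}
features.update({r: [0.0] * dim for r in record_attrs})
h = message_passing(nbrs, features, num_layers=2)

# Steps 2 and 3: cosine similarity per candidate pair, then a fixed threshold
THRESHOLD = 0.5  # assumed cut-off; a real pipeline would tune or learn this

def is_match(a, b):
    return cosine(h[a], h[b]) >= THRESHOLD

print("r1 ~ r2:", is_match("r1", "r2"))
print("r1 ~ r3:", is_match("r1", "r3"))
```

Records r1 and r2 share a phone, an address, and a merchant, so their encoded vectors point in similar directions. Record r3 shares nothing with r1, so their similarity is zero.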

Enterprise applications

Customer deduplication: Merging duplicate CRM records improves marketing efficiency and customer 360 views. A large bank may have 10-15% duplicate customer records.
Fraud ring detection: Identifying multiple accounts controlled by the same person or syndicate. Fraudsters create slight variations but share devices, IP addresses, and transaction patterns.
Data integration: Linking records across databases (e.g., combining customer data after two banks merge) where schemas and identifiers differ.
Know Your Customer (KYC): Matching customer records against sanctions lists and adverse media across languages and name transliterations.

Results

Graph-based entity resolution consistently outperforms text-only approaches:

On records with high text overlap (>80% Jaccard): both methods achieve ~95% F1, with graph adding only 1-2 points
On records with medium text overlap (40-80%): text achieves 75% F1, graph achieves 88% F1
On records with low text overlap (<40%): text achieves 45% F1, graph achieves 72% F1

The value of graph structure is largest when text similarity is ambiguous, and these are precisely the hard cases that matter most.
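The article does not say how these F1 figures were computed. One common convention for entity resolution is pairwise F1, the harmonic mean of precision and recall over predicted matching pairs. The sketch below assumes that convention; the pairs are invented and `pairwise_f1` is an illustrative helper.

```python
def pairwise_f1(predicted: set, actual: set) -> float:
    """F1 = 2PR / (P + R) over predicted vs. true matching pairs."""
    true_pos = len(predicted & actual)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(actual)
    return 2 * precision * recall / (precision + recall)

# Hypothetical pairs, stored as frozensets so (a, b) equals (b, a)
actual = {frozenset(p) for p in [("r1", "r2"), ("r3", "r4"), ("r5", "r6"), ("r7", "r8")]}
predicted = {frozenset(p) for p in [("r2", "r1"), ("r3", "r4"), ("r5", "r9")]}

score = pairwise_f1(predicted, actual)
print(f"pairwise F1: {score:.3f}")
```

Here two of the three predicted pairs are correct (precision 2/3) and two of the four true pairs are found (recall 1/2), which gives an F1 of 4/7.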

Key Takeaways

1. Entity resolution determines if two records are the same entity. Graphs add structural context (shared phone, address, merchants) beyond text similarity, resolving ambiguous cases.
2. Construct the graph with records as nodes and shared attributes as edges. Two records sharing a phone number, address, and transaction partners are likely the same entity.
3. GNNs encode each record by aggregating its attribute neighborhood, then compare representations to classify pairs as matches or non-matches.
4. Graph approaches shine on hard cases: 72% F1 vs 45% for text matching when text overlap is below 40%. The largest gains are on records that text matching cannot resolve.
5. Enterprise applications: customer deduplication (10-15% duplicate rates at large banks), fraud ring detection, cross-database integration, and KYC/sanctions screening.
