Entity Resolution: Finding Duplicate Records Using Graph Structure

String matching says 'John Smith' and 'J. Smith' are 60% similar. Graph structure says they share a phone number, an address, and five common merchants, which puts the probability that they are the same person at about 99%. Graphs make entity resolution dramatically more accurate.

TL;DR

  • Entity resolution determines whether two records refer to the same real-world entity. Graph-based approaches use shared connections (phone, address, transactions) in addition to text similarity.
  • GNNs encode each record as a node, with edges to shared attributes and connections. The model learns that two nodes with similar neighborhoods are likely the same entity.
  • Graph structure resolves cases that text matching cannot: different spellings of the same name, different addresses for the same person who moved, or records in different languages.
  • Applications: customer deduplication (merging CRM records), fraud detection (identifying accounts controlled by the same person), and data integration (linking records across databases).
  • Graph-based entity resolution improves F1 by 10-25% over text-matching baselines on enterprise datasets, with the largest gains on records with poor text overlap.

Entity resolution is the task of determining whether two records in a database refer to the same real-world entity. Graph-based approaches solve this by representing records as nodes and their shared attributes (phone numbers, addresses, transaction partners) as edges. A GNN then learns that records with overlapping neighborhoods are likely duplicates, even when their text fields differ significantly.

Why text matching is not enough

Traditional entity resolution relies on string similarity between record fields: fuzzy matching on names, exact matching on phone numbers, Jaccard similarity on addresses. This fails when:

  • Names are spelled differently: “Muhammad” vs “Mohammed” vs “Mohamed”
  • Addresses change: a person moves but keeps the same phone number
  • Records span languages: “Deutsche Bank” vs “German Bank”
  • Intentional obfuscation: fraudsters use slight name variations across accounts

Graph structure provides context that text matching lacks. Two accounts with 80% of their transaction merchants in common almost certainly belong to the same person, regardless of how the name is spelled.
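To make the contrast concrete, here is a minimal sketch that scores one pair of records two ways: by name similarity and by merchant overlap. The records, the field names, and the choice of `difflib.SequenceMatcher` as the string matcher are assumptions made for illustration, not the method behind any figure in this guide.

```python
from difflib import SequenceMatcher


def text_similarity(a: str, b: str) -> float:
    """Character-level similarity ratio in [0, 1] from difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def overlap_share(a: set, b: set) -> float:
    """Fraction of the smaller set that also appears in the other set."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


# Hypothetical records: the names are spelled differently,
# but four of five merchants are shared.
record_a = {"name": "Muhammad Rahman", "merchants": {"m1", "m2", "m3", "m4", "m5"}}
record_b = {"name": "Mohamed Rehman", "merchants": {"m1", "m2", "m3", "m4", "m6"}}

name_score = text_similarity(record_a["name"], record_b["name"])
merchant_score = overlap_share(record_a["merchants"], record_b["merchants"])

print(f"name similarity:  {name_score:.2f}")
print(f"merchant overlap: {merchant_score:.2f}")
```

The merchant score depends only on which merchants the two records share, so it stays the same however the name is transliterated or abbreviated.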

Graph construction for entity resolution

The graph connects records to their attributes and to each other through shared attributes:

er_graph_construction.py
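# Build a heterogeneous graph for entity resolution with PyTorch Geometric
# Records and attributes are nodes; edges link each record to its attributes
import torch
from torch_geometric.data import HeteroData

# Placeholder inputs for illustration: 3 records, 2 phones, 2 addresses, 2 merchants
record_embeddings = torch.randn(3, 16)
addr_embeddings = torch.randn(2, 16)

graph = HeteroData()

# Node types
graph['record'].x = record_embeddings
graph['phone'].num_nodes = 2        # attribute nodes without features
graph['address'].x = addr_embeddings
graph['merchant'].num_nodes = 2

# Edges: records connect to their attributes
# Row 0 holds record indices, row 1 holds attribute indices
graph['record', 'has_phone', 'phone'].edge_index = torch.tensor([[0, 1, 2], [0, 0, 1]])
graph['record', 'has_address', 'address'].edge_index = torch.tensor([[0, 1, 2], [0, 1, 1]])
graph['record', 'transacted_at', 'merchant'].edge_index = torch.tensor([[0, 1, 2], [0, 0, 1]])

# Records 0 and 1 share phone 0 and merchant 0, so they are likely the same entity
# The GNN learns this from the structural overlap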

Shared attribute nodes create implicit connections between potentially duplicate records. The GNN learns which shared attributes are most indicative of duplicates.
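One way to see these implicit connections is to invert the record-to-attribute edges and list every pair of records that hangs off the same attribute node. The sketch below does this in plain Python; the edge list and identifiers are hypothetical, and counting shared attribute nodes is only a simple stand-in for what a trained model would weigh.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (record_id, attribute_type, attribute_value) edges.
edges = [
    ("r1", "phone", "555-0100"),
    ("r2", "phone", "555-0100"),
    ("r1", "address", "123 Main St"),
    ("r3", "address", "9 Elm Rd"),
    ("r1", "merchant", "grocer"),
    ("r2", "merchant", "grocer"),
    ("r3", "merchant", "grocer"),
]

# Invert the edges: attribute node -> records attached to it.
records_by_attribute = defaultdict(set)
for record_id, attr_type, attr_value in edges:
    records_by_attribute[(attr_type, attr_value)].add(record_id)

# Every pair of records attached to the same attribute node is a candidate.
# Count how many attribute nodes each pair shares.
shared_counts = defaultdict(int)
for records in records_by_attribute.values():
    for pair in combinations(sorted(records), 2):
        shared_counts[pair] += 1

for pair, count in sorted(shared_counts.items()):
    print(pair, count)
```

Here r1 and r2 share two attribute nodes (phone and merchant), while r3 is linked to the others only through a merchant. An attribute node attached to many records, such as a popular merchant, produces many candidate pairs but is weak evidence on its own.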

GNN approach

The GNN-based entity resolution pipeline:

  1. Encode records: Each record node aggregates information from its attribute neighbors (phone, address, merchant). After 2 layers of message passing over edges in both directions, each record's representation encodes its own attributes and the other records that share them.
  2. Compute similarity: For each candidate pair, compute the similarity (dot product, cosine, or learned metric) between their GNN-encoded representations.
  3. Classify: A threshold or learned classifier determines whether the pair is a match (same entity) or not.
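The three steps can be sketched without any learned weights. The example below replaces the trained GNN with plain mean aggregation, so it illustrates the data flow only. The toy graph, the one-hot attribute features, and the 0.5 threshold are all assumptions chosen for illustration.

```python
import math

# Hypothetical toy graph: record -> attribute nodes it is linked to.
record_attrs = {
    "r1": ["phone:555-0100", "addr:123 Main St", "merchant:grocer"],
    "r2": ["phone:555-0100", "addr:77 Oak Ave", "merchant:grocer"],
    "r3": ["phone:555-0199", "addr:9 Elm Rd", "merchant:bakery"],
}
attrs = sorted({a for linked in record_attrs.values() for a in linked})


def mean(vectors):
    """Element-wise mean of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v))
    return sum(x * y for x, y in zip(u, v)) / norm if norm else 0.0


# Initial attribute features: one-hot identity vectors (an assumption).
attr_vec = {a: [1.0 if i == j else 0.0 for j in range(len(attrs))]
            for i, a in enumerate(attrs)}

# Step 1 (encode): two rounds of mean aggregation, no learned weights.
# Round 1: each record averages its attribute neighbors.
record_vec = {r: mean([attr_vec[a] for a in linked])
              for r, linked in record_attrs.items()}
# Round 2: attributes average their records, then records average attributes.
attr_vec = {a: mean([record_vec[r] for r, linked in record_attrs.items() if a in linked])
            for a in attrs}
record_vec = {r: mean([attr_vec[a] for a in linked])
              for r, linked in record_attrs.items()}

# Step 2 (similarity) and step 3 (classify with a fixed threshold).
THRESHOLD = 0.5  # illustrative value, not tuned


def is_match(a, b):
    return cosine(record_vec[a], record_vec[b]) >= THRESHOLD


for a, b in [("r1", "r2"), ("r1", "r3")]:
    print(a, b, round(cosine(record_vec[a], record_vec[b]), 3), is_match(a, b))
```

In a real pipeline the aggregation would carry trainable weights (for example, heterogeneous GNN layers in PyTorch Geometric), and the threshold or classifier would be fit on labeled match and non-match pairs.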

Enterprise applications

  • Customer deduplication: Merging duplicate CRM records improves marketing efficiency and customer 360 views. A large bank may have 10-15% duplicate customer records.
  • Fraud ring detection: Identifying multiple accounts controlled by the same person or syndicate. Fraudsters create slight variations but share devices, IP addresses, and transaction patterns.
  • Data integration: Linking records across databases (e.g., combining customer data after a bank merger) where schemas and identifiers differ.
  • Know Your Customer (KYC): Matching customer records against sanctions lists and adverse media across languages and name transliterations.

Results

Graph-based entity resolution consistently outperforms text-only approaches:

  • On records with high text overlap (>80% Jaccard): both methods achieve ~95% F1, graph adds 1-2%
  • On records with medium text overlap (40-80%): text achieves 75% F1, graph achieves 88% F1
  • On records with low text overlap (<40%): text achieves 45% F1, graph achieves 72% F1

The value of graph structure is largest when text similarity is ambiguous, and those are precisely the hard cases that matter most.
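The guide does not say how these F1 scores were measured. One common convention for entity resolution is pairwise F1, computed over the set of record pairs predicted as matches against the set of true matching pairs. The sketch below shows that calculation on hypothetical labels.

```python
def pairwise_f1(predicted, actual):
    """Precision, recall, and F1 over unordered record pairs."""
    predicted = {frozenset(p) for p in predicted}
    actual = {frozenset(p) for p in actual}
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical ground truth and model output (pair order does not matter).
actual = {("r1", "r2"), ("r3", "r4"), ("r5", "r6"), ("r7", "r8")}
predicted = {("r2", "r1"), ("r3", "r4"), ("r5", "r9")}

precision, recall, f1 = pairwise_f1(predicted, actual)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Two of the three predicted pairs are correct and two of the four true pairs are found, so precision is 2/3, recall is 1/2, and F1 is 4/7.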

Frequently asked questions

What is entity resolution?

Entity resolution is the task of determining whether two records in a database refer to the same real-world entity. For example, 'John Smith, 123 Main St' and 'J. Smith, 123 Main Street' may be the same person. Graph-based approaches use shared connections (same phone number, same address, same transactions) to resolve ambiguity that text matching alone cannot.

How do graphs help with entity resolution?

Graphs add structural context beyond string similarity. Two customer records that share a phone number, a mailing address, and three common merchants are very likely the same person, even if their names are spelled differently. GNNs learn to combine textual similarity with structural overlap to make more accurate matching decisions.

What is the difference between entity resolution and link prediction?

Link prediction predicts whether an edge will form between two existing nodes. Entity resolution predicts whether two nodes are actually the same entity and should be merged. Entity resolution changes the graph topology (merging nodes), while link prediction adds edges between distinct entities.
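The merge step can be sketched with a union-find structure: every pair classified as a match is merged, and each resulting group becomes one entity. The record identifiers and matched pairs below are hypothetical, and merging transitively is one possible policy rather than the only one.

```python
def merge_matches(records, matched_pairs):
    """Group records into entities by merging every matched pair (union-find)."""
    parent = {r: r for r in records}

    def find(r):
        # Walk up to the root, shortening the path along the way.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for r in records:
        clusters.setdefault(find(r), []).append(r)
    return sorted(sorted(c) for c in clusters.values())


records = ["r1", "r2", "r3", "r4", "r5"]
matched_pairs = [("r1", "r2"), ("r2", "r3"), ("r4", "r5")]

entities = merge_matches(records, matched_pairs)
print(entities)  # [['r1', 'r2', 'r3'], ['r4', 'r5']]
```

Because merging is transitive, r1 and r3 end up in the same entity even though they were never matched directly. A single false-positive pair can therefore chain two unrelated clusters together, which is a reason to keep the match threshold conservative before merging.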
