Entity resolution is the task of determining whether two records in a database refer to the same real-world entity. Graph-based approaches represent records as nodes and connect them through the attributes they share (phone numbers, addresses, transaction partners). A graph neural network (GNN) then learns that records with overlapping neighborhoods are likely duplicates, even when their text fields differ significantly.
Why text matching is not enough
Traditional entity resolution relies on string similarity between record fields: fuzzy matching on names, exact matching on phone numbers, Jaccard similarity on addresses. This fails when:
- Names are spelled differently: “Muhammad” vs “Mohammed” vs “Mohamed”
- Addresses change: a person moves but keeps the same phone number
- Records span languages: “Deutsche Bank” vs “German Bank”
- Identities are intentionally obfuscated: fraudsters use slight name variations across accounts
Graph structure provides context that text matching lacks. Two accounts that share 80% of their transaction merchants very likely belong to the same person, regardless of name spelling.
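The contrast between text similarity and structural overlap can be sketched in a few lines of standard-library Python. This is only an illustration: the record layout, the merchant IDs, and the helper names `name_similarity` and `jaccard` are assumptions, not part of any particular entity resolution library.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] using the standard library."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def jaccard(a: set, b: set) -> float:
    """Jaccard overlap of two sets: |A & B| / |A | B|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical records: the names are the cross-language pair from the list
# above; the merchant IDs are invented for illustration.
shared_merchants = {f"m{i}" for i in range(1, 9)}  # m1..m8
record_a = {"name": "Deutsche Bank", "merchants": shared_merchants | {"m9"}}
record_b = {"name": "German Bank", "merchants": shared_merchants | {"m10"}}

# Text-only signal: compares the two name strings character by character.
text_score = name_similarity(record_a["name"], record_b["name"])

# Structural signal: 8 shared merchants out of 10 distinct ones.
graph_score = jaccard(record_a["merchants"], record_b["merchants"])

print(f"name similarity:  {text_score:.2f}")
print(f"merchant overlap: {graph_score:.2f}")
```

With these toy values the merchant overlap is noticeably higher than the name similarity, which is the situation where neighborhood information helps most.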
Graph construction for entity resolution
The graph connects records to their attributes and to each other through shared attributes:
# Build a graph for entity resolution
# Records are nodes, shared attributes create edges
# Node types
graph.add_nodes('record', features=record_embeddings)
graph.add_nodes('phone', features=None) # attribute nodes
graph.add_nodes('address', features=addr_embeddings)
graph.add_nodes('merchant', features=None)
# Edges: records connect to their attributes
graph.add_edges('record', 'has_phone', 'phone')
graph.add_edges('record', 'has_address', 'address')
graph.add_edges('record', 'transacted_at', 'merchant')
# Two records sharing a phone node are likely the same entity
# The GNN learns this from the structural overlap
Shared attribute nodes create implicit connections between potentially duplicate records. The GNN learns which shared attributes are most indicative of duplicates.
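For a version that runs without any graph library, the same structure can be sketched with plain dictionaries. The toy records, field names, and the `attr_to_records` and `shared` variables below are hypothetical; a real pipeline would more likely use a heterogeneous graph container from a GNN framework.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records; all values are invented for illustration.
records = {
    "r1": {"phone": "555-0100", "address": "12 Oak St", "merchants": ["m1", "m2"]},
    "r2": {"phone": "555-0100", "address": "98 Elm Ave", "merchants": ["m1", "m3"]},
    "r3": {"phone": "555-0199", "address": "7 Pine Rd", "merchants": ["m4"]},
}

# Each (type, value) key is an attribute node; its set holds the record
# nodes connected to it by a has_phone / has_address / transacted_at edge.
attr_to_records = defaultdict(set)
for rid, rec in records.items():
    attr_to_records[("phone", rec["phone"])].add(rid)
    attr_to_records[("address", rec["address"])].add(rid)
    for merchant in rec["merchants"]:
        attr_to_records[("merchant", merchant)].add(rid)

# Two records that touch the same attribute node are two hops apart.
# Collect those pairs together with the attribute nodes they share.
shared = defaultdict(set)
for attr, rids in attr_to_records.items():
    for a, b in combinations(sorted(rids), 2):
        shared[(a, b)].add(attr)

for pair, attrs in shared.items():
    print(pair, sorted(attrs))
```

Here only `r1` and `r2` end up connected, through a shared phone node and a shared merchant node, even though their addresses differ.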
GNN approach
The GNN-based entity resolution pipeline has three steps:
- Encode records: Each record node aggregates information from its attribute neighbors (phone, address, merchant). After 2 layers of message passing, each record's representation encodes its attribute neighborhood as well as the other records that share those attributes.
- Compute similarity: For each candidate pair, compute the similarity (dot product, cosine, or learned metric) between their GNN-encoded representations.
- Classify: A threshold or learned classifier determines whether the pair is a match (same entity) or not.
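The three steps above can be sketched in pure Python. This is a deliberately simplified stand-in, not a trained GNN: the layer is an unweighted mean over a node and its neighbors with no learned parameters, attribute nodes get one-hot features, and the toy graph, the `mean_aggregate` helper, and the `THRESHOLD` value are all assumptions made for illustration.

```python
import math

# Hypothetical toy graph: each record lists the attribute nodes it links to.
record_attrs = {
    "r1": ["phone:555-0100", "merchant:m1", "merchant:m2"],
    "r2": ["phone:555-0100", "merchant:m1", "merchant:m3"],
    "r3": ["phone:555-0199", "merchant:m4"],
}

# Undirected adjacency over record and attribute nodes.
neighbors = {}
for rid, attrs in record_attrs.items():
    for attr in attrs:
        neighbors.setdefault(rid, set()).add(attr)
        neighbors.setdefault(attr, set()).add(rid)

# Initial features: one-hot vector per attribute node, zeros for records.
attr_nodes = sorted(n for n in neighbors if n not in record_attrs)
dim = len(attr_nodes)
h = {node: [0.0] * dim for node in neighbors}
for i, attr in enumerate(attr_nodes):
    h[attr][i] = 1.0


def mean_aggregate(h, neighbors):
    """One message-passing round: average a node's vector with its neighbors'."""
    out = {}
    for node, nbrs in neighbors.items():
        group = [h[node]] + [h[n] for n in nbrs]
        out[node] = [sum(col) / len(group) for col in zip(*group)]
    return out


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Step 1: encode records with 2 rounds of message passing.
for _ in range(2):
    h = mean_aggregate(h, neighbors)

# Step 2: compute similarity for candidate pairs.
sim_12 = cosine(h["r1"], h["r2"])
sim_13 = cosine(h["r1"], h["r3"])

# Step 3: classify with a threshold (assumed value; normally tuned on labelled pairs).
THRESHOLD = 0.5
is_match_12 = sim_12 >= THRESHOLD
is_match_13 = sim_13 >= THRESHOLD
```

In this toy graph `r1` and `r2` share a phone and a merchant, so their encodings point in similar directions, while `r3` shares nothing with `r1` and their similarity is zero. A trained model would replace the plain mean with learned weights per edge type, which is how it can learn that some shared attributes matter more than others.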
Enterprise applications
- Customer deduplication: Merging duplicate CRM records improves marketing efficiency and customer 360 views. A large bank may have 10-15% duplicate customer records.
- Fraud ring detection: Identifying multiple accounts controlled by the same person or syndicate. Fraudsters create slight variations but share devices, IP addresses, and transaction patterns.
- Data integration: Linking records across databases (e.g., combining customer data after two banks merge) where schemas and identifiers differ.
- Know Your Customer (KYC): Matching customer records against sanctions lists and adverse media across languages and name transliterations.
Results
Graph-based entity resolution consistently outperforms text-only approaches:
- On records with high text overlap (>80% Jaccard): both methods achieve ~95% F1, graph adds 1-2%
- On records with medium text overlap (40-80%): text achieves 75% F1, graph achieves 88% F1
- On records with low text overlap (<40%): text achieves 45% F1, graph achieves 72% F1
The value of graph structure is largest when text similarity is low or ambiguous, and these are precisely the hard cases that matter most.
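For reference, F1 on a matching task is usually computed over record pairs as the harmonic mean of precision and recall. The sketch below shows that calculation; the `pairwise_f1` helper and the labelled pairs are hypothetical and are not the data behind the figures above.

```python
def pairwise_f1(predicted: set, actual: set) -> float:
    """F1 over record pairs: harmonic mean of precision and recall."""
    true_pos = len(predicted & actual)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(actual)
    return 2 * precision * recall / (precision + recall)


# Hypothetical ground truth and predictions, invented for illustration only.
actual = {("r1", "r2"), ("r3", "r4"), ("r5", "r6"), ("r7", "r8")}
text_pred = {("r1", "r2"), ("r9", "r10")}
graph_pred = {("r1", "r2"), ("r3", "r4"), ("r5", "r6"), ("r9", "r10")}

text_f1 = pairwise_f1(text_pred, actual)
graph_f1 = pairwise_f1(graph_pred, actual)

print(f"text-only F1: {text_f1:.2f}")
print(f"graph F1:     {graph_f1:.2f}")
```

Pairs are stored as ordered tuples here, so a real evaluation would need to normalize pair order before comparing sets.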