Entity Resolution: The Data Quality Problem Nobody Talks About

Your database says you have 10 million customers. You actually have 7 million. The other 3 million are duplicates, and they are quietly corrupting every model, every campaign, and every report in your organization.

TL;DR

  • The average enterprise database has a 15-25% duplicate rate. For 10 million customer records, that is 1.5-2.5 million duplicates corrupting every downstream model, campaign, and compliance report.
  • Deterministic rules (exact email match, name + zip code) handle the easy 80% of duplicates. The remaining 20% causes outsized damage: fragmented CLV, redundant marketing, and incomplete KYC profiles.
  • Behavioral fingerprinting resolves records that share no attributes. Two records with different names but identical merchant preferences, transaction timing, and payment methods are almost certainly the same person.
  • Graph ML achieves 94.9% F1 on entity resolution by combining attribute similarity with behavioral fingerprints, shared network analysis, temporal continuity, and transitive match propagation through clusters.
  • Resolving duplicates has cascading benefits: 22% marketing cost reduction, 35% more accurate CLV models, and reduced compliance risk where a single BSA/AML enforcement action averages $18M in fines.

Gartner estimates that poor data quality costs the average organization $12.9 million per year. The largest single contributor to that cost is duplicate records: the same customer appearing as two, three, or ten different entries across your systems.

This is not a minor data hygiene issue. Duplicate records mean your churn model counts one churning customer as two retained ones. Your CLV model underestimates high-value customers whose spend is fragmented across records. Your marketing team sends three copies of the same campaign to the same person. Your compliance team fails KYC reviews because the full relationship picture is spread across unlinked records.

The reason this problem persists is not that nobody has tried to solve it. It is that the standard approaches hit a ceiling, and the remaining 15-20% of unresolved duplicates cause outsized damage.

customer_records (suspected duplicates)

| record_id | name | email | address | phone |
|---|---|---|---|---|
| R-001 | Robert J. Smith | rsmith@gmail.com | 123 Main St, Apt 4B | 555-0142 |
| R-002 | Bob Smith | bob.smith@work.com | 123 Main Street #4B | 555-0142 |
| R-003 | R. Smith | | P.O. Box 881, 10001 | |
| R-004 | Robert Smith Jr. | rjsmith@outlook.com | 456 Oak Ave | 555-0199 |
| R-005 | Roberto Smith | rsmith@gmail.com | 123 Main St, 4B | |

R-001, R-002, R-003, and R-005 are likely the same person. R-004 is a different person (son). Deterministic rules struggle here: email matches R-001/R-005 but not R-002/R-003. Address matches partially. Name varies wildly.

transaction_patterns (last 6 months)

| record_id | top_merchants | avg_txn | preferred_time | payment_method |
|---|---|---|---|---|
| R-001 | Whole Foods, Shell, Amazon | $67.40 | Evenings | Visa ending 4821 |
| R-002 | Whole Foods, Shell, Amazon | $72.10 | Evenings | Debit ending 9033 |
| R-003 | Whole Foods, Shell Gas, AMZN | $68.90 | Evenings | Visa ending 4821 |
| R-004 | GameStop, Chipotle, Spotify | $23.50 | Afternoons | Apple Pay |
| R-005 | Whole Foods, Shell, Amazon | $65.20 | Evenings | Visa ending 4821 |

R-001, R-002, R-003, and R-005 share near-identical behavioral fingerprints. R-004 has a completely different pattern. The graph confirms what attributes alone cannot.

How entities become duplicated

Duplicates enter databases through four primary channels, and understanding these channels explains why simple matching rules fail.

Data entry variation

"Robert Smith" and "Bob Smith" are the same person. "123 Main Street, Apt 4B" and "123 Main St #4B" are the same address. "ABC Corporation" and "ABC Corp." and "A.B.C. Corporation" are the same company. Every field that accepts free text generates variation. At enterprise scale, with millions of records entered by thousands of people over years, the variation is enormous.
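To make the variation concrete, here is a minimal normalization sketch. The nickname and abbreviation tables are illustrative assumptions, not a production dictionary; real systems use much larger lookup tables and phonetic encodings on top of this.

```python
import re

# Illustrative (assumed) lookup tables; production systems use far larger ones.
NICKNAMES = {"bob": "robert", "rob": "robert", "bill": "william"}
ABBREVIATIONS = {"st": "street", "ave": "avenue", "apt": "apartment",
                 "corp": "corporation"}

def normalize(text: str) -> str:
    """Lowercase, tokenize, and canonicalize nicknames and abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    out = []
    for t in tokens:
        t = NICKNAMES.get(t, t)
        t = ABBREVIATIONS.get(t, t)
        out.append(t)
    return " ".join(out)

print(normalize("Bob Smith"))                # robert smith
print(normalize("123 Main Street, Apt 4B"))  # 123 main street apartment 4b
print(normalize("123 Main St #4B"))          # 123 main street 4b
```

Note that even after normalization, "Apt 4B" and "#4B" still differ, which is exactly why normalization alone cannot close the gap.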

System mergers and acquisitions

When companies merge, their databases merge. Customer A in System 1 may be the same as Customer B in System 2, but the identifiers are different, the schemas are different, and the field formats are different. A large bank that has completed three acquisitions in the last decade may have the same customer in four separate systems with four different customer IDs.

Channel proliferation

A customer who shops in-store, online, and through a mobile app may be three separate records: one with a loyalty card number, one with an email address, and one with a phone number. None of the identifiers overlap. The only way to connect them is through behavioral and relational signals.

Life events

Name changes (marriage, legal changes), address changes (moves), phone number changes, and email changes all create potential duplicates. A customer who moved and changed their name will not match their previous record on any deterministic attribute.

Why deterministic matching hits a ceiling

The standard approach to entity resolution is deterministic rule matching: if email matches exactly, merge. If first name, last name, and zip code match, merge. If phone number matches, merge. These rules are fast, interpretable, and they handle the easy 80%.

The problem is the other 20%.

A rule that merges on exact email match misses customers who use different email addresses for different channels (work email for B2B, personal email for consumer). A rule that merges on name plus zip code generates false positives for common names in dense zip codes (there are 347 "John Smith" records in the 10001 zip code). Tightening the rules reduces false positives but increases false negatives. Loosening the rules does the opposite.

This is the fundamental limitation: deterministic rules optimize along a single dimension (attribute similarity) and cannot break through the precision-recall tradeoff without additional signal.
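The rules described above can be sketched in a few lines. The record fields are taken from the earlier table; the rule set itself is the standard deterministic pattern, not any specific vendor's implementation.

```python
def deterministic_match(a: dict, b: dict) -> bool:
    """Standard deterministic rule set: any rule firing means merge."""
    # Rule 1: exact email match
    if a.get("email") and a["email"] == b.get("email"):
        return True
    # Rule 2: exact name + zip match
    if a.get("name") and (a["name"], a.get("zip")) == (b.get("name"), b.get("zip")):
        return True
    # Rule 3: exact phone match
    if a.get("phone") and a["phone"] == b.get("phone"):
        return True
    return False

r1 = {"name": "Robert J. Smith", "email": "rsmith@gmail.com", "phone": "555-0142"}
r2 = {"name": "Bob Smith", "email": "bob.smith@work.com", "phone": "555-0142"}
r3 = {"name": "R. Smith", "email": None, "phone": None}

print(deterministic_match(r1, r2))  # True  (shared phone)
print(deterministic_match(r1, r3))  # False (no attribute overlaps at all)
```

R-003 is exactly the record these rules can never catch: no field matches exactly, so no rule can fire, no matter how the thresholds are tuned.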

Probabilistic matching is better, but still limited

Probabilistic matching (Fellegi-Sunter models) assigns weights to each matching field based on how discriminating it is. An exact match on Social Security number is weighted higher than an exact match on first name. The total weight determines the match probability.

This is a meaningful improvement over deterministic rules because it handles partial matches and field-level confidence. But it still only looks at attributes on the record itself. Two records with completely different attributes but identical behavioral patterns are invisible to probabilistic matching.
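A minimal Fellegi-Sunter weight calculation looks like this. The m and u probabilities (probability a field agrees given a true match, and given a non-match) are illustrative assumptions; in practice they are estimated from labeled pairs or via EM.

```python
import math

# Assumed, illustrative m/u probabilities (not fitted to real data):
# m = P(field agrees | true match), u = P(field agrees | non-match)
FIELDS = {
    "ssn":        {"m": 0.95, "u": 0.0001},
    "email":      {"m": 0.70, "u": 0.001},
    "first_name": {"m": 0.90, "u": 0.05},
    "zip":        {"m": 0.92, "u": 0.10},
}

def match_weight(agreements: dict) -> float:
    """Sum log2 likelihood ratios across compared fields (Fellegi-Sunter)."""
    w = 0.0
    for field, agrees in agreements.items():
        m, u = FIELDS[field]["m"], FIELDS[field]["u"]
        if agrees:
            w += math.log2(m / u)              # agreement adds evidence
        else:
            w += math.log2((1 - m) / (1 - u))  # disagreement subtracts evidence
    return w

# An SSN agreement carries far more weight than a first-name agreement.
print(round(match_weight({"ssn": True}), 1))         # ~13.2
print(round(match_weight({"first_name": True}), 1))  # ~4.2
```

The total weight is then compared against upper and lower thresholds to classify the pair as match, non-match, or clerical review.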

Attribute-only matching

  • Compares name, address, email, phone fields
  • Rules handle the easy 80% of duplicates
  • False positives on common names in dense areas
  • Misses customers with different identifiers per channel
  • Cannot resolve records after name/address changes

Relational + behavioral matching

  • Compares transaction patterns, product preferences, timing
  • Identifies shared connections and network overlap
  • Resolves cross-channel identities through behavioral similarity
  • Handles name/address changes by matching on behavior continuity
  • Catches the remaining 15-20% that rules miss

How graph ML transforms entity resolution

The breakthrough in entity resolution comes from looking beyond the record to the relational context around it. Two records that share no attributes may still be clearly the same entity when you look at what they are connected to.

Behavioral fingerprinting

Every customer has a behavioral fingerprint: the products they buy, the times they shop, the payment methods they use, the stores they visit, the categories they browse. Two records with different names and email addresses but near-identical behavioral fingerprints are almost certainly the same person. Graph ML learns these fingerprints from the transaction-product-store graph, not from manually engineered features.
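As a simplified sketch of the idea, a fingerprint can be represented as a sparse vector of merchant-level activity and compared with cosine similarity. The spend counts below are invented for illustration; the learned embeddings a graph model produces capture far richer signal than raw counts.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse behavioral fingerprints."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical merchant-visit counts for the records in the tables above.
r001 = Counter({"whole_foods": 12, "shell": 8, "amazon": 6})
r002 = Counter({"whole_foods": 11, "shell": 9, "amazon": 5})
r004 = Counter({"gamestop": 7, "chipotle": 10, "spotify": 4})

print(round(cosine(r001, r002), 3))  # near 1.0: near-identical behavior
print(round(cosine(r001, r004), 3))  # 0.0: no merchant overlap
```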

Shared network analysis

If Record A and Record B both transact with the same set of merchants, live in the same building (different apartment numbers), and share a phone number with the same emergency contact, the probability that they are the same person is high. These shared connections are edges in the graph. The model measures overlap in the local neighborhood of each record to assess match probability.
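Neighborhood overlap can be sketched with a Jaccard score over each record's set of connected entities. The entity labels below are hypothetical examples of graph neighbors; production models use learned neighborhood embeddings rather than a single overlap statistic.

```python
def jaccard(neighbors_a: set, neighbors_b: set) -> float:
    """Overlap of two records' graph neighborhoods: |A ∩ B| / |A ∪ B|."""
    if not neighbors_a or not neighbors_b:
        return 0.0
    return len(neighbors_a & neighbors_b) / len(neighbors_a | neighbors_b)

# Hypothetical neighbors: merchants, buildings, and shared contacts are all edges.
a = {"merchant:whole_foods", "merchant:shell", "building:123_main",
     "contact:555-0101"}
b = {"merchant:whole_foods", "merchant:shell", "building:123_main",
     "contact:555-0101", "merchant:amazon"}

print(round(jaccard(a, b), 2))  # 0.8: strong shared-network evidence
```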

Temporal pattern matching

When Record A stops transacting and Record B starts transacting in the same geographic area with similar patterns, this temporal handoff suggests a single customer who changed identifiers. The model learns these temporal discontinuities as evidence of entity continuity.

temporal_handoff (suspected identity change)

| date | R-001 transactions | R-003 transactions | location |
|---|---|---|---|
| Jan 2025 | Whole Foods $72, Shell $45 | | Manhattan |
| Feb 2025 | Amazon $134, Shell $51 | | Manhattan |
| Mar 2025 | Whole Foods $68 | | Manhattan |
| Apr 2025 | | Whole Foods $71, Shell $48 | Manhattan |
| May 2025 | | Amazon $142, Shell $52 | Manhattan |

R-001 stops transacting in March. R-003 starts in April with near-identical patterns at the same merchants, in the same area. This temporal handoff strongly suggests the same person with a new record (perhaps a replaced card or new account).
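A handoff check of this kind can be sketched as a simple heuristic: one record goes quiet, another appears shortly afterward with highly similar behavior. The gap window and similarity threshold here are assumed values for illustration, not tuned parameters.

```python
from datetime import date

def temporal_handoff(last_seen_a: date, first_seen_b: date,
                     behavioral_sim: float, max_gap_days: int = 60,
                     min_sim: float = 0.85) -> bool:
    """Flag a handoff: A stops, B starts soon after with similar behavior."""
    gap = (first_seen_b - last_seen_a).days
    return 0 < gap <= max_gap_days and behavioral_sim >= min_sim

# R-001 last transacts in mid-March; R-003 starts in early April.
print(temporal_handoff(date(2025, 3, 15), date(2025, 4, 2),
                       behavioral_sim=0.91))  # True
```

A graph model learns this pattern from data rather than from hand-set thresholds, but the underlying evidence, temporal continuity plus behavioral similarity, is the same.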

Transitivity and graph propagation

If Record A matches Record B with 95% confidence, and Record B matches Record C with 90% confidence, what is the probability that A and C are the same entity? Deterministic systems cannot propagate matches through chains. Graph models handle transitivity natively because the information flows along edges. This resolves clusters of duplicates that pairwise matching misses.

transitive_match_chain

| pair | direct_attribute_sim | direct_match_prob | transitive_evidence |
|---|---|---|---|
| R-001 / R-002 | 0.61 | 0.72 | Shared phone: 555-0142 |
| R-002 / R-005 | 0.38 | 0.68 | Same work email domain + Visa 4821 |
| R-001 / R-005 | 0.72 | 0.81 | Shared email: rsmith@gmail.com |
| R-001 / R-003 | 0.24 | 0.41 | No direct link strong enough |

Pairwise rules set at a 0.70 threshold would match R-001/R-002 and R-001/R-005, but miss R-002/R-005 (score 0.68) and R-001/R-003 (score 0.41). Graph propagation chains the evidence: R-001 matches R-002 (phone), R-002 matches R-005 (email domain), R-005 matches R-003 (Visa ending 4821 plus merchant overlap). The full cluster resolves through transitivity.
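The cluster resolution step can be sketched with a union-find pass over thresholded pairwise matches. The R-005/R-003 probability below is an assumed value standing in for the Visa-plus-merchant evidence described in the text; the other probabilities come from the table above.

```python
def resolve_clusters(pairs: dict, threshold: float = 0.65) -> list:
    """Union-find over pairwise matches: transitivity merges chains into clusters."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for (a, b), prob in pairs.items():
        if prob >= threshold:
            union(a, b)

    clusters = {}
    for rec in parent:
        clusters.setdefault(find(rec), set()).add(rec)
    return sorted(clusters.values(), key=len, reverse=True)

pairs = {
    ("R-001", "R-002"): 0.72,
    ("R-002", "R-005"): 0.68,
    ("R-001", "R-005"): 0.81,
    ("R-001", "R-003"): 0.41,  # too weak on its own
    ("R-005", "R-003"): 0.70,  # assumed value: Visa 4821 + merchant overlap
}
print(resolve_clusters(pairs))  # R-003 joins the cluster via R-005
```

R-003 ends up in the cluster even though its only strong edge is to R-005, which is precisely what pairwise thresholding against R-001 alone would have missed.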

Real-world impact of better entity resolution

When a large retailer resolved its duplicate customer records using relational matching, three things happened.

Marketing efficiency improved by 22%. The marketing team had been sending multiple copies of campaigns to the same customers through different records. Resolving duplicates reduced campaign volume while maintaining reach, cutting direct marketing costs by 22% with no drop in response rates.

CLV models became 35% more accurate. High-value customers whose spend was fragmented across multiple records had been undervalued. After resolution, the CLV model correctly identified the top decile, improving retention program targeting and increasing retained revenue by $18 million annually.

Compliance risk dropped. Financial institutions are required to maintain a single customer view for KYC and AML compliance. Unresolved duplicates create fragmented profiles that regulators flag as deficiencies. Better entity resolution reduces the risk of regulatory fines that can reach hundreds of millions of dollars.

resolution_method_comparison

| method | precision | recall | F1 score | false_merge_rate |
|---|---|---|---|---|
| Exact email match | 99.1% | 42.3% | 59.3% | 0.9% |
| Name + zip code | 84.7% | 68.9% | 75.9% | 15.3% |
| Probabilistic (Fellegi-Sunter) | 91.2% | 78.4% | 84.3% | 8.8% |
| Behavioral + relational ML | 96.8% | 93.1% | 94.9% | 3.2% |

Behavioral and relational matching achieves 94.9% F1 by combining attribute signals with transaction patterns, shared networks, and temporal continuity.
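The F1 column follows directly from the precision and recall columns, as a quick check shows:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Reproduce the F1 column of the comparison table above.
print(round(f1(0.991, 0.423), 3))  # 0.593  exact email match
print(round(f1(0.912, 0.784), 3))  # 0.843  probabilistic
print(round(f1(0.968, 0.931), 3))  # 0.949  behavioral + relational ML
```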

PQL Query

```
PREDICT match_probability
FOR EACH customer_pairs.pair_id
WHERE customer_pairs.attribute_similarity > 0.3
```

Score all candidate record pairs for match probability. The model considers not just name/address similarity but behavioral fingerprints, shared merchant networks, payment method overlap, and temporal transaction patterns.

Output

| pair | attribute_sim | behavioral_sim | match_prob | verdict |
|---|---|---|---|---|
| R-001 / R-002 | 0.61 | 0.94 | 0.97 | Match |
| R-001 / R-003 | 0.38 | 0.91 | 0.93 | Match |
| R-001 / R-005 | 0.72 | 0.96 | 0.98 | Match |
| R-001 / R-004 | 0.54 | 0.12 | 0.08 | No match |

The foundation model approach

Traditional ML entity resolution requires training a custom model: generating labeled pairs (match/non-match), engineering features from both records and their relational context, training a classifier, and calibrating thresholds. This works well but requires 3-6 months of data science effort and ongoing maintenance.

KumoRFM applies a foundation model approach to entity resolution. The model has been pre-trained on relational patterns across thousands of databases. It already understands the universal patterns that indicate entity matches: behavioral similarity, network overlap, temporal continuity, and attribute fuzzy matching.

You connect your database, specify which records to resolve, and the model produces match probabilities based on the full relational context. No feature engineering, no labeled training data, no custom model. The pre-trained understanding of relational patterns transfers to your specific data.

The implication is significant. Entity resolution has always been treated as a data engineering project: months of work to build matching rules, ongoing maintenance as data changes, and a ceiling on accuracy that attribute-only matching cannot break through. Graph ML on relational data raises that ceiling. Foundation models eliminate the engineering cost to reach it.

If your database has more than a million customer records and you have not performed entity resolution in the last 12 months, you are almost certainly making decisions based on fragmented data. The question is not whether you have duplicates. It is how much they are costing you.

Frequently asked questions

What is entity resolution?

Entity resolution is the process of determining whether two or more records in a database refer to the same real-world entity. For example, 'John Smith at 123 Main St' and 'J. Smith at 123 Main Street' might be the same person, or they might not. Entity resolution goes beyond simple string matching to consider behavioral patterns, relational context, and transactional history to make accurate match decisions at scale.

Why is entity resolution so hard?

Three reasons: (1) Name and address variations are nearly infinite (abbreviations, misspellings, maiden names, P.O. boxes vs. street addresses); (2) Deterministic rules that work for 80% of cases produce false positives or false negatives for the remaining 20%, which at enterprise scale means millions of errors; (3) The same person can appear across multiple systems with completely different identifiers, requiring cross-system matching that simple rules cannot handle.

How much do duplicate records cost enterprises?

Gartner estimates that poor data quality costs organizations an average of $12.9 million per year. For customer data specifically, duplicates lead to fragmented customer views that cause redundant marketing spend (mailing the same person twice), inaccurate analytics (counting one customer as two), compliance failures (incomplete KYC profiles), and missed cross-sell opportunities. IBM estimates that bad data costs the US economy $3.1 trillion annually.

How does graph ML improve entity resolution?

Graph ML looks beyond record attributes (name, address, email) to analyze the relational context: do these two records share transaction patterns, interact with the same entities, have similar behavioral sequences, or connect to the same network of contacts? Two records with different names but identical purchase histories at the same stores, similar transaction timing, and shared loyalty accounts are almost certainly the same person. These relational signals are invisible to attribute-only matching.

Can KumoRFM perform entity resolution?

KumoRFM can predict match probability between record pairs by learning from the full relational context of your database. Rather than engineering matching rules or training a custom model, you connect your data and write a predictive query. The foundation model leverages pre-trained relational patterns to identify matches based on behavioral similarity, shared connections, and transactional overlap, not just string similarity.

See it in action

KumoRFM delivers predictions on relational data in seconds. No feature engineering, no ML pipelines. Try it free.