Use Case #2 · Binary Classification · Deduplication

Duplicate Detection

“For each record in the CRM, is there a duplicate entry in the system?”


A real-world example

Duplicate records inflate customer counts by 15-25%, skew analytics, and cause embarrassing double outreach. Traditional dedup rules such as exact email matching miss variations like "john@acme.com" vs. "j.smith@acme-corp.com". Kumo detects duplicates through behavioral overlap: the same purchasing patterns, shared addresses, and overlapping device fingerprints. Each unmerged duplicate costs $10-30 annually in wasted marketing, so enterprises with millions of records face seven-figure losses.
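To make the seven-figure claim concrete, here is a rough back-of-the-envelope sketch. The record count of 1,000,000 is a hypothetical input, and treating the 15-25% figure as the share of stored records that are duplicates is a simplifying assumption; only the percentage and per-duplicate cost ranges come from the text above.

```python
# Hypothetical inputs: only the percentage and dollar ranges come from the text.
TOTAL_RECORDS = 1_000_000          # assumed CRM size
DUPLICATE_SHARE_PCT = (15, 25)     # low/high share of records that are duplicates
COST_PER_DUPLICATE_USD = (10, 30)  # low/high annual cost per unmerged duplicate


def annual_duplicate_cost(total_records, share_pct, cost_usd):
    """Return the yearly cost of leaving duplicates unmerged, in whole dollars."""
    duplicate_records = total_records * share_pct // 100
    return duplicate_records * cost_usd


low = annual_duplicate_cost(
    TOTAL_RECORDS, DUPLICATE_SHARE_PCT[0], COST_PER_DUPLICATE_USD[0]
)
high = annual_duplicate_cost(
    TOTAL_RECORDS, DUPLICATE_SHARE_PCT[1], COST_PER_DUPLICATE_USD[1]
)
print(f"Estimated annual cost: ${low:,} to ${high:,}")
# Estimated annual cost: $1,500,000 to $7,500,000
```

Under these assumptions, even a one-million-record CRM lands in seven-figure territory at both ends of the range.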

How KumoRFM solves this

Relational intelligence for identity resolution

Kumo connects CRM records to their transactions, support interactions, and behavioral signals in a unified relational graph. Instead of comparing email strings, Kumo learns that Record R-101 and Record R-204 share the same purchasing cadence, contact the same support agents, and transact with the same merchants. The binary classifier predicts whether each record has a duplicate anywhere in the system — flagging matches that deterministic rules would never catch.
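For contrast, the following minimal sketch shows the kind of deterministic rule the paragraph above refers to. The `exact_email_match` helper is hypothetical and is not part of any Kumo API; it simply illustrates why string equality on email fails for the R-101 and R-204 example.

```python
# Two sample records that refer to the same person under different emails.
records = {
    "R-101": {"name": "John Smith", "email": "john@acme.com"},
    "R-204": {"name": "J. Smith", "email": "j.smith@acme-corp.com"},
}


def exact_email_match(a, b):
    """Deterministic rule: two records match only if their emails are identical."""
    return a["email"].strip().lower() == b["email"].strip().lower()


is_match = exact_email_match(records["R-101"], records["R-204"])
print(is_match)  # False: the rule misses this pair
```

Both the local part and the domain differ here, so normalizing case or whitespace does not help; the pair can only be linked through other signals.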

From data to predictions

See the full pipeline in action

Connect your tables, write a PQL query, and get predictions with built-in explainability — all in minutes, not months.

Your data

The relational tables Kumo learns from

RECORDS

record_id	name	email	company	source
R-101	John Smith	john@acme.com	Acme Corp	website
R-204	J. Smith	j.smith@acme-corp.com	ACME	trade show
R-350	Maria Lopez	mlopez@bigco.io	BigCo Inc	referral

MATCH_CANDIDATES

match_id	record_id	candidate_id	similarity_score	timestamp
MC-001	R-101	R-204	0.82	2025-09-14
MC-002	R-350	R-612	0.74	2025-09-14
MC-003	R-101	R-550	0.45	2025-09-15

TRANSACTIONS

txn_id	record_id	amount	timestamp
TXN-8001	R-101	$1,249.00	2025-09-10
TXN-8002	R-204	$1,249.00	2025-09-10
TXN-8003	R-350	$487.50	2025-09-12
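As a simple illustration of the behavioral overlap visible in the sample TRANSACTIONS rows, the sketch below groups records that share an identical amount and date. This is a hand-written heuristic for illustration only, not how Kumo's model works internally; the function name is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Sample TRANSACTIONS rows: (txn_id, record_id, amount, date).
transactions = [
    ("TXN-8001", "R-101", "1249.00", "2025-09-10"),
    ("TXN-8002", "R-204", "1249.00", "2025-09-10"),
    ("TXN-8003", "R-350", "487.50", "2025-09-12"),
]


def records_with_identical_transactions(rows):
    """Return sorted pairs of record IDs that share an (amount, date) signature."""
    by_signature = defaultdict(set)
    for _txn_id, record_id, amount, txn_date in rows:
        by_signature[(amount, txn_date)].add(record_id)

    pairs = set()
    for record_ids in by_signature.values():
        # Every pair of distinct records under one signature is a candidate.
        for a, b in combinations(sorted(record_ids), 2):
            pairs.add((a, b))
    return sorted(pairs)


overlapping_pairs = records_with_identical_transactions(transactions)
print(overlapping_pairs)  # [('R-101', 'R-204')]
```

A single shared transaction is weak evidence on its own; the point is that the signal lives in a related table, not in the RECORDS columns a string-matching rule would inspect.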

Write your PQL query

Describe what to predict in 2–3 lines — Kumo handles the rest

PQL

PREDICT COUNT(MATCH_CANDIDATES.*
    WHERE MATCH_CANDIDATES.SIMILARITY_SCORE > 0.8,
    0, 30, days) > 0
FOR EACH RECORDS.RECORD_ID
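To show what the query's target means, here is a plain-Python sketch that computes the same boolean for the sample MATCH_CANDIDATES rows. The anchor date of 2025-09-01 is hypothetical, and the window is assumed to exclude the anchor time and include day 30; check the PQL documentation for the exact boundary semantics.

```python
from datetime import date, timedelta

# Sample rows: (match_id, record_id, candidate_id, similarity_score, timestamp).
match_candidates = [
    ("MC-001", "R-101", "R-204", 0.82, date(2025, 9, 14)),
    ("MC-002", "R-350", "R-612", 0.74, date(2025, 9, 14)),
    ("MC-003", "R-101", "R-550", 0.45, date(2025, 9, 15)),
]
record_ids = ["R-101", "R-204", "R-350"]


def duplicate_label(record_id, anchor, rows, min_score=0.8, window_days=30):
    """True if the record has a high-similarity candidate in the window.

    Assumed window: anchor < timestamp <= anchor + window_days.
    """
    window_end = anchor + timedelta(days=window_days)
    count = sum(
        1
        for _match_id, rid, _candidate_id, score, ts in rows
        if rid == record_id and score > min_score and anchor < ts <= window_end
    )
    return count > 0


anchor = date(2025, 9, 1)  # hypothetical anchor time
labels = {rid: duplicate_label(rid, anchor, match_candidates) for rid in record_ids}
print(labels)  # {'R-101': True, 'R-204': False, 'R-350': False}
```

These are historical labels derived from three sample rows, not model predictions. R-204 is False here only because it never appears in the `record_id` column of the sample.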

Prediction output

Every entity gets a score, updated continuously

RECORD_ID	TIMESTAMP	TARGET_PRED	True_PROB
R-101	2025-10-01	True	0.96
R-204	2025-10-01	True	0.96
R-350	2025-10-01	False	0.18
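One way to act on this output is to route high-probability records to a merge-review queue. The sketch below assumes the predictions have already been loaded into dictionaries; the 0.9 threshold and the `records_to_review` helper are hypothetical choices for illustration.

```python
# Sample prediction rows, keyed by the columns shown in the output table.
predictions = [
    {"record_id": "R-101", "true_prob": 0.96},
    {"record_id": "R-204", "true_prob": 0.96},
    {"record_id": "R-350", "true_prob": 0.18},
]

REVIEW_THRESHOLD = 0.9  # hypothetical cutoff; tune it to your tolerance for false merges


def records_to_review(rows, threshold):
    """Return record IDs whose duplicate probability meets the threshold."""
    return [row["record_id"] for row in rows if row["true_prob"] >= threshold]


review_queue = records_to_review(predictions, REVIEW_THRESHOLD)
print(review_queue)  # ['R-101', 'R-204']
```

A stricter threshold trades recall for fewer false merges, which usually matters more when merges are hard to undo.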

Understand why

Every prediction includes feature attributions — no black boxes

Record R-101 (John Smith, Acme Corp)

Predicted: 96% probability of having a duplicate

Top contributing features

Feature	Value	Attribution
Transaction amount overlap with R-204	Exact match	32%
Company name similarity	0.88	24%
Phone number overlap	Same	20%
Behavioral cadence similarity	0.91	14%
Source channel difference	Different	10%

Feature attributions are computed automatically for every prediction, with no separate tooling required.


Bottom line: Remove the duplicate records that inflate CRM customer counts by 15-25%, correcting those counts, fixing attribution, and saving $1-5M annually in wasted outreach.

