What data is needed for lookalike modeling?

Kumo connects directly to your existing relational tables: PROSPECTS, CUSTOMERS, ORDERS. No ETL or feature engineering required. Write a PQL query and get explainable predictions in minutes.

Use Case #3 · Binary Classification · Lookalike

Lookalike Modeling

“Which prospects most closely resemble our highest-value customers?”

Book a demo and get a free trial of the full platform: research agent, fine-tune capabilities, and forward-deployed engineer support.

A real-world example

Which prospects most closely resemble our highest-value customers?

Traditional lookalike models match prospects to customers based on firmographic overlap — industry, company size, and geography. But the best customers share behavioral and relational patterns that demographics cannot capture: similar product usage trajectories, overlapping vendor ecosystems, and comparable buying cadences. Flat lookalike models miss these signals, diluting outbound targeting and inflating cost-per-acquisition.

Quick answer

Lookalike modeling identifies prospects who resemble your highest-value customers. Traditional approaches match on demographics (industry, size, geography), but graph-based models capture behavioral and relational similarities, like shared vendor ecosystems and comparable buying cadences, that demographics cannot see. Graph lookalikes surface 40% more high-value prospects while reducing cost-per-acquisition by up to 35%.

Approaches compared

4 ways to solve this problem

1. Firmographic Matching

Match prospects to customers based on industry, company size, geography, and tech stack. Score by the number of matching attributes. The standard approach for most outbound sales teams.

Best for

Teams with a clear, narrow ICP where demographics strongly predict fit. Quick to implement with no ML infrastructure.

Watch out for

Assumes demographics determine buying behavior. A 10,000-person Finance company in North America does not buy the same way as another 10,000-person Finance company if one is a hedge fund and the other is a regional bank. Firmographic matching misses these behavioral differences entirely.
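As a rough illustration of attribute-count scoring, the sketch below scores the rows of the sample PROSPECTS table against an ideal customer profile. The `ICP` dictionary and the `firmographic_score` helper are assumptions made for this example; they are not part of any Kumo API.

```python
# Hypothetical ideal customer profile (ICP); values chosen for illustration only.
ICP = {"industry": "Finance", "size": "Enterprise", "region": "North America"}

# Rows from the sample PROSPECTS table.
PROSPECTS = [
    {"prospect_id": "P001", "industry": "Technology", "size": "Mid-Market", "region": "North America"},
    {"prospect_id": "P002", "industry": "Healthcare", "size": "Enterprise", "region": "EMEA"},
    {"prospect_id": "P003", "industry": "Retail", "size": "SMB", "region": "APAC"},
    {"prospect_id": "P004", "industry": "Finance", "size": "Enterprise", "region": "North America"},
]


def firmographic_score(prospect, icp):
    """Count how many ICP attributes the prospect matches exactly."""
    return sum(1 for key, value in icp.items() if prospect.get(key) == value)


scores = {p["prospect_id"]: firmographic_score(p, ICP) for p in PROSPECTS}
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranked)
```

Under this hypothetical ICP, P001 and P002 tie with a single matching attribute each: the score sees nothing beyond the listed attributes.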

2. Collaborative Filtering on Behavior

Find prospects whose behavioral patterns (website visits, content consumption, event attendance) resemble high-value customers. Standard in marketing automation platforms.

Best for

B2C and high-traffic B2B sites where prospect behavioral data is abundant.

Watch out for

Requires prospects to have extensive first-party behavioral data, which they often do not before becoming a lead. Also treats each prospect independently, missing network effects (shared vendors, partner ecosystems).

3. Embedding-Based Similarity (Word2Vec-style)

Learn vector embeddings for companies from transactional data or CRM records. Find prospects whose embeddings are closest to top customers. Popular in data science teams with embedding expertise.

Best for

Teams with large transactional datasets and the infrastructure to compute and serve embeddings at scale.

Watch out for

Embeddings from single-table data miss relational signals. Two companies may look similar in embedding space based on individual attributes but have very different vendor networks, buying cadences, and decision-making structures.
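A minimal sketch of nearest-neighbor lookup in embedding space, assuming company embeddings have already been learned elsewhere. The vectors, the entity names (`cust_x`, `prospect_a`, and so on), and the choice to compare against the centroid of top customers are all hypothetical choices made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Hypothetical 3-dimensional embeddings; real embeddings would be learned and much wider.
top_customers = {"cust_x": [0.9, 0.1, 0.4], "cust_y": [0.7, 0.3, 0.6]}
prospects = {"prospect_a": [0.6, 0.1, 0.5], "prospect_b": [-0.3, 0.9, 0.1]}

seed = centroid(list(top_customers.values()))
similarity = {name: cosine_similarity(vec, seed) for name, vec in prospects.items()}
nearest = sorted(similarity, key=similarity.get, reverse=True)
print(nearest)
```

The ranking depends entirely on what the embeddings encode, which is why embeddings trained on a single table inherit that table's blind spots.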

4. KumoRFM (Graph Neural Networks on Relational Data)

Builds a graph connecting prospects, customers, and orders. GNN embeddings encode not just who each entity is but how they relate to everything else in the graph. Prospects whose relational neighborhoods resemble high-LTV customers surface automatically, even if their firmographics look different.

Best for

B2B companies where buying behavior depends on vendor ecosystems, industry peer networks, and multi-hop relationships that demographics cannot capture.

Watch out for

Requires a customer-prospect graph with meaningful connections (shared vendors, industry relationships, partner networks). If prospects are completely disconnected from your customer base, the graph advantage is smaller.

Key metric: Graph-based lookalike models surface 40% more high-value prospects than demographic matching and reduce cost-per-acquisition by up to 35%, driven by relational similarity signals invisible to flat models.

Why relational data changes the answer

Prospect P004 (Summit Financial, Finance, Enterprise, North America) looks like a perfect firmographic match: same industry and size as your top customers. Firmographic matching would just as readily rule out P003 (Cascade Retail, Retail, SMB, APAC), which looks nothing like your ICP. On these two prospects the graph agrees: P004 shares a vendor network with 2 Platinum-tier customers and has a behavioral similarity score of 0.88 to top customers, while P003 has no vendor network overlap and a similarity score of 0.12. Those are the obvious calls. The less obvious one is P002 (Meridian Health, Healthcare, Enterprise, EMEA). Firmographics say Healthcare in EMEA is outside your ICP, but the graph reveals that Meridian shares 3 vendor relationships with your best Healthcare customer (Ironclad Health), and accounts in this vendor network convert at 2.4x the base rate.

These relational neighborhoods are invisible to any model that operates on flat prospect attributes. The GNN computes embeddings that encode the full neighborhood structure: which vendors a prospect uses, which industry peers they are connected to, and how similar those peers' buying patterns are to your customers'. On the SAP SALT benchmark, relational models achieve 91% accuracy vs 75% for single-table approaches. For lookalike modeling specifically, the relational advantage is even larger because the task is inherently about similarity, and similarity in a graph is fundamentally richer than similarity in a flat table.
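To make the idea of a relational neighborhood concrete, the sketch below hand-computes one simple one-hop signal: vendor overlap between prospects and Platinum-tier customers. The vendor names and sets are hypothetical, chosen only to reproduce the overlap counts described above. A GNN would learn signals like this from the graph rather than having them coded by hand, so treat this as an illustration of the signal, not of Kumo's method.

```python
# Hypothetical vendor sets for the two Platinum-tier customers (CU01, CU03).
PLATINUM_CUSTOMER_VENDORS = {
    "CU01": {"vendor_a", "vendor_b"},
    "CU03": {"vendor_c", "vendor_d", "vendor_e"},
}

# Hypothetical vendor sets for three of the sample prospects.
PROSPECT_VENDORS = {
    "P002": {"vendor_c", "vendor_d", "vendor_e"},
    "P003": {"vendor_z"},
    "P004": {"vendor_a", "vendor_c"},
}


def vendor_overlap(prospect_vendors, customer_vendors):
    """Return {customer_id: shared vendors} for customers sharing at least one vendor."""
    overlap = {}
    for customer_id, vendors in customer_vendors.items():
        shared = prospect_vendors & vendors
        if shared:
            overlap[customer_id] = shared
    return overlap


summary = {}
for prospect_id, vendors in PROSPECT_VENDORS.items():
    overlap = vendor_overlap(vendors, PLATINUM_CUSTOMER_VENDORS)
    summary[prospect_id] = {
        "customers_overlapping": len(overlap),
        "max_shared_vendors": max((len(s) for s in overlap.values()), default=0),
    }
print(summary)
```

None of these numbers can be derived from the industry, size, or region columns, which is the point of the comparison with flat models.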

Firmographic lookalike modeling is like recommending books based on genre alone. A relational model also sees which books share the same readers, which authors cite each other, and which bookstores shelve them together. Two books in different genres (a business biography and a technology thriller) might have 80% reader overlap, but genre matching would never surface the connection. The relational neighborhood is what reveals true similarity.

How KumoRFM solves this

Relational intelligence for smarter acquisition

Kumo builds a graph connecting PROSPECTS, CUSTOMERS, and ORDERS. The GNN learns embeddings that encode not just who each entity is, but how they relate to everything else in the graph. Prospects whose relational neighborhoods resemble high-LTV customers surface automatically — even if their firmographics look nothing alike. The model discovers hidden patterns like 'prospects in the same vendor network as your top 10 accounts' that no rule-based system could find.

From data to predictions

See the full pipeline in action

Connect your tables, write a PQL query, and get predictions with built-in explainability — all in minutes, not months.

Your data

The relational tables Kumo learns from

PROSPECTS

prospect_id	company	industry	size	region
P001	Apex Systems	Technology	Mid-Market	North America
P002	Meridian Health	Healthcare	Enterprise	EMEA
P003	Cascade Retail	Retail	SMB	APAC
P004	Summit Financial	Finance	Enterprise	North America

CUSTOMERS

customer_id	company	industry	ltv_tier
CU01	Atlas Corp	Finance	Platinum
CU02	Pinnacle Tech	Technology	Gold
CU03	Ironclad Health	Healthcare	Platinum

ORDERS

order_id	customer_id	amount	timestamp
O701	CU01	$156,000	2025-09-15
O702	CU01	$89,000	2025-10-20
O703	CU02	$62,000	2025-10-01
O704	CU03	$134,000	2025-11-05
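ORDERS links to CUSTOMERS through `customer_id`. As a sketch of the kind of join-and-aggregate step a flat-feature pipeline would write by hand, the plain-Python example below totals order amounts per customer from the sample rows. It does not use the Kumo SDK, and the variable names are illustrative.

```python
from collections import defaultdict

# Rows copied from the sample CUSTOMERS and ORDERS tables.
CUSTOMERS = {
    "CU01": {"company": "Atlas Corp", "industry": "Finance", "ltv_tier": "Platinum"},
    "CU02": {"company": "Pinnacle Tech", "industry": "Technology", "ltv_tier": "Gold"},
    "CU03": {"company": "Ironclad Health", "industry": "Healthcare", "ltv_tier": "Platinum"},
}
ORDERS = [
    ("O701", "CU01", 156_000, "2025-09-15"),
    ("O702", "CU01", 89_000, "2025-10-20"),
    ("O703", "CU02", 62_000, "2025-10-01"),
    ("O704", "CU03", 134_000, "2025-11-05"),
]

# Sum order amounts per customer via the customer_id foreign key.
revenue = defaultdict(int)
for _order_id, customer_id, amount, _timestamp in ORDERS:
    revenue[customer_id] += amount

for customer_id, total in sorted(revenue.items()):
    print(customer_id, CUSTOMERS[customer_id]["ltv_tier"], total)
```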

Write your PQL query

Describe what to predict in 2–3 lines — Kumo handles the rest

PQL

PREDICT SUM(ORDERS.AMOUNT, 0, 90, days) > 5000
FOR EACH PROSPECTS.PROSPECT_ID
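One way to read the target in this query: for each entity, sum the order amounts that fall in the 90 days after an anchor date and check whether the total exceeds 5000. The sketch below computes that label by hand. The anchor date, the `label` helper, and the exact window boundaries (start excluded, end included) are assumptions for illustration. It is keyed by customer ID because the sample ORDERS rows reference `customer_id`.

```python
from datetime import date, timedelta

# Rows from the sample ORDERS table: (order_id, customer_id, amount, timestamp).
ORDERS = [
    ("O701", "CU01", 156_000, date(2025, 9, 15)),
    ("O702", "CU01", 89_000, date(2025, 10, 20)),
    ("O703", "CU02", 62_000, date(2025, 10, 1)),
    ("O704", "CU03", 134_000, date(2025, 11, 5)),
]


def label(entity_id, anchor, orders, window_days=90, threshold=5000):
    """True if the entity's orders in (anchor, anchor + window_days] sum above threshold."""
    end = anchor + timedelta(days=window_days)
    total = sum(
        amount
        for _order_id, owner_id, amount, timestamp in orders
        if owner_id == entity_id and anchor < timestamp <= end
    )
    return total > threshold


anchor = date(2025, 10, 15)  # hypothetical anchor date
labels = {cid: label(cid, anchor, ORDERS) for cid in ("CU01", "CU02", "CU03")}
print(labels)
```

With this anchor, CU02's only order falls before the window opens, so its label is False.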

Prediction output

Every entity gets a score, updated continuously

PROSPECT_ID	TIMESTAMP	TARGET_PRED	True_PROB
P001	2025-11-01	True	0.76
P002	2025-11-01	True	0.84
P003	2025-11-01	False	0.14
P004	2025-11-01	True	0.91
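A score table like this is typically consumed by ranking and thresholding. The sketch below builds an outbound shortlist from the sample rows. The 0.8 cutoff and the `shortlist` helper are hypothetical; the right threshold depends on team capacity and the cost of a missed prospect.

```python
# Rows from the sample prediction output.
PREDICTIONS = [
    {"PROSPECT_ID": "P001", "TARGET_PRED": True, "True_PROB": 0.76},
    {"PROSPECT_ID": "P002", "TARGET_PRED": True, "True_PROB": 0.84},
    {"PROSPECT_ID": "P003", "TARGET_PRED": False, "True_PROB": 0.14},
    {"PROSPECT_ID": "P004", "TARGET_PRED": True, "True_PROB": 0.91},
]


def shortlist(predictions, min_prob=0.8):
    """Keep rows at or above min_prob, highest probability first."""
    kept = [row for row in predictions if row["True_PROB"] >= min_prob]
    return sorted(kept, key=lambda row: row["True_PROB"], reverse=True)


top = [row["PROSPECT_ID"] for row in shortlist(PREDICTIONS)]
print(top)
```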

Understand why

Every prediction includes feature attributions — no black boxes

Prospect P004: Summit Financial

Predicted: True (91% probability)

Top contributing features

Feature	Value	Attribution
Shares vendor network with 2 Platinum-tier customers	2 overlaps	33%
Industry: Finance (matches top LTV segment)	Finance	25%
Enterprise size with similar employee distribution	Enterprise	19%
Region: North America (highest close-rate region)	North America	14%
Behavioral similarity score to Platinum customers	0.88	9%

Feature attributions are computed automatically for every prediction. No separate tooling required.

PQL Documentation

Learn the Predictive Query Language — SQL-like syntax for defining any prediction task in 2–3 lines.

Python SDK

Integrate Kumo predictions into your pipelines. Train, evaluate, and deploy models programmatically.

Explainability Docs

Understand feature attributions, model evaluation metrics, and how to build trust with stakeholders.

Frequently asked questions

Common questions about lookalike modeling

How do graph-based lookalike models differ from Facebook lookalike audiences?

Facebook lookalikes match on demographic and behavioral features within Facebook's data. Graph-based lookalikes use your own first-party relational data: customer-to-prospect connections, vendor networks, industry relationships, and transaction patterns. The graph captures business-specific similarity signals that no third-party platform has access to.

Can lookalike models find prospects in new industries?

Yes. This is where graph models excel. If a Healthcare prospect shares vendor relationships and buying patterns with your best Finance customers, the graph surfaces this cross-industry match. Firmographic models would exclude Healthcare entirely because it is outside your historical ICP. Graph models discover new market segments automatically.

How many seed customers do you need for lookalike modeling?

Graph models require fewer seeds than demographic approaches because they extract more signal per customer. As few as 20-30 high-value customers can produce meaningful lookalike results when the relational graph is rich. The key is the quality of connections (order history, vendor relationships, industry links), not the quantity of seeds.

What is the ROI of graph-based lookalike modeling?

Graph lookalikes surface 40% more high-value prospects than demographic matching, reducing cost-per-acquisition by up to 35%. For a B2B company spending $5M on outbound, this translates to up to $1.75M in savings or 40% more qualified pipeline from the same budget.
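The savings figure is the 35% reduction applied to the full outbound budget. The sketch below reproduces that arithmetic, assuming the whole budget scales with cost-per-acquisition and the number of customers acquired stays constant. The `cpa_savings` helper is illustrative only.

```python
def cpa_savings(outbound_budget, cpa_reduction_pct):
    """Savings when the same number of customers is acquired at a lower CPA.

    Uses integer percent arithmetic to avoid floating-point rounding.
    """
    return outbound_budget * cpa_reduction_pct // 100


savings = cpa_savings(5_000_000, 35)
print(savings)
```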

Bottom line: Graph-based lookalike models surface 40% more high-value prospects than demographic matching alone, reducing cost-per-acquisition by up to 35%.

Topics covered

lookalike modeling AI, prospect scoring, customer similarity model, lookalike audience prediction, graph-based lookalike, relational deep learning, KumoRFM, high-value customer modeling, TAM expansion, ideal customer profile, predictive prospecting

From a leadership team with proven experience

Vanja Josifovski

CEO and Co-Founder, ex-CTO Airbnb, ex-CTO Pinterest

Jure Leskovec

Co-Founder & Chief Scientist, Stanford Professor

Hema Raghavan

Co-Founder & Head of Engineering, ex-AI Lead, LinkedIn

One Platform. One Model. Infinite Predictions.

KumoRFM

Relational Foundation Model

Turn structured relational data into predictions in seconds. KumoRFM delivers zero-shot predictions that rival months of traditional data science. No training, feature engineering, or infrastructure required. Just connect your data and start predicting.

For critical use cases, fine-tune KumoRFM on your data using the Kumo platform and Research Agent for 30%+ higher accuracy than traditional models.
