What data is needed for lookalike audience modeling?

Kumo connects directly to your existing relational tables: USERS, BEHAVIORS, SEGMENTS, CONVERSIONS, DEMOGRAPHICS. No ETL or feature engineering required. Write a PQL query and get explainable predictions in minutes.

#3 · Ranking · Lookalike Audiences

Lookalike Audience Modeling

“Which users look like our best converters?”

Book a demo and get a free trial of the full platform: research agent, fine-tune capabilities, and forward-deployed engineer support.


A real-world example

Which users look like our best converters?

Platform-native lookalike tools operate on limited signals and treat each user in isolation. They miss the behavioral graph: which content users consume, which products they browse, and how their engagement patterns cluster. For a DTC brand spending $20M on acquisition, a 25% improvement in lookalike quality is worth roughly $5M in incremental revenue from the same ad spend (25% of $20M, assuming acquisition revenue is roughly equal to spend and scales with audience quality).

Quick answer

Graph neural networks build lookalike audiences by learning deep behavioral similarity across browsing patterns, purchase history, content engagement, and demographic signals. Unlike platform-native tools that match on surface-level demographics, GNN-based audiences capture multi-dimensional behavioral patterns, producing 25-40% higher conversion rates from the same ad spend.

Approaches compared

4 ways to solve this problem

1. Platform-native lookalikes (Meta, Google)

Upload a seed list of converters to the ad platform. The platform finds similar users based on its own signals (demographics, interests, in-platform behavior).

Best for

Fast setup, zero engineering. Works well when your seed list is large (10K+) and your product appeals broadly.

Watch out for

You have no visibility into what 'similar' means. The platform optimizes for its own metrics, not yours. Quality degrades sharply as you scale beyond 1-2% lookalike size.

2. Propensity scoring on CRM data

Train a classification model (logistic regression, XGBoost) on your first-party data to score non-converters by conversion likelihood.

Best for

Good when you have rich CRM data and want full control over the model. Interpretable and auditable.

Watch out for

Limited to the features you engineer. Misses behavioral graph signals like 'users who browse similar product sequences' or 'users connected to multiple converters.'
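To make the propensity-scoring approach concrete, here is a minimal sketch of a logistic regression trained by gradient descent and used to rank non-converters. The two features, the training rows, and the user IDs U304 and beyond are hypothetical, chosen only for illustration; a production model would use a library such as scikit-learn or XGBoost and many more features.

```python
import math


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(weights, x):
    # weights[0] is the bias; the rest align with the feature vector x.
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))


def train_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit weights (bias first) by batch gradient descent on log loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grads = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            err = score(weights, x) - y
            grads[0] += err
            for j, v in enumerate(x):
                grads[j + 1] += err * v
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grads)]
    return weights


# Hypothetical hand-engineered features: [sessions_last_30d / 10, has_added_to_cart]
train_x = [[0.1, 0], [0.2, 0], [0.3, 0], [0.8, 1], [0.9, 1], [0.7, 1], [0.6, 0], [0.2, 1]]
train_y = [0, 0, 0, 1, 1, 1, 1, 0]  # 1 = converted

weights = train_logistic(train_x, train_y)

# Score users who have not converted yet and rank them by predicted likelihood.
prospects = {"U302": [0.5, 1], "U303": [0.9, 1], "U304": [0.1, 0]}
ranked = sorted(prospects, key=lambda u: score(weights, prospects[u]), reverse=True)
print(ranked)
```

The model can only rank on the columns it is given, which is the limitation described above: any signal that lives in another table has to be engineered into a feature first.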

3. Collaborative filtering / embedding similarity

Learn user embeddings from interaction matrices (views, clicks, purchases) and find nearest neighbors to your seed audience in embedding space.

Best for

Captures behavioral co-occurrence patterns. Works well for media and e-commerce with dense interaction data.

Watch out for

Treats each interaction type independently. Cannot combine browsing, purchasing, and demographic signals in a single model without extensive engineering.
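The nearest-neighbor step of the embedding approach can be sketched in a few lines. This is an illustration only: the three-dimensional vectors below are made up, and in practice the embeddings would come from a trained model (matrix factorization, a two-tower network, or similar) and the search would use an approximate nearest-neighbor index.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Element-wise mean of a list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def lookalikes(embeddings, seed_ids, top_k):
    """Rank non-seed users by similarity to the centroid of the seed audience."""
    seed_center = centroid([embeddings[u] for u in seed_ids])
    candidates = [
        (user_id, cosine(vec, seed_center))
        for user_id, vec in embeddings.items()
        if user_id not in seed_ids
    ]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:top_k]


# Hypothetical user embeddings; U301 is the only converter in the seed list.
embeddings = {
    "U301": [0.9, 0.1, 0.0],
    "U302": [0.1, 0.9, 0.2],
    "U303": [0.8, 0.2, 0.1],
    "U304": [0.0, 0.1, 0.9],
}

result = lookalikes(embeddings, seed_ids={"U301"}, top_k=2)
for user_id, similarity in result:
    print(user_id, round(similarity, 3))
```

Averaging the seed vectors into one centroid is a simplifying choice; a seed audience with several distinct behavioral clusters would be better served by scoring against each seed user or each cluster separately.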

4. KumoRFM (relational graph ML)

Connect users, behaviors, segments, conversions, and demographics into a single graph. The GNN learns holistic user embeddings that encode behavioral similarity across all dimensions simultaneously.

Best for

Highest-quality audiences. Captures behavioral graph similarity (browsing sequences, category affinities, social connections to converters) that no single-table model can represent.

Watch out for

Requires first-party behavioral data in normalized tables. Adds most value when you have multiple interaction types (browse, click, purchase, engage) to connect.

Key metric: Graph-learned lookalike audiences convert 25-40% better than platform-native demographic-based lookalikes at equivalent audience sizes.

Why relational data changes the answer

Audience quality is determined by how well you measure similarity between users. Platform-native lookalikes measure similarity on a handful of demographic and interest signals. CRM-based propensity models add purchase history but still treat each user as an independent row. Neither approach captures the behavioral graph: which product categories a user browses in sequence, which content they engage with deeply, and how their behavior patterns cluster with existing converters.

Relational models encode all of this into a single user embedding. Two users who look nothing alike demographically (different age, different city, different income) but who exhibit the same browsing-to-purchase sequence for the same product categories will end up close in the embedding space. On the RelBench benchmark, relational models score 76.71 vs 62.44 for single-table approaches -- a gap that translates directly to audience quality and downstream conversion rates.

Platform-native lookalikes are like a dating app that matches people by height, age, and city. You get surface-level similarity but miss compatibility. Graph-based audience modeling is like a matchmaker who watches how people spend their weekends, what they read, what they cook, and who their friends are. The behavioral fingerprint is what predicts a real match, not the demographic profile.

How KumoRFM solves this

Graph-powered intelligence for advertising

Kumo encodes users, behaviors, segments, conversions, and demographics into a single graph. The GNN learns user embeddings that capture deep behavioral similarity, not just demographic overlap. PQL's RANK TOP operator surfaces the highest-scoring non-converters, giving media buyers a ready-to-activate audience list ranked by predicted conversion probability.

From data to predictions

See the full pipeline in action

Connect your tables, write a PQL query, and get predictions with built-in explainability — all in minutes, not months.

Your data

The relational tables Kumo learns from

USERS

user_id	signup_date	geo	device
U301	2024-06-15	US-West	iOS
U302	2024-09-20	US-East	Android
U303	2025-01-05	EU-West	iOS

BEHAVIORS

event_id	user_id	action	category	timestamp
E601	U301	page_view	Electronics	2025-02-28
E602	U302	add_to_cart	Fashion	2025-03-01
E603	U303	page_view	Electronics	2025-03-01

SEGMENTS

segment_id	user_id	segment_name
SEG01	U301	High-intent
SEG02	U302	Browsers
SEG03	U303	New-visitor

CONVERSIONS

conversion_id	user_id	value	timestamp
CVR201	U301	$320	2025-02-28

DEMOGRAPHICS

user_id	age_range	income_tier	interests
U301	25-34	High	Tech, Fitness
U302	35-44	Medium	Fashion, Travel
U303	25-34	High	Tech, Gaming
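To show how rows from separate tables become one connected structure, the sketch below links the sample users to the categories, segments, and demographic values they share. It is a toy illustration built from the sample rows above, not a description of how Kumo constructs its graph internally; the node typing scheme and the helper names are assumptions.

```python
from collections import defaultdict

# Sample rows copied from the tables above (only the columns used here).
behaviors = [
    ("U301", "Electronics"),
    ("U302", "Fashion"),
    ("U303", "Electronics"),
]
segments = [("U301", "High-intent"), ("U302", "Browsers"), ("U303", "New-visitor")]
demographics = [
    ("U301", "25-34", "High"),
    ("U302", "35-44", "Medium"),
    ("U303", "25-34", "High"),
]


def build_graph():
    """Return an undirected adjacency map over typed nodes like ("user", "U301")."""
    graph = defaultdict(set)

    def link(a, b):
        graph[a].add(b)
        graph[b].add(a)

    for user_id, category in behaviors:
        link(("user", user_id), ("category", category))
    for user_id, segment_name in segments:
        link(("user", user_id), ("segment", segment_name))
    for user_id, age_range, income_tier in demographics:
        link(("user", user_id), ("age_range", age_range))
        link(("user", user_id), ("income_tier", income_tier))
    return graph


def shared_neighbors(graph, user_a, user_b):
    """Nodes both users connect to; a crude stand-in for learned similarity."""
    return graph[("user", user_a)] & graph[("user", user_b)]


graph = build_graph()
overlap_303 = shared_neighbors(graph, "U301", "U303")  # converter vs. U303
overlap_302 = shared_neighbors(graph, "U301", "U302")  # converter vs. U302
print(sorted(overlap_303))
print(sorted(overlap_302))
```

In this sample, U303 shares a browsed category, an age range, and an income tier with the converter U301, while U302 shares nothing. A graph model learns weighted, multi-hop versions of this kind of overlap rather than counting shared neighbors.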

Write your PQL query

Describe what to predict in 2–3 lines — Kumo handles the rest

PQL

PREDICT BOOL(CONVERSIONS.conversion_id, 0, 30, days)
FOR EACH USERS.user_id
WHERE COUNT(CONVERSIONS.*, -365, 0, days) = 0
RANK TOP 100000
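Read as ordinary list processing, the query keeps users with no conversion in the prior 365 days and returns the highest-scoring ones. The sketch below mimics only that filter-and-rank step on scores that already exist. The function name, the as-of date, the window boundary handling, and the score assigned to U301 are assumptions for illustration; PQL's exact window semantics may differ.

```python
from datetime import date, timedelta


def rank_non_converters(scores, conversions, as_of, lookback_days=365, top_k=100_000):
    """Drop users who converted in the lookback window, then rank the rest by score."""
    window_start = as_of - timedelta(days=lookback_days)
    recent_converters = {
        user_id for user_id, converted_on in conversions
        if window_start < converted_on <= as_of
    }
    eligible = [(u, p) for u, p in scores.items() if u not in recent_converters]
    eligible.sort(key=lambda pair: pair[1], reverse=True)
    return [(rank, u, p) for rank, (u, p) in enumerate(eligible[:top_k], start=1)]


# Scores for U303, U302, and U508 match the sample output on this page;
# the score for U301 is hypothetical and exists only to show the exclusion.
scores = {"U301": 0.61, "U302": 0.18, "U303": 0.34, "U508": 0.15}
conversions = [("U301", date(2025, 2, 28))]  # from the CONVERSIONS sample row

audience = rank_non_converters(scores, conversions, as_of=date(2025, 3, 2))
for rank, user_id, prob in audience:
    print(rank, user_id, prob)
```

U301 is excluded because it already converted, which is why the ranked list starts with U303.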

Prediction output

Every entity gets a score, updated continuously

USER_ID	CONVERSION_PROB	RANK	SEGMENT
U303	0.34	1	New-visitor
U302	0.18	2	Browsers
U508	0.15	3	Re-engaged

Understand why

Every prediction includes feature attributions — no black boxes

User U303 -- New-visitor segment

Predicted: 34% conversion probability (Rank #1)

Top contributing features

Feature	Value	Attribution
Browsing pattern similarity to converters	92% match	33%
Category affinity overlap	Electronics	25%
Device and geo match to seed audience	iOS + US-West	18%
Session depth last 7 days	12 pages	14%
Connected users who converted	3 of 8	10%

Feature attributions are computed automatically for every prediction. No separate tooling required.

PQL Documentation

Learn the Predictive Query Language — SQL-like syntax for defining any prediction task in 2–3 lines.

Python SDK

Integrate Kumo predictions into your pipelines. Train, evaluate, and deploy models programmatically.

Explainability Docs

Understand feature attributions, model evaluation metrics, and how to build trust with stakeholders.

Frequently asked questions

Common questions about lookalike audience modeling

How do you build lookalike audiences without third-party data?

Focus on first-party behavioral data: browsing sequences, content engagement, purchase patterns, and email interactions. Graph models extract maximum signal from this data by connecting multiple interaction types through their natural relationships. The richer your first-party behavioral data, the less you depend on third-party audience signals.

What is the best way to find high-value customers for targeting?

Build behavioral similarity models on your first-party data rather than relying on platform-native lookalikes. Graph neural networks learn which behavioral patterns (browsing sequences, category affinities, engagement depth) predict high-value conversion, producing audiences that convert 25-40% better than demographic-only targeting.

How do lookalike audiences scale without losing quality?

Quality degrades with scale because as you expand the audience, you include less-similar users. Graph-based models degrade more gracefully because they measure similarity on a richer set of behavioral dimensions. Where platform lookalikes lose effectiveness at 2-3% expansion, graph-learned audiences maintain quality up to 5-8% because the similarity signal is stronger.
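One way to see this degradation in your own data is to compute the conversion rate at several expansion cutoffs of a ranked list. The sketch below assumes a list of held-out outcomes already sorted by descending model score; the outcomes and the cutoff fractions are hypothetical, and the fractions refer to shares of the scored candidate list rather than to ad-platform lookalike percentages.

```python
def rate_at_cutoffs(ranked_outcomes, cutoffs):
    """Conversion rate among the top fraction of a score-ranked list.

    ranked_outcomes: 0/1 conversion flags, sorted by descending model score.
    cutoffs: fractions of the list to include, e.g. 0.1 for the top 10%.
    """
    n = len(ranked_outcomes)
    rates = {}
    for fraction in cutoffs:
        size = max(1, round(n * fraction))
        rates[fraction] = sum(ranked_outcomes[:size]) / size
    return rates


# Hypothetical held-out outcomes for 20 scored users, best score first.
ranked_outcomes = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0,
                   0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

rates = rate_at_cutoffs(ranked_outcomes, cutoffs=[0.1, 0.25, 0.5, 1.0])
for fraction, rate in rates.items():
    print(f"top {fraction:.0%}: {rate:.0%} conversion rate")
```

A model with a stronger similarity signal shows a flatter curve: the rate at the wider cutoffs stays closer to the rate at the narrowest one.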

What data do you need for audience modeling?

A seed list of converters, user behavioral events (page views, clicks, add-to-cart, purchases) with timestamps, and user profiles. For best results, add content metadata, product categories, and segment memberships. More connected tables give the model more dimensions of similarity to learn from.

How do you measure lookalike audience quality?

Track incremental conversion rate and incremental ROAS against a holdout group. The only metric that matters is whether the lookalike audience converts at a meaningfully higher rate than a random sample. Graph-based audiences typically show 25-40% higher conversion rates than platform-native lookalikes of the same size.
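The holdout comparison reduces to a few lines of arithmetic. The sketch below is illustrative: the group sizes, conversion counts, revenue, and spend are hypothetical, and scaling holdout revenue by group size is one simple way to estimate the baseline, assuming users were assigned to the holdout at random.

```python
def conversion_rate(conversions: int, users: int) -> float:
    if users <= 0:
        raise ValueError("users must be positive")
    return conversions / users


def relative_lift(exposed_rate: float, holdout_rate: float) -> float:
    """Relative improvement of the exposed group's rate over the holdout's."""
    if holdout_rate <= 0:
        raise ValueError("holdout_rate must be positive")
    return (exposed_rate - holdout_rate) / holdout_rate


def incremental_roas(exposed_revenue, holdout_revenue, exposed_users, holdout_users, spend):
    """Revenue above the holdout-implied baseline, per dollar of ad spend."""
    # Scale holdout revenue to the exposed group's size to estimate the no-ads baseline.
    baseline = holdout_revenue * exposed_users / holdout_users
    return (exposed_revenue - baseline) / spend


# Hypothetical campaign: 50,000 exposed users and a 10,000-user holdout.
exposed_rate = conversion_rate(1_500, 50_000)   # 3.0%
holdout_rate = conversion_rate(240, 10_000)     # 2.4%
lift = relative_lift(exposed_rate, holdout_rate)

iroas = incremental_roas(
    exposed_revenue=120_000,
    holdout_revenue=19_200,
    exposed_users=50_000,
    holdout_users=10_000,
    spend=12_000,
)
print(f"lift: {lift:.1%}, incremental ROAS: {iroas:.2f}")
```

With small groups the difference between the two rates can be noise, so a significance test or confidence interval on the lift is worth adding before acting on it.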

Bottom line: A DTC brand spending $20M on acquisition could generate roughly $5M in incremental revenue (a 25% improvement on the same spend) by replacing platform-native lookalikes with Kumo's graph-learned audience models. Behavioral graph similarity outperforms demographic-only targeting by 25-40%.



One Platform. One Model. Infinite Predictions.

KumoRFM

Relational Foundation Model

Turn structured relational data into predictions in seconds. KumoRFM delivers zero-shot predictions that rival months of traditional data science. No training, feature engineering, or infrastructure required. Just connect your data and start predicting.

For critical use cases, fine-tune KumoRFM on your data using the Kumo platform and Research Agent for 30%+ higher accuracy than traditional models.
