CTR Prediction
“What is the click probability for this ad impression?”
A real-world example
What is the click probability for this ad impression?
Traditional CTR models rely on user-level features and ad metadata, missing the cross-entity signals that actually drive clicks: which publishers attract which user segments, how creative fatigue propagates across campaigns, and which ad-user pairings historically convert. For an ad platform serving 10B impressions per day, a 5% CTR lift translates to $120M in additional annual revenue.
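The $120M figure follows from simple arithmetic. The baseline CTR and revenue-per-click below are illustrative assumptions (the text states only the impression volume, the 5% lift, and the result):

```python
# Back-of-the-envelope check of the revenue claim. The baseline CTR and
# revenue-per-click values are assumptions chosen for illustration.
impressions_per_day = 10e9
days_per_year = 365
baseline_ctr = 0.002        # 0.2% baseline click-through rate (assumption)
revenue_per_click = 0.33    # dollars per click (assumption)
relative_lift = 0.05        # 5% relative CTR improvement

clicks_per_year = impressions_per_day * days_per_year * baseline_ctr
extra_clicks = clicks_per_year * relative_lift
extra_revenue = extra_clicks * relative_lift * 0 + extra_clicks * revenue_per_click
print(f"${extra_revenue / 1e6:.0f}M additional annual revenue")  # ≈ $120M
```

Under these assumptions the lift is worth roughly $120M per year; with a different baseline CTR or click value the absolute number scales proportionally.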
Quick answer
Graph neural networks predict click-through rates by learning cross-entity patterns between users, ads, publishers, and campaigns. Instead of hand-engineering features for months, a GNN reads the relational structure directly and discovers signals like 'users who clicked similar creatives on related publishers,' producing 5-15% CTR lift over flat-table baselines.
Approaches compared
4 ways to solve this problem
1. Logistic regression / feature-engineered models
Build user-level and ad-level features (demographics, ad category, device type), then train a logistic regression or shallow model on click/no-click labels.
Best for
Fast baseline when you have limited data and a small engineering team. Easy to explain and audit.
Watch out for
Requires months of feature engineering. Misses cross-entity interactions like publisher-audience affinity and creative fatigue propagation across campaigns.
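A minimal sketch of this approach using scikit-learn. The feature columns and toy labels are hypothetical, standing in for months of real feature engineering:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Hypothetical hand-engineered features for a few impressions:
# (device_type, ad_category) categoricals plus a numeric
# user_avg_ctr_last_7d aggregate -- exactly the flat columns the text
# describes.
X_cat = np.array([["Mobile", "Electronics"],
                  ["Desktop", "Fashion"],
                  ["Mobile", "Electronics"],
                  ["Desktop", "Electronics"]])
X_num = np.array([[0.021], [0.008], [0.034], [0.012]])
y = np.array([1, 0, 1, 0])  # click / no-click labels

# One-hot encode the categorical columns, then append the numeric one.
enc = OneHotEncoder()
X = np.hstack([enc.fit_transform(X_cat).toarray(), X_num])

clf = LogisticRegression().fit(X, y)
probs = clf.predict_proba(X)[:, 1]  # per-impression click probability
```

The model is transparent (each coefficient maps to one feature), which is why this baseline is easy to audit; the cost is that every cross-entity signal must be built by hand as another column.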
2. Deep learning (wide-and-deep, DeepFM)
Combine memorization (wide component) with generalization (deep component) to model feature interactions. Popular at Google, Huawei, and other ad platforms.
Best for
Good at learning feature crosses from sparse categorical data. Strong when you have billions of training examples.
Watch out for
Still operates on a flat feature table. Cannot capture multi-hop relationships like 'this user's segment historically converts on this publisher's inventory for this advertiser category.'
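The "feature cross" machinery at the heart of DeepFM is the second-order factorization-machine term. A sketch of that term in NumPy, using the standard square-of-sum minus sum-of-squares identity:

```python
import numpy as np

def fm_interaction(x, V):
    """Second-order factorization-machine term used inside DeepFM.

    x : (d,) feature vector, V : (d, k) latent factor matrix.
    Computes sum_{i<j} <V[i], V[j]> * x[i] * x[j] in O(d*k) instead of
    O(d^2) via the square-of-sum minus sum-of-squares identity.
    """
    sum_sq = (x @ V) ** 2            # (k,) square of summed factors
    sq_sum = (x ** 2) @ (V ** 2)     # (k,) sum of squared factors
    return 0.5 * (sum_sq - sq_sum).sum()

# Sanity check against the naive O(d^2) double loop.
rng = np.random.default_rng(0)
x, V = rng.normal(size=5), rng.normal(size=(5, 3))
naive = sum(V[i] @ V[j] * x[i] * x[j]
            for i in range(5) for j in range(i + 1, 5))
assert np.isclose(fm_interaction(x, V), naive)
```

Every pairwise cross is learned through shared latent factors -- powerful for sparse categorical data, but still pairwise over one flat row, which is why multi-hop relationships stay out of reach.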
3. Gradient boosted trees (XGBoost, LightGBM)
Train an ensemble of decision trees on tabular features. The go-to baseline for most ad tech prediction tasks.
Best for
Strong out-of-the-box performance on structured data. Handles missing values and mixed feature types well.
Watch out for
Treats each impression independently. Cannot learn from the graph structure connecting users to ads to publishers to campaigns without extensive manual feature joins.
4. KumoRFM (relational graph ML)
Point Kumo at your raw impression, click, campaign, and publisher tables. Write a two-line PQL query. The GNN automatically discovers cross-table temporal patterns.
Best for
Highest accuracy with minimal feature engineering. Captures publisher-user affinity, creative fatigue cycles, and campaign-level budget signals all at once.
Watch out for
Requires relational data in normalized tables. Not the right tool if you only have a single pre-aggregated feature CSV.
Key metric: on the RelBench benchmark, relational models score 76.71 vs. 62.44 for single-table baselines on prediction tasks over multi-table data.

Why relational data changes the answer
CTR prediction is inherently a multi-entity problem. The click decision depends on the user (interests, device, context), the ad (creative type, campaign objective, frequency), and the publisher (content category, audience quality, placement). Flat-table models force you to collapse all of this into a single row of features per impression, destroying the relational structure that actually drives clicks. You end up with columns like 'user_avg_ctr_last_7d' and 'publisher_avg_ctr_electronics' -- static aggregates that miss the dynamic interplay between entities.
Relational models read these tables as a connected graph. They learn that User U003 on mobile has high affinity for Electronics ads specifically when shown on lifestyle publishers with premium inventory, and that this pattern strengthens on weekday mornings. These multi-hop, time-aware signals are exactly what manual feature engineering tries to approximate but rarely captures completely. On benchmarks like RelBench, relational approaches score 76.71 vs 62.44 for single-table baselines -- a gap that translates directly to millions in ad revenue at scale.
Predicting CTR from a flat feature table is like casting a movie by looking at each actor's headshot in isolation. You might pick individually talented people, but you will miss the chemistry between them. The relational graph is the screen test -- it shows you how the user, the ad creative, and the publisher context interact together, and that interaction is what makes the audience click.
How KumoRFM solves this
Graph-powered intelligence for advertising
Kumo builds a heterogeneous graph connecting users, ads, impressions, clicks, campaigns, and publishers. The GNN learns latent patterns like 'users who clicked similar creatives on related publishers' without manual feature engineering. PQL lets you express the prediction in two lines while Kumo automatically discovers the cross-table signals that traditional models require months of feature work to approximate.
From data to predictions
See the full pipeline in action
Connect your tables, write a PQL query, and get predictions with built-in explainability — all in minutes, not months.
Your data
The relational tables Kumo learns from
USERS
| user_id | segment | device_type | geo |
|---|---|---|---|
| U001 | Tech-savvy | Mobile | US-West |
| U002 | Bargain-hunter | Desktop | US-East |
| U003 | Luxury | Mobile | EU-West |
ADS
| ad_id | campaign_id | creative_type | category |
|---|---|---|---|
| A100 | CMP01 | Video | Electronics |
| A101 | CMP02 | Banner | Fashion |
| A102 | CMP01 | Native | Electronics |
IMPRESSIONS
| impression_id | user_id | ad_id | publisher_id | timestamp |
|---|---|---|---|---|
| IMP5001 | U001 | A100 | PUB01 | 2025-03-01 08:12 |
| IMP5002 | U002 | A101 | PUB02 | 2025-03-01 09:45 |
| IMP5003 | U003 | A102 | PUB03 | 2025-03-01 10:30 |
CLICKS
| click_id | impression_id | user_id | timestamp |
|---|---|---|---|
| CLK301 | IMP5001 | U001 | 2025-03-01 08:12 |
| CLK302 | IMP4990 | U002 | 2025-02-28 14:20 |
CAMPAIGNS
| campaign_id | advertiser | budget | objective |
|---|---|---|---|
| CMP01 | TechCorp | $500K | Conversions |
| CMP02 | FashionBrand | $200K | Awareness |
PUBLISHERS
| publisher_id | name | category | avg_ctr |
|---|---|---|---|
| PUB01 | TechNews | Technology | 2.1% |
| PUB02 | StyleMag | Fashion | 1.8% |
| PUB03 | LuxuryDigest | Lifestyle | 3.2% |
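To see why the foreign keys matter, here is one cross-entity feature built by hand from the sample tables above (reduced to the needed columns). A flat-table pipeline needs a join like this written and maintained for every such signal; a relational model walks the same keys automatically:

```python
import pandas as pd

# Subset of the sample tables above, keyed the same way.
impressions = pd.DataFrame({
    "impression_id": ["IMP5001", "IMP5002", "IMP5003"],
    "user_id": ["U001", "U002", "U003"],
    "ad_id": ["A100", "A101", "A102"],
    "publisher_id": ["PUB01", "PUB02", "PUB03"],
})
clicks = pd.DataFrame({"impression_id": ["IMP5001"]})
ads = pd.DataFrame({"ad_id": ["A100", "A101", "A102"],
                    "category": ["Electronics", "Fashion", "Electronics"]})
publishers = pd.DataFrame({"publisher_id": ["PUB01", "PUB02", "PUB03"],
                           "category": ["Technology", "Fashion", "Lifestyle"]})

# Multi-hop feature: historical CTR per (ad category, publisher category),
# built by chaining the natural foreign keys.
df = (impressions
      .merge(ads, on="ad_id")
      .merge(publishers, on="publisher_id", suffixes=("_ad", "_pub"))
      .assign(clicked=lambda d: d.impression_id.isin(clicks.impression_id)))
affinity = df.groupby(["category_ad", "category_pub"])["clicked"].mean()
print(affinity)
```

Multiply this pattern by every entity pair and time window and the "months of feature engineering" claim becomes concrete.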
Write your PQL query
Describe what to predict in 2–3 lines — Kumo handles the rest
PREDICT BOOL(CLICKS.click_id, 0, 1, hours) FOR EACH IMPRESSIONS.impression_id
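The same declarative pattern extends to related targets. These variants are illustrative sketches of PQL's general shape (aggregation over a time window, `FOR EACH` entity), not queries verified against the current documentation:

```sql
-- Will this impression receive any click within the next hour?
PREDICT COUNT(CLICKS.*, 0, 1, hours) > 0
FOR EACH IMPRESSIONS.impression_id

-- How many clicks will this campaign's ads receive in the next 7 days?
PREDICT COUNT(CLICKS.*, 0, 7, days)
FOR EACH CAMPAIGNS.campaign_id
```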
Prediction output
Every entity gets a score, updated continuously
| IMPRESSION_ID | USER_ID | AD_ID | CLICK_PROB |
|---|---|---|---|
| IMP5001 | U001 | A100 | 0.087 |
| IMP5002 | U002 | A101 | 0.023 |
| IMP5003 | U003 | A102 | 0.142 |
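Downstream, these scores typically feed expected-value bidding. A minimal sketch using the predictions above; the per-impression click values are hypothetical:

```python
# Click probabilities from the prediction output above.
predictions = {"IMP5001": 0.087, "IMP5002": 0.023, "IMP5003": 0.142}
# Hypothetical advertiser value per click, in dollars.
value_per_click = {"IMP5001": 1.50, "IMP5002": 0.80, "IMP5003": 2.20}

# Expected value of each impression = P(click) * value-per-click.
bids = {imp: p * value_per_click[imp] for imp, p in predictions.items()}
best = max(bids, key=bids.get)  # the impression worth bidding highest on
```

Here IMP5003 wins despite a sub-15% click probability, because expected value, not raw CTR, drives the bid.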
Understand why
Every prediction includes feature attributions — no black boxes
Impression IMP5003 -- User U003 x Ad A102
Predicted: 14.2% click probability
Top contributing features
User affinity for Electronics category
High
31% attribution
Publisher LuxuryDigest avg CTR
3.2%
24% attribution
Native creative on mobile
True
19% attribution
User clicked similar ads (last 7d)
4 clicks
15% attribution
Campaign frequency cap remaining
8 of 10
11% attribution
Feature attributions are computed automatically for every prediction. No separate tooling required. Learn more about Kumo explainability
PQL Documentation
Learn the Predictive Query Language — SQL-like syntax for defining any prediction task in 2–3 lines.
Python SDK
Integrate Kumo predictions into your pipelines. Train, evaluate, and deploy models programmatically.
Explainability Docs
Understand feature attributions, model evaluation metrics, and how to build trust with stakeholders.
Frequently asked questions
Common questions about CTR prediction
What is the best machine learning model for CTR prediction?
Graph neural networks that operate on relational data (users, ads, publishers, campaigns) outperform flat-table models by capturing cross-entity interactions. Traditional approaches like DeepFM and XGBoost require months of feature engineering to approximate what a GNN learns automatically from the table structure. The performance gap widens as your data becomes more relational.
How do you improve CTR prediction beyond feature engineering?
Stop collapsing your relational data into flat feature rows. Instead, let a graph model read your normalized tables directly. The biggest accuracy gains come from cross-entity signals -- publisher-audience affinity, creative fatigue propagation, campaign budget pacing -- that are nearly impossible to capture as hand-crafted features.
What data do you need for a CTR prediction model?
At minimum: impression logs, click events, and user profiles. For best results, add campaign metadata, publisher attributes, creative features, and device context. The power comes from joining these tables through their natural foreign keys, not from any single data source.
How does creative fatigue affect CTR prediction?
Creative fatigue is a cross-entity, temporal signal: the same ad shown to the same user segment on the same publisher decays in click rate over time. Flat models capture this crudely as 'days since creative launch.' Relational models track fatigue per user-creative-publisher combination, detecting when a creative still works for new audiences even as it fatigues returning ones.
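The per-combination fatigue signal described above can be computed directly from an event log. A sketch on a synthetic log, where one creative fatigues for a returning segment while staying fresh for a new one:

```python
import pandas as pd

# Synthetic impression log for one creative. The Tech segment has seen
# it repeatedly on PUB01; the Luxury segment is seeing it fresh on PUB03.
events = pd.DataFrame({
    "segment":   ["Tech"] * 6 + ["Luxury"] * 3,
    "ad_id":     ["A100"] * 9,
    "publisher": ["PUB01"] * 6 + ["PUB03"] * 3,
    "clicked":   [1, 1, 0, 0, 0, 0,   # decays for the returning audience
                  1, 1, 1],           # still fresh for the new audience
})
# Exposure index: how many times this (segment, creative, publisher)
# combination has already been shown.
events["exposure"] = events.groupby(
    ["segment", "ad_id", "publisher"]).cumcount()

# CTR per exposure bucket per combination -- the per-combination decay
# curve that a flat 'days since creative launch' feature cannot see.
fatigue = (events.assign(bucket=events.exposure // 3)
                 .groupby(["segment", "publisher", "bucket"])["clicked"]
                 .mean())
print(fatigue)
```

The Tech/PUB01 combination drops from a 0.67 CTR in its first exposure bucket to 0 in the next, while Luxury/PUB03 sits at 1.0: the creative is fatigued for one audience and working for another at the same moment.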
What CTR lift can you expect from better prediction models?
Teams moving from hand-engineered feature tables to relational graph models typically see 5-15% CTR lift. For an ad platform serving billions of daily impressions, even a 5% improvement translates to $100M+ in annual revenue. The lift comes from capturing signals that flat models structurally cannot represent.
Bottom line: An ad platform serving 10B daily impressions that improves CTR prediction by 5% unlocks $120M in annual revenue. Kumo captures cross-entity signals between users, creatives, and publishers that flat feature tables miss entirely.
Related use cases
Explore more ad tech use cases
Topics covered
One Platform. One Model. Infinite Predictions.
KumoRFM
Relational Foundation Model
Turn structured relational data into predictions in seconds. KumoRFM delivers zero-shot predictions that rival months of traditional data science. No training, feature engineering, or infrastructure required. Just connect your data and start predicting.
For critical use cases, fine-tune KumoRFM on your data using the Kumo platform and Research Agent for 30%+ higher accuracy than traditional models.
Book a demo and get a free trial of the full platform: research agent, fine-tune capabilities, and forward-deployed engineer support.




