Binary Classification · Fraud Detection

Ad Fraud Detection

Is this impression from a bot?



A real-world example

Is this impression from a bot?

Ad fraud costs the industry $84B annually. Rule-based filters catch known patterns but miss sophisticated bot networks that mimic human behavior. These bots share IP ranges, rotate device fingerprints, and generate realistic click patterns that pass individual-level checks. For an ad network processing $2B in spend, a 10% fraud rate means $200M lost to bots.

Quick answer

Graph neural networks detect ad fraud by identifying coordinated bot networks that appear legitimate in isolation but form conspicuous clusters in the device-IP-publisher graph. While rule-based systems catch known patterns, GNNs detect structural anomalies -- shared IP subnets, correlated click timing, device fingerprint cycling -- reducing fraud losses by 60-80%.

Approaches compared

4 ways to solve this problem

1. Rule-based filters (IP blocklists, velocity checks)

Flag impressions that exceed click velocity thresholds, come from known datacenter IPs, or match known bot signatures. The industry baseline.

Best for

Catching known fraud patterns quickly. Low latency, easy to deploy, and fully transparent.

Watch out for

Sophisticated bots rotate IPs, throttle click rates, and mimic human behavior specifically to evade these rules. Rule-based filters catch 30-40% of fraud at best, and false positive rates climb as you tighten thresholds.
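The velocity and blocklist checks described above fit in a few lines of Python. This is an illustrative sketch only; the blocklist contents and the 100-clicks-per-hour threshold are made-up values, not recommended settings.

```python
# Minimal rule-based impression filter: an IP blocklist plus a
# click-velocity threshold. Any single tripped rule flags the impression.

DATACENTER_IPS = {"192.168.1.50", "192.168.1.51"}  # hypothetical blocklist
MAX_CLICKS_PER_HOUR = 100                          # hypothetical threshold

def is_suspicious(impression: dict) -> bool:
    """Flag an impression if it trips any single rule."""
    if impression["ip_address"] in DATACENTER_IPS:
        return True
    if impression["clicks_last_hour"] > MAX_CLICKS_PER_HOUR:
        return True
    return False

flags = [is_suspicious(i) for i in [
    {"ip_address": "10.0.0.88", "clicks_last_hour": 4},       # legitimate
    {"ip_address": "192.168.1.50", "clicks_last_hour": 147},  # bot
]]
```

Note the structural limitation: each impression is checked in isolation, so a bot that rotates off the blocklist and throttles below the threshold passes cleanly.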

2. Anomaly detection (isolation forests, autoencoders)

Train unsupervised models on normal traffic patterns and flag deviations. Catches unusual behavior without needing labeled fraud examples.

Best for

Detecting new fraud patterns that rules haven't been written for yet. Good complement to rule-based systems.

Watch out for

High false positive rates. Treats each impression independently, so coordinated fraud that mimics normal per-device behavior slips through. Cannot detect the network structure of bot farms.
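As a toy stand-in for an isolation forest or autoencoder (an illustrative simplification, not any vendor's actual model), the sketch below scores each impression by its distance from the population mean. It also demonstrates the weakness noted above: a throttled bot with human-like click rates scores as normal.

```python
import statistics

def anomaly_scores(click_rates):
    """Score each impression by how far its click rate sits from the
    population mean, in standard deviations -- a toy per-impression
    anomaly detector."""
    mean = statistics.mean(click_rates)
    stdev = statistics.stdev(click_rates)
    return [abs(r - mean) / stdev for r in click_rates]

# A throttled bot clicking at human-like rates scores as normal, even
# if 50 such bots act in lockstep: per-impression models cannot see
# the coordination between them.
rates = [4, 6, 5, 7, 5, 5]  # last value is a throttled bot
scores = anomaly_scores(rates)
```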

3. Supervised classification (XGBoost on device features)

Train a classifier on labeled fraud/legitimate data using device, IP, and behavioral features. More accurate than rules for known fraud types.

Best for

Situations where you have high-quality labeled training data and fraud patterns that are stable over time.

Watch out for

Requires expensive manual labeling. Degrades quickly as fraudsters adapt. Treats each device independently, missing the coordinated nature of bot networks.
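A sketch of the per-device feature row such a classifier would consume; the field names are illustrative. Nothing in the row encodes which other devices share the same IP -- exactly the coordination signal that is lost at feature-engineering time.

```python
def device_features(ip_meta: dict, clicks: dict) -> list:
    """Flatten one device's attributes into an independent feature row
    for a supervised classifier. The device is scored on its own
    features; its relationships to other devices are not represented."""
    return [
        1.0 if ip_meta["datacenter"] else 0.0,
        clicks["clicks_last_hour"],
        clicks["avg_time_between_clicks"],
        clicks["unique_ads"],
    ]

row = device_features(
    {"datacenter": True},
    {"clicks_last_hour": 147, "avg_time_between_clicks": 0.4, "unique_ads": 3},
)
```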

4. KumoRFM (relational graph ML)

Connect impressions, devices, IPs, publishers, and click patterns into a single graph. The GNN detects bot network structure: shared subnets, correlated timing, fingerprint cycling, and publisher concentration anomalies.

Best for

Detecting sophisticated coordinated fraud. The graph structure reveals bot networks that look legitimate at the individual device level but form obvious clusters when you see the connections.

Watch out for

Requires device-level impression data with IP and publisher connections. Most effective when you have enough traffic volume to see network-level patterns (1M+ daily impressions).

Key metric: on the SAP SALT benchmark, graph-aware fraud models achieve 91% accuracy, vs 75% for feature-engineered and 63% for rule-based approaches.

Why relational data changes the answer

Ad fraud is a network problem. A single bot impression can look perfectly normal: reasonable click timing, a real-looking device fingerprint, a residential IP address. But zoom out and you see 47 devices on the same /24 subnet, all clicking the same 3 ads on the same publisher within the same 10-minute window. That coordination is invisible to any model that evaluates impressions independently. Rule-based systems check each impression against thresholds. Supervised classifiers score each device on its own features. Neither can see the forest for the trees.
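The zoomed-out view can be made concrete with a simple grouping: bucket impressions by /24 subnet, publisher, and 10-minute window, then count distinct devices per bucket. This is an illustrative approximation of the structural signal, not how KumoRFM's GNN actually works -- the GNN learns such patterns from the graph rather than from a hand-written rule.

```python
from collections import defaultdict

def subnet_24(ip: str) -> str:
    """Collapse an IPv4 address to its /24 prefix: 192.168.1.50 -> 192.168.1"""
    return ".".join(ip.split(".")[:3])

def coordinated_clusters(impressions, min_devices=10):
    """Group impressions by (/24 subnet, publisher, 10-minute bucket)
    and return groups with suspiciously many distinct devices."""
    groups = defaultdict(set)
    for imp in impressions:
        key = (subnet_24(imp["ip"]), imp["publisher"], imp["minute"] // 10)
        groups[key].add(imp["device"])
    return {k: v for k, v in groups.items() if len(v) >= min_devices}

# 47 devices on the same /24, hitting the same publisher in the same
# 10-minute window -- invisible one impression at a time.
bots = [{"ip": f"203.0.113.{i}", "publisher": "PUB01",
         "minute": 130 + i % 10, "device": f"DEV{i:03d}"} for i in range(47)]
human = [{"ip": "10.0.0.88", "publisher": "PUB02",
          "minute": 570, "device": "DEV900"}]
clusters = coordinated_clusters(bots + human)
```

Each bot impression passes individual checks; only the grouping exposes the 47-device cluster, while the lone human impression is never flagged.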

Relational models read the impression-device-IP-publisher graph and learn what coordinated fraud looks like structurally. They detect that a cluster of devices sharing an IP range, exhibiting correlated click timing, and concentrating on a single publisher forms a pattern that legitimate traffic never produces. This is why graph-based fraud detection catches 60-80% of sophisticated bot traffic while rule-based systems cap out at 30-40%. The SAP SALT benchmark shows similar relational advantages: 91% accuracy for graph-aware models vs 75% for feature-engineered approaches vs 63% for rule-based systems.

Detecting ad fraud one impression at a time is like trying to spot a pickpocket ring by watching each person individually on security cameras. Each pickpocket looks like a normal shopper. But if you overlay their movements on a single map, you see them working in coordinated patterns -- one distracts, another bumps, a third lifts the wallet. Graph-based fraud detection is that overhead map. It reveals the coordination that individual-level analysis cannot see.

How KumoRFM solves this

Graph-powered intelligence for advertising

Kumo builds a graph connecting impressions, devices, IPs, publishers, and click patterns. Bot networks that appear legitimate in isolation form conspicuous clusters in the graph: shared IP subnets, correlated click timing, device fingerprint cycling, and abnormal publisher concentration. The GNN detects these structural anomalies without hand-crafted rules, adapting as fraud tactics evolve.
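A minimal sketch of the edge lists such a graph could be built from, using the table and column names from the example tables on this page. This is an assumed simplification for illustration, not Kumo's internal graph representation.

```python
def build_edges(impressions):
    """Derive two edge sets of the impression graph: each impression
    links its device to an IP and to a publisher. Devices sharing an
    IP or publisher become near neighbors in the resulting graph."""
    device_ip, device_pub = set(), set()
    for imp in impressions:
        device_ip.add((imp["device_id"], imp["ip_address"]))
        device_pub.add((imp["device_id"], imp["publisher_id"]))
    return device_ip, device_pub

imps = [
    {"device_id": "DEV001", "ip_address": "192.168.1.50", "publisher_id": "PUB01"},
    {"device_id": "DEV002", "ip_address": "192.168.1.51", "publisher_id": "PUB01"},
]
device_ip, device_pub = build_edges(imps)
```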

From data to predictions

See the full pipeline in action

Connect your tables, write a PQL query, and get predictions with built-in explainability — all in minutes, not months.

Step 1: Your data

The relational tables Kumo learns from

IMPRESSIONS

impression_id | device_id | ip_address   | publisher_id | timestamp
IMP801        | DEV001    | 192.168.1.50 | PUB01        | 2025-03-01 02:14
IMP802        | DEV002    | 192.168.1.51 | PUB01        | 2025-03-01 02:14
IMP803        | DEV003    | 10.0.0.88    | PUB02        | 2025-03-01 09:30

DEVICES

device_id | device_type | os      | fingerprint_hash
DEV001    | Mobile      | Android | FP-AA1
DEV002    | Mobile      | Android | FP-AA2
DEV003    | Desktop     | Windows | FP-BB1

IPS

ip_address   | asn     | geo     | datacenter
192.168.1.50 | AS12345 | US-East | True
192.168.1.51 | AS12345 | US-East | True
10.0.0.88    | AS67890 | US-West | False

PUBLISHERS

publisher_id | name        | category   | fraud_history_rate
PUB01        | QuickClicks | News       | 12.4%
PUB02        | TechReview  | Technology | 0.8%

CLICK_PATTERNS

device_id | clicks_last_hour | avg_time_between_clicks | unique_ads
DEV001    | 147              | 0.4s                    | 3
DEV002    | 132              | 0.5s                    | 3
DEV003    | 4                | 45s                     | 4
Step 2: Write your PQL query

Describe what to predict in 2–3 lines — Kumo handles the rest

PQL
PREDICT BOOL(IMPRESSIONS.is_fraud, 0, 1, hours)
FOR EACH IMPRESSIONS.impression_id
Step 3: Prediction output

Every entity gets a score, updated continuously

impression_id | device_id | fraud_prob | verdict
IMP801        | DEV001    | 0.96       | Fraud
IMP802        | DEV002    | 0.94       | Fraud
IMP803        | DEV003    | 0.03       | Legitimate
Step 4: Understand why

Every prediction includes feature attributions — no black boxes

Impression IMP801 -- Device DEV001

Predicted: 96% fraud probability

Top contributing features:

Feature                               | Value             | Attribution
IP subnet cluster size                | 47 devices on /24 | 31%
Click velocity (last hour)            | 147 clicks        | 26%
Datacenter IP flag                    | True              | 20%
Publisher historical fraud rate       | 12.4%             | 14%
Device fingerprint rotation frequency | 3 per hour        | 9%

Feature attributions are computed automatically for every prediction. No separate tooling required. Learn more about Kumo explainability

Frequently asked questions

Common questions about ad fraud detection

How do you detect sophisticated ad fraud bots?

Graph-based detection is the most effective approach for sophisticated bots. These bots mimic human behavior at the individual level, evading rule-based filters. But they cannot hide their network structure: shared IP subnets, correlated click timing, and device fingerprint rotation patterns form obvious clusters in the device-IP-publisher graph.

What percentage of ad traffic is fraudulent?

Industry estimates range from 5-15% of programmatic ad traffic, with some publishers and channels seeing rates above 20%. For an ad network processing $2B in spend, even a 10% fraud rate means $200M lost to bots. Graph-based detection recovers 60-80% of this by catching coordinated fraud that rule-based systems miss.

How do you reduce false positives in ad fraud detection?

Graph-based models reduce false positives because they require convergence of multiple network-level signals (IP clustering, timing correlation, publisher concentration) rather than triggering on single thresholds. A high click rate from a legitimate power user won't trigger a fraud flag because the network context is completely different from a bot farm.
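The convergence idea can be sketched as a simple vote over network-level signals; the signal names, thresholds, and two-of-three rule below are illustrative assumptions, not Kumo's scoring logic.

```python
def fraud_vote(signals: dict, required: int = 2) -> bool:
    """Flag only when at least `required` independent network-level
    signals fire, so a lone outlier (e.g. a power user's high click
    rate) cannot trigger a false positive on its own."""
    fired = sum([
        signals["subnet_cluster_size"] >= 10,
        signals["timing_correlation"] >= 0.9,
        signals["publisher_concentration"] >= 0.8,
    ])
    return fired >= required

# High publisher concentration alone: a legitimate power user.
power_user = {"subnet_cluster_size": 1, "timing_correlation": 0.2,
              "publisher_concentration": 0.95}
# All three signals converge: a bot-farm device.
bot = {"subnet_cluster_size": 47, "timing_correlation": 0.97,
       "publisher_concentration": 0.92}
```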

What data do you need for ad fraud detection?

Impression logs with device IDs, IP addresses, publisher IDs, and timestamps. Click events with timing data. Device fingerprint attributes and IP metadata (ASN, datacenter flag, geolocation). The key is having the relational connections between these entities, not just aggregated features per device.

How fast can graph-based fraud detection adapt to new tactics?

Graph models retrain on new data continuously, detecting novel fraud patterns within days of their emergence. Because the model learns structural anomalies rather than specific rule signatures, new bot tactics that change individual device behavior but maintain coordinated network structure are caught immediately without writing new rules.

Bottom line: An ad network processing $2B in annual spend recovers $120-160M by catching sophisticated bot networks that rule-based systems miss. Kumo's graph reveals coordinated fraud clusters across devices, IPs, and publishers that appear legitimate in isolation.

Topics covered

ad fraud detection AI · bot traffic detection · invalid traffic ML · click fraud prediction · impression fraud model · KumoRFM fraud · programmatic fraud detection · ad verification AI

One Platform. One Model. Infinite Predictions.

KumoRFM

Relational Foundation Model

Turn structured relational data into predictions in seconds. KumoRFM delivers zero-shot predictions that rival months of traditional data science. No training, feature engineering, or infrastructure required. Just connect your data and start predicting.

For critical use cases, fine-tune KumoRFM on your data using the Kumo platform and Research Agent for 30%+ higher accuracy than traditional models.

Book a demo and get a free trial of the full platform: research agent, fine-tune capabilities, and forward-deployed engineer support.