What data is needed for lead scoring?

Kumo connects directly to your existing relational tables: LEADS, ACTIVITIES, ORDERS. No ETL or feature engineering required.

Use Case #1 · Binary Classification · Lead Scoring

Lead Scoring

“Which leads will convert to a paying customer in the next 30 days?”

Book a demo and get a free trial of the full platform: research agent, fine-tune capabilities, and forward-deployed engineer support.


A real-world example

Which leads will convert to a paying customer in the next 30 days?

Sales teams waste 60% of their time on leads that never convert. Most lead scoring today relies on demographic rules (company size plus job title) and misses behavioral and relational signals entirely. The result is bloated pipelines, burned-out SDRs, and missed quotas. If you could score leads by actual conversion probability, reps could focus on the 20% of leads that drive 80% of pipeline, shortening sales cycles and improving win rates.

Quick answer

Lead scoring predicts which leads will convert to paying customers within a defined time window. The best models go beyond demographic rules (company size + job title) by learning from behavioral signals, product usage patterns, and relational connections like 'leads whose colleagues at the same company already purchased.' Graph-based lead scoring delivers 3x better accuracy than rule-based approaches.

Approaches compared

4 ways to solve this problem

1. Rule-Based Scoring (BANT / Demographic)

Assign points based on demographic fit: company size, industry, job title, budget, authority, need, timing. The default in most CRMs (Salesforce, HubSpot).

Best for

Teams with no ML capability that need a quick scoring system. Good for initial lead qualification when the ICP is well-defined.

Watch out for

Rules miss behavioral signals entirely. A VP at a Fortune 500 who never opens emails scores higher than a director at a mid-market company who requested a demo, viewed pricing, and invited colleagues to a webinar. Expect 30-40% accuracy.
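To make the failure mode concrete, here is a minimal sketch of a points-based scorer. The point values, field names, and both example leads are hypothetical, chosen only to mirror the VP versus director case above; they are not taken from Salesforce, HubSpot, or any real rule set.

```python
# Hypothetical point tables; real CRM rule sets differ from team to team.
TITLE_POINTS = {"VP": 30, "Director": 20, "Manager": 10}
SEGMENT_POINTS = {"enterprise": 30, "mid_market": 15, "smb": 5}


def rule_score(lead: dict) -> int:
    """Score a lead on demographic fit only. Activity data is never read."""
    return (TITLE_POINTS.get(lead["title"], 0)
            + SEGMENT_POINTS.get(lead["segment"], 0))


# A disengaged VP at a large company (hypothetical lead).
vp = {"title": "VP", "segment": "enterprise", "activities": []}

# An engaged director at a mid-market company (hypothetical lead).
director = {
    "title": "Director",
    "segment": "mid_market",
    "activities": ["demo_request", "pricing_view", "webinar_invite"],
}

vp_score = rule_score(vp)              # 30 + 30 = 60
director_score = rule_score(director)  # 20 + 15 = 35
print(vp_score, director_score)
```

The director's three activities never enter the score, so the disengaged VP ranks first.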

2. Logistic Regression on CRM Data

Train a logistic regression model on CRM fields: source, industry, engagement score, days in pipeline. Interpretable coefficients help sales understand the scoring logic.

Best for

Teams that want data-driven scoring with interpretable results. A solid step up from rules when the feature set is limited to CRM fields.

Watch out for

Limited to features explicitly stored in the CRM. Misses cross-lead patterns (colleagues at the same company), product usage signals, and temporal sequences (pricing page viewed after webinar attendance).
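As a rough illustration of why the coefficients are easy to explain to sales, the sketch below scores one lead with an already-fitted logistic regression. The intercept, coefficient values, and feature names are hypothetical and do not come from any real model; the fitting step itself is omitted.

```python
import math

# Hypothetical fitted parameters, in log-odds units.
INTERCEPT = -2.0
COEFFICIENTS = {
    "source_is_referral": 0.8,   # referral vs. any other source
    "engagement_score": 0.03,    # per engagement point
    "days_in_pipeline": -0.01,   # per day spent in the pipeline
}


def contributions(features: dict) -> dict:
    """Log-odds contributed by each feature for a single lead."""
    return {name: COEFFICIENTS[name] * features[name] for name in COEFFICIENTS}


def conversion_probability(features: dict) -> float:
    """Apply the logistic (sigmoid) function to the linear score."""
    z = INTERCEPT + sum(contributions(features).values())
    return 1.0 / (1.0 + math.exp(-z))


lead = {"source_is_referral": 1, "engagement_score": 50, "days_in_pipeline": 30}
contribs = contributions(lead)
prob = conversion_probability(lead)

# Largest positive driver for this lead.
top_driver = max(contribs, key=contribs.get)
print(round(prob, 3), top_driver)
```

Each coefficient reads as a change in log-odds, so a rep can see which CRM field moved the score. Nothing in this model can see a colleague's activity or the order of events, which is the limitation described above.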

3. Gradient Boosted Trees (XGBoost) on Enriched Features

Train XGBoost on hand-crafted features from CRM, marketing automation, and product analytics. Captures non-linear relationships and feature interactions.

Best for

Teams with ML engineers and data from multiple systems. Good accuracy when features are well-engineered and regularly refreshed.

Watch out for

Feature engineering is the bottleneck. Each new signal (product usage, support interactions, webinar attendance) requires a new feature pipeline. The model treats each lead independently, missing network effects like 'other leads at the same company are also engaging.'
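The sketch below shows what one such hand-built feature pipeline might look like, using the sample LEADS and ACTIVITIES rows from this page. The feature names and the `build_features` helper are assumptions for illustration; the resulting dictionaries are the kind of flat row a tree model would be trained on.

```python
from datetime import date

# Signup dates from the sample LEADS table.
SIGNUP = {
    "L001": date(2025, 11, 1),
    "L002": date(2025, 11, 3),
    "L003": date(2025, 11, 5),
    "L004": date(2025, 11, 7),
}

# (activity_id, lead_id, activity_type, page, timestamp) from ACTIVITIES.
ACTIVITIES = [
    ("A101", "L001", "page_view", "/pricing", date(2025, 11, 2)),
    ("A102", "L001", "demo_request", "/demo", date(2025, 11, 4)),
    ("A103", "L002", "page_view", "/blog", date(2025, 11, 4)),
    ("A104", "L003", "page_view", "/pricing", date(2025, 11, 6)),
    ("A105", "L004", "email_click", "/case-study", date(2025, 11, 8)),
]


def build_features(lead_id: str) -> dict:
    """Hand-crafted features for one lead; each new signal needs new code here."""
    rows = [a for a in ACTIVITIES if a[1] == lead_id]
    pricing_dates = [a[4] for a in rows
                     if a[2] == "page_view" and a[3] == "/pricing"]
    return {
        "n_activities": len(rows),
        "viewed_pricing": bool(pricing_dates),
        "requested_demo": any(a[2] == "demo_request" for a in rows),
        "days_to_first_pricing_view": (
            (min(pricing_dates) - SIGNUP[lead_id]).days if pricing_dates else None
        ),
    }


features = {lead_id: build_features(lead_id) for lead_id in SIGNUP}
print(features["L001"])
```

Every row is computed from that lead's own activities only, so nothing about other leads at the same company can reach the model unless someone writes another feature for it.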

4. KumoRFM (Graph Neural Networks on Relational Data)

Connects leads, activities, orders, and company relationships into a heterogeneous graph. Automatically discovers signals like 'leads who viewed pricing after a webinar' and 'leads whose colleagues at the same company already purchased.' Zero feature engineering required.

Best for

B2B sales teams with CRM, product usage, and marketing data who want maximum accuracy without building feature pipelines.

Watch out for

Requires activity-level data with timestamps (page views, email interactions, demo requests). If your CRM only has static lead attributes without behavioral data, enrich it first.

Key metric: On the SAP SALT benchmark, multi-table relational models reach 91% accuracy, versus 75% for single-table ML and 63% for rule-based scoring. Leads prioritized by graph-based scoring convert 3.2x more often than leads prioritized by rules.

Why relational data changes the answer

Lead L001 (Acme Corp, Finance, webinar source) viewed the pricing page 1 day after signup and requested a demo 2 days later. A flat model sees these as two features: 'viewed pricing = true' and 'requested demo = true.' But the sequence matters: pricing-then-demo is a much stronger signal than demo-then-pricing (which often indicates comparison shopping). The relational graph captures this temporal ordering automatically.
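A small sketch of the point about ordering: two leads with the same events in opposite order collapse to identical flat features, while the time-ordered sequence keeps them apart. L001's events come from the sample ACTIVITIES table; `reversed_lead` and both helper functions are hypothetical.

```python
from datetime import date


def flat_features(events):
    """Boolean flags, as a single-table model would see them."""
    types = {event_type for event_type, _ in events}
    return {
        "viewed_pricing": "pricing_view" in types,
        "requested_demo": "demo_request" in types,
    }


def ordered_sequence(events):
    """Event types sorted by timestamp, preserving who-came-first."""
    return tuple(event_type for event_type, _ in
                 sorted(events, key=lambda event: event[1]))


# L001: pricing page first, demo request two days later.
l001 = [("pricing_view", date(2025, 11, 2)),
        ("demo_request", date(2025, 11, 4))]

# Hypothetical lead with the same two events in the opposite order.
reversed_lead = [("demo_request", date(2025, 11, 2)),
                 ("pricing_view", date(2025, 11, 4))]

same_flat = flat_features(l001) == flat_features(reversed_lead)
same_sequence = ordered_sequence(l001) == ordered_sequence(reversed_lead)
print(same_flat, same_sequence)
```

The flat view cannot distinguish the two leads; the sequence view can.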

More importantly, L001 is connected to 2 existing customers in the Finance industry. These connections live in the company-to-company relationship graph, not in any single lead's attributes. The GNN propagates signals from converted leads to unconverted ones at the same company, in the same industry, or with similar engagement patterns. On the SAP SALT benchmark, models with access to multi-table relational signals achieve 91% accuracy vs 75% for single-table models. For lead scoring specifically, the relational signals (colleague conversions, industry peer behavior, engagement sequence patterns) are often more predictive than the lead's own demographic attributes. This is why graph-based lead scoring delivers 3.2x higher conversion rates than rule-based scoring.
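The following sketch illustrates the idea of a relational signal that lives outside a lead's own row. It builds a tiny typed graph from the sample LEADS and ORDERS rows on this page and walks lead → industry → lead → order. It is an illustration only: it is not how Kumo constructs or traverses its graph, it ignores timestamps for brevity (a real pipeline must only use orders placed before the prediction time), and the helper names are assumptions. The sample rows contain only one converted Finance peer; the two customer connections mentioned above would come from customer data not shown here.

```python
from collections import defaultdict

# From the sample LEADS and ORDERS tables.
LEAD_INDUSTRY = {"L001": "Finance", "L002": "Retail",
                 "L003": "Healthcare", "L004": "Finance"}
ORDER_LEAD = {"O501": "L001", "O502": "L004"}

# Undirected adjacency list over typed nodes: (node_type, node_id).
graph = defaultdict(set)


def add_edge(a, b):
    graph[a].add(b)
    graph[b].add(a)


for lead_id, industry in LEAD_INDUSTRY.items():
    add_edge(("lead", lead_id), ("industry", industry))
for order_id, lead_id in ORDER_LEAD.items():
    add_edge(("lead", lead_id), ("order", order_id))


def converted_industry_peers(lead_id):
    """Other leads in the same industry that have at least one order."""
    me = ("lead", lead_id)
    peers = set()
    for node in graph[me]:
        if node[0] != "industry":
            continue
        for other in graph[node]:
            has_order = any(n[0] == "order" for n in graph[other])
            if other != me and has_order:
                peers.add(other[1])
    return sorted(peers)


print(converted_industry_peers("L001"))
```

No column in L001's own row holds this value; it only appears once the tables are joined through the shared industry node.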

Scoring leads with demographic rules is like hiring employees based solely on their resume. A relational model also checks their references (connected customers), reviews their work samples (product usage behavior), and sees that their former colleagues who joined your company all became top performers. The resume gets you to the interview; the relational context tells you who to hire.

How KumoRFM solves this

Relational intelligence for smarter acquisition

Kumo builds a heterogeneous graph across your CRM, product usage, support interactions, and marketing touchpoints. Instead of hand-crafted rules, the graph neural network automatically discovers signals like 'leads whose colleagues at the same company already purchased' or 'leads who viewed pricing pages after a webinar.' The model learns from every relationship in your data — not just flat lead attributes — delivering conversion probabilities that are 3x more accurate than rule-based scoring, with zero feature engineering.

From data to predictions

See the full pipeline in action

Connect your tables, write a PQL query, and get predictions with built-in explainability — all in minutes, not months.

Your data

The relational tables Kumo learns from

LEADS

lead_id	company	industry	source	signup_date
L001	Acme Corp	Finance	webinar	2025-11-01
L002	Beta Ltd	Retail	organic	2025-11-03
L003	Gamma Inc	Healthcare	paid_search	2025-11-05
L004	Delta Co	Finance	referral	2025-11-07

ACTIVITIES

activity_id	lead_id	activity_type	page	timestamp
A101	L001	page_view	/pricing	2025-11-02
A102	L001	demo_request	/demo	2025-11-04
A103	L002	page_view	/blog	2025-11-04
A104	L003	page_view	/pricing	2025-11-06
A105	L004	email_click	/case-study	2025-11-08

ORDERS

order_id	lead_id	amount	timestamp
O501	L001	$24,000	2025-11-15
O502	L004	$18,500	2025-11-20

Write your PQL query

Describe what to predict in 2–3 lines — Kumo handles the rest

PQL

PREDICT COUNT(ORDERS.*, 0, 30, days) > 0
FOR EACH LEADS.LEAD_ID
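One way to read the query above is as a per-lead binary label: does the lead have at least one order in the 30 days after a given anchor time? The Python sketch below spells that out on the sample ORDERS rows. Two assumptions are made for illustration: the anchor time is each lead's signup date, and the window is taken as (anchor, anchor + 30 days]. Check the PQL documentation for the exact boundary semantics.

```python
from datetime import date, timedelta

# Signup dates from the sample LEADS table, used here as anchor times.
SIGNUP = {
    "L001": date(2025, 11, 1),
    "L002": date(2025, 11, 3),
    "L003": date(2025, 11, 5),
    "L004": date(2025, 11, 7),
}

# (order_id, lead_id, amount_usd, timestamp) from the sample ORDERS table.
ORDERS = [
    ("O501", "L001", 24000, date(2025, 11, 15)),
    ("O502", "L004", 18500, date(2025, 11, 20)),
]


def converts_within(lead_id: str, anchor: date, days: int = 30) -> bool:
    """True if the lead has at least one order in (anchor, anchor + days]."""
    window_end = anchor + timedelta(days=days)
    order_count = sum(
        1 for _, order_lead, _, ts in ORDERS
        if order_lead == lead_id and anchor < ts <= window_end
    )
    return order_count > 0


labels = {lead_id: converts_within(lead_id, anchor)
          for lead_id, anchor in SIGNUP.items()}
print(labels)
```

These are historical labels computed from known orders. The model's job is to predict the same quantity for leads whose 30-day window has not yet elapsed.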

Prediction output

Every entity gets a score, updated continuously

LEAD_ID	TIMESTAMP	TARGET_PRED	True_PROB
L001	2025-11-01	True	0.89
L002	2025-11-03	False	0.12
L003	2025-11-05	True	0.74
L004	2025-11-07	True	0.81
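A minimal sketch of how a sales team might consume this output: keep leads above a probability cutoff and work them in descending order. The 0.5 cutoff and the `prioritize` helper are assumptions for illustration; in practice the cutoff would be tuned to rep capacity.

```python
# (LEAD_ID, True_PROB) pairs from the prediction output above.
PREDICTIONS = [("L001", 0.89), ("L002", 0.12), ("L003", 0.74), ("L004", 0.81)]


def prioritize(predictions, threshold=0.5):
    """Return lead IDs at or above the threshold, highest probability first."""
    kept = [(lead_id, prob) for lead_id, prob in predictions if prob >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [lead_id for lead_id, _ in kept]


call_queue = prioritize(PREDICTIONS)
print(call_queue)  # ['L001', 'L004', 'L003']
```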

Understand why

Every prediction includes feature attributions — no black boxes

Lead L001 — Acme Corp

Predicted: True (89% probability)

Top contributing features

Feature	Value	Attribution
Viewed pricing page within 3 days of signup	True	34%
Requested demo after webinar attendance	True	27%
Company industry	Finance	18%
Lead source	webinar	13%
Connected to 2 existing customers in same industry	2 connections	8%

Feature attributions are computed automatically for every prediction. No separate tooling required.

PQL Documentation

Learn the Predictive Query Language — SQL-like syntax for defining any prediction task in 2–3 lines.


Python SDK

Integrate Kumo predictions into your pipelines. Train, evaluate, and deploy models programmatically.


Explainability Docs

Understand feature attributions, model evaluation metrics, and how to build trust with stakeholders.


Frequently asked questions

Common questions about lead scoring

How accurate is AI lead scoring compared to rule-based scoring?

Graph-based lead scoring delivers 3.2x higher conversion rates than rule-based scoring. On the SAP SALT benchmark, multi-table relational models achieve 91% accuracy vs 75% for single-table ML and 63% for rules. The improvement comes from behavioral sequences, cross-lead relationships, and product usage signals that rules cannot capture.

What data do I need for predictive lead scoring?

At minimum: a leads table with attributes and an activities table with timestamped interactions (page views, email clicks, demo requests). High-value additions include product usage data (for PLG companies), existing customer data (to learn from successful conversions), and company/industry relationship data. The more relational tables you connect, the more cross-lead signals the graph discovers.
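Before connecting tables, a quick readiness check along these lines can catch the two most common gaps: activities that do not join to a lead, and activities without a usable timestamp. This is a hypothetical helper written for illustration, not part of any Kumo tooling; the column names follow the sample tables on this page and ISO-formatted date strings are assumed.

```python
from datetime import date


def check_minimum_data(leads, activities):
    """Return a list of problems that would block activity-based scoring."""
    problems = []
    lead_ids = {row["lead_id"] for row in leads}
    for row in activities:
        activity_id = row.get("activity_id")
        if row.get("lead_id") not in lead_ids:
            problems.append(f"{activity_id}: unknown lead_id")
        try:
            # Assumes ISO dates such as "2025-11-02".
            date.fromisoformat(row.get("timestamp") or "")
        except ValueError:
            problems.append(f"{activity_id}: missing or invalid timestamp")
    return problems


leads = [{"lead_id": "L001"}, {"lead_id": "L002"}]
good = [{"activity_id": "A101", "lead_id": "L001", "timestamp": "2025-11-02"}]
# Hypothetical bad row: lead does not exist and the timestamp is empty.
bad = [{"activity_id": "A999", "lead_id": "L404", "timestamp": ""}]

print(check_minimum_data(leads, good))
print(check_minimum_data(leads, bad))
```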

How often should lead scores be updated?

Real-time or at least daily. A lead who requested a demo 10 minutes ago should score higher than one who did so last week. Batch scoring on weekly cycles misses the urgency signals that drive conversion. Kumo updates scores as new activity data flows in.

Can lead scoring work for product-led growth (PLG) companies?

Absolutely. PLG companies have the richest lead scoring signals because free-tier users generate product usage data before any sales interaction. Features like 'invited a teammate,' 'used the API,' or 'exported data' are strong conversion predictors. Graph models connect usage data to lead data automatically, scoring leads by actual product engagement rather than marketing activity alone.

Bottom line: Kumo-scored leads convert 3.2x more often than leads scored by rules. Sales reps can reclaim the 60% of prospecting time now spent on leads that never convert by focusing on the leads the model identifies as high-probability converters.


Topics covered

lead scoring AI, predictive lead scoring, lead conversion prediction, graph neural network lead scoring, PQL lead scoring, relational deep learning, KumoRFM, automated feature engineering, B2B lead scoring, sales pipeline prediction, CRM AI


One Platform. One Model. Infinite Predictions.

KumoRFM

Relational Foundation Model

Turn structured relational data into predictions in seconds. KumoRFM delivers zero-shot predictions that rival months of traditional data science. No training, feature engineering, or infrastructure required. Just connect your data and start predicting.

For critical use cases, fine-tune KumoRFM on your data using the Kumo platform and Research Agent for 30%+ higher accuracy than traditional models.
