
How to Predict Customer Lifetime Value from Transaction Data with ML

RFM analysis works when you have one product and one table. Most businesses have dozens of tables, hundreds of product categories, and customers who influence each other through referrals, households, and shared behavior patterns. Here is how to move from single-table CLV estimation to graph-based CLV prediction, and how to use accurate CLV to personalize outreach timing, channel, and offer for every customer.

TL;DR

  • RFM analysis and BG/NBD models estimate CLV from a single transaction table. They work for simple, single-product businesses. They miss product affinity, referral value, household effects, and cross-category patterns.
  • XGBoost on engineered features improves on RFM by incorporating more signals, but it still operates on a flat table. A customer's lifetime value depends on their network: what similar customers do, who they referred, and how their engagement compares to cohort patterns. Flat tables cannot represent networks.
  • Graph-based ML reads the full customer-product-category-referral graph. On the SAP SALT benchmark, KumoRFM achieves 91% accuracy vs 75% for PhD data scientists with XGBoost and 63% for LLM+AutoML. On RelBench, KumoRFM zero-shot scores 76.71 AUROC vs 62.44 for LightGBM with manual features.
  • Once you have accurate CLV, the next step is personalizing outreach: predicting optimal timing (when engagement probability peaks), best channel (email vs push vs in-app), and best offer (which products maximize long-term value). These are all multi-table prediction tasks that graph-based models handle natively.
  • KumoRFM reads raw relational tables directly. No feature engineering, no graph construction, no separate models for CLV, timing, and channel. One platform, PQL queries, done.

Every CLV guide starts the same way: compute recency, frequency, and monetary value. Score each customer on those three dimensions. Segment them into buckets. Multiply average order value by purchase frequency by expected lifespan. Done.

And for a single-product subscription business with one transaction table, that actually works. The problem is that almost no real business looks like that. You have customers buying across product categories. You have referral programs where one customer brings in five others. You have households where three family members share an account or influence each other's purchases. You have engagement data across email, push, in-app, and SMS. You have product catalogs with thousands of items.

A customer's lifetime value is not just about what they bought. It is about what they will buy next, who they will bring with them, and how their behavior compares to thousands of similar customers across dozens of dimensions. That is a graph problem. And most CLV tools treat it as a spreadsheet problem.

Four approaches to CLV prediction, compared directly

Before getting into the details, here is a head-to-head comparison of the four main approaches to predicting customer lifetime value.

| Dimension | RFM Analysis | Statistical (BG/NBD + Gamma-Gamma) | ML on Flat Features (XGBoost) | Graph-Based ML (KumoRFM) |
| --- | --- | --- | --- | --- |
| Input data | Single transaction table | Single transaction table | Flat feature table (engineered from multiple sources) | Raw relational tables (customers, transactions, products, referrals) |
| What it captures | Recency, frequency, monetary value | Purchase rate, dropout probability, monetary distribution | Engineered features: RFM + behavioral + demographic | Full customer-product-category-referral-household graph |
| Handles product affinity | No | No | Only if manually engineered | Yes, automatically from product and category tables |
| Captures referral value | No | No | Only if manually engineered | Yes, reads referral graph directly |
| Captures household effects | No | No | Only if manually engineered | Yes, reads household linkage tables |
| Handles irregular timing | Poorly | Yes, core strength | If time-based features are engineered | Yes, learns temporal patterns from raw timestamps |
| Feature engineering required | Minimal | Minimal | Heavy (12+ hours per task on average) | None |
| Accuracy on enterprise data | Baseline | Baseline + 5-10% | 75% (SAP SALT, PhD data scientist) | 91% (SAP SALT, zero-shot) |
| Can personalize timing/channel | No | No | Requires separate models | Yes, additional PQL queries against same tables |
| Team required | Analyst | Analyst or data scientist | 2-3 data scientists | 1 ML engineer or analyst |

Four approaches to CLV prediction compared across 10 dimensions. Each step up captures more relational context. Graph-based ML captures the full network of relationships that drive lifetime value.

Approach 1: RFM analysis

RFM analysis scores customers on three dimensions: how recently they purchased (Recency), how often they purchase (Frequency), and how much they spend (Monetary value). Each dimension gets a score (typically 1-5), and customers are segmented into buckets like "Champions" (high R, high F, high M) or "At Risk" (low R, high F, high M).

RFM is a good starting point because it is simple, interpretable, and requires only a single transaction table. You can implement it in SQL in an afternoon. For businesses with a single product and regular purchase cycles (think: weekly grocery delivery, monthly subscription box), RFM segments correlate reasonably well with future value.

But RFM has hard limits. It treats all purchases as interchangeable. A customer who buys five units of your cheapest item and a customer who buys five units of your most expensive item get the same Frequency score. It ignores product categories entirely. And it is a segmentation tool, not a prediction model. It tells you who your best customers were. It does not predict who your best customers will be.
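As a concrete baseline, here is a minimal RFM scoring sketch in pandas. The table and column names (`tx`, `order_date`, `amount`) are illustrative, not a real schema, and a production pipeline would typically bucket with pd.qcut over a full customer base rather than percentile-rank a handful of rows:

```python
import pandas as pd

# Toy transaction log; in practice this comes from your warehouse.
tx = pd.DataFrame({
    "customer_id": ["C1", "C1", "C2", "C3", "C3", "C3"],
    "order_date": pd.to_datetime([
        "2025-01-05", "2025-03-01", "2024-11-20",
        "2025-02-10", "2025-02-25", "2025-03-10",
    ]),
    "amount": [40.0, 60.0, 25.0, 80.0, 55.0, 120.0],
})
as_of = pd.Timestamp("2025-04-01")

# One row per customer: the three RFM dimensions.
rfm = tx.groupby("customer_id").agg(
    recency_days=("order_date", lambda d: (as_of - d.max()).days),
    frequency=("order_date", "count"),
    monetary=("amount", "sum"),
)

def score(series, higher_is_better=True):
    # 1-5 score from percentile rank (recency is inverted by the caller).
    pct = series.rank(pct=True, ascending=higher_is_better)
    return (pct * 5).round().clip(lower=1).astype(int)

rfm["R"] = score(rfm["recency_days"], higher_is_better=False)  # recent = good
rfm["F"] = score(rfm["frequency"])
rfm["M"] = score(rfm["monetary"])
print(rfm)
```

Note that the scores rank customers only against each other: the same purchase history earns a different score in a different customer base, which is one reason RFM is a segmentation tool rather than a prediction model.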

Approach 2: Statistical models (BG/NBD + Gamma-Gamma)

The BG/NBD (Beta-Geometric/Negative Binomial Distribution) model is the academic standard for CLV prediction. It models two things simultaneously: the rate at which a customer makes purchases and the probability that a customer has "died" (stopped being a customer forever). Paired with the Gamma-Gamma model, which estimates the distribution of monetary value per transaction, you get a dollar-denominated CLV estimate.

This is a real improvement over raw RFM. The BG/NBD model handles irregular purchase timing, accounts for customer dropout, and produces calibrated probability estimates. The Python lifetimes library makes it accessible. If you have a single transaction table and need a CLV number per customer, this is the best single-table approach.

The limitation: it still reads one table. It does not know what products customers bought, which categories they browse, whether they referred anyone, or how their engagement patterns compare to similar customers. For a SaaS business with one subscription product, that is fine. For a multi-category retailer or a marketplace, it leaves most of the predictive signal on the floor.
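The frequency/recency/T summary that BG/NBD consumes can be derived directly from a transactions table. A hedged pandas sketch of that preparation step (the lifetimes library ships a helper, lifetimes.utils.summary_data_from_transaction_data, that does the same thing; the fitting calls are shown only in comments since they require the library):

```python
import pandas as pd

# Illustrative transactions table.
tx = pd.DataFrame({
    "customer_id": ["C1", "C1", "C1", "C2"],
    "order_date": pd.to_datetime(
        ["2025-01-01", "2025-02-01", "2025-03-15", "2025-02-20"]),
    "amount": [30.0, 45.0, 60.0, 25.0],
})
observation_end = pd.Timestamp("2025-04-01")

dates = tx.groupby("customer_id")["order_date"]
summary = pd.DataFrame({
    # BG/NBD convention: frequency counts REPEAT purchases, not all purchases.
    "frequency": dates.count() - 1,
    # recency: days from a customer's first purchase to their last.
    "recency": (dates.max() - dates.min()).dt.days,
    # T: customer "age", first purchase to the end of the observation window.
    "T": (observation_end - dates.min()).dt.days,
})

# Fitting (requires `pip install lifetimes`):
#   from lifetimes import BetaGeoFitter
#   bgf = BetaGeoFitter(penalizer_coef=0.001)
#   bgf.fit(summary["frequency"], summary["recency"], summary["T"])
#   expected_90d = bgf.conditional_expected_number_of_purchases_up_to_time(
#       90, summary["frequency"], summary["recency"], summary["T"])
print(summary)
```

Notice what the summary contains: three numbers per customer, all derived from timestamps. Everything about products, categories, and referrals has already been discarded before the model sees the data.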

Approach 3: XGBoost on engineered CLV features

The data science team approach: extract features from multiple tables, flatten everything into one row per customer, and train XGBoost (or LightGBM, or a random forest) to predict future value. Typical features include RFM metrics plus average order value, purchase frequency by category, days between orders, channel engagement rates, support ticket counts, and demographic data.

This works better than RFM and BG/NBD because it can incorporate more signals. The accuracy improvement depends entirely on how good your features are, which depends on how much time your data scientists spend on feature engineering. On the SAP SALT benchmark, PhD data scientists with XGBoost achieve 75% accuracy. That is real skill and real work.

But the flat table constraint bites here too. Consider what you lose when you flatten a customer's relationship network into columns:

  • Product affinity is reduced to counts. You can compute "number of categories purchased" or "top category by spend." But you cannot represent that Customer A buys running shoes and then running watches and then fitness trackers, following a sports-gear trajectory that predicts high future LTV. The trajectory is a path through a product graph. A column is a number.
  • Referral value is invisible. A customer who referred three friends who each became high-value customers is worth far more than their own purchases suggest. That referral tree is a graph structure. You can count "num_referrals = 3," but you cannot represent the quality and depth of the referral network in flat columns.
  • Cohort patterns are lost. Customers who joined during the same campaign, bought similar first products, and follow similar engagement curves have correlated future behavior. These cohort patterns are visible in a graph of customer-product-time relationships. They are not visible in individual feature rows.
  • Household influence disappears. If three members of a household are customers and two have churned, the third is at elevated risk. That signal lives in the household graph. A flat table does not know that these customers are connected.
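A sketch of what that flattening looks like in practice, with made-up table and column names. The point is the referral line: a three-node subtree becomes a single integer, and everything about who those referred customers are is gone before training starts:

```python
import pandas as pd

# Illustrative source tables.
orders = pd.DataFrame({
    "customer_id": ["C1", "C1", "C2"],
    "category": ["shoes", "watches", "shoes"],
    "amount": [90.0, 150.0, 60.0],
})
referrals = pd.DataFrame({
    "referrer_id": ["C1", "C1", "C1"],
    "referred_id": ["C3", "C4", "C5"],
})

# Flatten everything into one row per customer.
feats = pd.DataFrame(index=pd.Index(["C1", "C2"], name="customer_id"))
feats["total_spend"] = orders.groupby("customer_id")["amount"].sum()
feats["n_categories"] = orders.groupby("customer_id")["category"].nunique()
# The whole referral tree collapses into one count: whether C3/C4/C5
# converted, spent anything, or referred others is invisible to the model.
feats["num_referrals"] = referrals.groupby("referrer_id").size()
feats = feats.fillna(0)
# XGBoost/LightGBM would now train on `feats`, one row per customer.
print(feats)
```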

Approach 4: Graph-based ML (KumoRFM)

Graph-based ML reads the actual network of relationships between customers, products, categories, referrals, and households. Instead of flattening this network into a feature table, it operates on the graph directly, learning which patterns in the network predict high lifetime value.

Think about what a customer's lifetime value actually depends on:

  1. What products they buy. Not just how many, but which ones and in what sequence. A customer moving from basic items to premium items is on a different trajectory than one buying the same basic item repeatedly.
  2. What similar customers do. Customers with similar purchase histories tend to have similar future behavior. This is the collaborative filtering insight, but applied to CLV instead of recommendations.
  3. Whether they refer others. Referral behavior is a strong signal of engagement and advocacy. And the value of referred customers compounds the referrer's effective LTV.
  4. How their engagement trajectory compares to cohort patterns. Is this customer's 90-day engagement curve tracking the "high-value" cohort or the "churn-risk" cohort? The answer comes from comparing trajectories across the customer graph.

Every one of these signals is a graph pattern. Product sequences are paths through the product graph. Customer similarity is neighborhood overlap in the customer-product bipartite graph. Referral value is a tree structure. Cohort comparison is a temporal pattern across connected nodes.

KumoRFM reads raw relational tables and automatically constructs the heterogeneous graph that connects them. It discovers which graph patterns predict lifetime value for your specific data, without any feature engineering or graph construction on your part.

The benchmark evidence

The SAP SALT benchmark tests prediction accuracy on real enterprise relational data across multiple task types, including customer value prediction. Here is how the approaches compare:

| Approach | Accuracy | What it means |
| --- | --- | --- |
| LLM + AutoML | 63% | Language model generates features, AutoML selects model |
| PhD Data Scientist + XGBoost | 75% | Expert spends weeks hand-crafting features, tunes XGBoost |
| KumoRFM (zero-shot) | 91% | No feature engineering, no training, reads relational tables directly |

SAP SALT benchmark: KumoRFM outperforms expert-tuned XGBoost by 16 percentage points. The gap comes from relational patterns (product affinity, referral networks, cohort dynamics) that flat feature tables cannot represent.

On the RelBench benchmark across 7 databases and 30 prediction tasks:

| Approach | AUROC | Feature engineering time |
| --- | --- | --- |
| LightGBM + manual features | 62.44 | 12.3 hours per task |
| KumoRFM zero-shot | 76.71 | ~1 second |
| KumoRFM fine-tuned | 81.14 | Minutes |

KumoRFM zero-shot outperforms manually engineered LightGBM by 14+ AUROC points. Fine-tuning pushes the gap to nearly 19 points.

From CLV prediction to personalized outreach

Accurate CLV is not the end goal. It is the foundation. Once you know which customers are worth the most, the question becomes: how do you maximize that value? The answer is personalized outreach, and it breaks into three prediction problems.

  1. When to reach out (timing). Every customer has engagement patterns. Some check email Monday mornings. Some engage with push notifications during commute hours. Some browse your app late at night. Predicting when a specific customer is most likely to engage is a prediction task on engagement log data, correlated with purchase data, product views, and session history. This is inherently a multi-table problem.
  2. Which channel to use. Email, push notification, in-app message, SMS, direct mail. Each customer has different channel preferences, and those preferences vary by context. A customer might prefer email for product recommendations but push for flash sales. Predicting optimal channel per customer per message type requires reading interaction history across all channels, correlated with conversion outcomes.
  3. What to offer. Which product, promotion, or content maximizes long-term value (not just immediate conversion)? This requires understanding product affinity, purchase trajectories, and how different offers affect retention. A 20% discount might drive a purchase but train the customer to wait for discounts, lowering long-term LTV. The optimal offer depends on the customer's position in the product graph and their similarity to other customer trajectories.

With traditional ML, each of these is a separate model with separate feature engineering. With KumoRFM, each is a PQL query against the same connected tables.

Traditional CLV + outreach pipeline

  • Compute RFM scores from transaction table (1-2 days)
  • Engineer features from 4-6 tables for XGBoost (2-3 weeks)
  • Train CLV model, validate, deploy (1-2 weeks)
  • Build separate timing model with engagement features (2-3 weeks)
  • Build separate channel preference model (2-3 weeks)
  • Build separate offer optimization model (2-3 weeks)
  • Maintain 4 pipelines, 4 feature stores, 4 model versions

KumoRFM CLV + outreach

  • Connect to data warehouse: customers, transactions, products, engagement, referrals
  • Write PQL: PREDICT ltv_12_month FOR EACH customers.customer_id
  • Write PQL: PREDICT best_engagement_hour FOR EACH customers.customer_id
  • Write PQL: PREDICT best_channel FOR EACH customers.customer_id
  • Write PQL: PREDICT next_best_product FOR EACH customers.customer_id
  • One platform, same tables, no feature engineering, no separate pipelines

PQL Query

```
PREDICT ltv_12_month
FOR EACH customers.customer_id
WHERE customers.signup_date > '2025-01-01'
```

One PQL query predicts 12-month customer lifetime value by reading the full customer-product-category-referral graph. KumoRFM discovers which relational patterns (product trajectories, referral networks, cohort dynamics) predict future value for your specific data.

Output

| customer_id | predicted_ltv_12m | ltv_rfm_estimate | why_kumo_differs |
| --- | --- | --- | --- |
| C-4401 | $2,840 | $1,200 | Cross-category trajectory matches high-LTV cohort pattern |
| C-4402 | $380 | $950 | Referral network is inactive; cohort engagement declining |
| C-4403 | $4,200 | $800 | Referred 3 high-value customers; product trajectory is accelerating |
| C-4404 | $150 | $600 | Household members churned; engagement pattern matches dropout cohort |

Why CLV is a graph problem

The core insight is simple: a customer's future value depends on their position in a network, not just their own transaction history. Here are the specific graph patterns that drive CLV prediction accuracy:

Product trajectory patterns

The sequence of products a customer buys traces a path through the product graph. Customers who follow certain paths (basic to mid-range to premium) have predictably different LTV than customers who follow other paths (premium one-time purchase, no repeat). These paths are graph structures.

  • Best for: Multi-category retailers and marketplaces where product mix predicts long-term value.
  • Watch out for: Requires a product catalog table linked to orders. Single-product businesses get limited signal here.

Referral network value

A customer who refers others creates a tree of downstream value. The depth and quality of that tree is a strong LTV predictor. A customer with three referrals who each referred two more is on a different trajectory than a customer with three referrals who all churned.

  • Best for: Businesses with referral programs, viral products, or ambassador networks.
  • Watch out for: Referral data must be tracked with attribution. If your referral table is incomplete, this signal is noisy.
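A toy sketch of that tree walk, assuming a simple referrer-to-referred mapping; all names and dollar amounts are invented. This is the quantity that a flat `num_referrals` column cannot carry:

```python
# referrer -> list of customers they referred (made-up data; in production
# this comes from your referrals and transactions tables).
referrals = {"C1": ["C3", "C4"], "C3": ["C6"]}
spend = {"C1": 500.0, "C3": 300.0, "C4": 0.0, "C6": 900.0}

def downstream_value(customer: str) -> float:
    """Total spend across the whole referral subtree below `customer`."""
    total = 0.0
    for child in referrals.get(customer, []):
        total += spend.get(child, 0.0) + downstream_value(child)
    return total

# C1's own purchases are 500, but the referral subtree adds 300 + 0 + 900.
print(downstream_value("C1"))  # -> 1200.0
```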

Category co-purchase graphs

Which product categories are purchased together, and in what order? Customers who bridge multiple category clusters (buy from sports AND electronics AND home) tend to have higher LTV than single-category buyers. The category co-purchase structure is a graph.

  • Best for: Broad-catalog retailers where cross-category engagement correlates with retention.
  • Watch out for: Category taxonomy must be consistent. Inconsistent categorization creates false co-purchase patterns.

Cohort similarity in the customer-product bipartite graph

Two customers with overlapping purchase histories (buying many of the same products) tend to have correlated future behavior. This is neighborhood overlap in the customer-product graph, and it is far richer than any flat-table similarity metric.

  • Best for: Subscription businesses and marketplaces with large customer bases where collaborative patterns emerge.
  • Watch out for: Needs sufficient transaction volume per customer. New customers with 1-2 purchases have sparse graph neighborhoods.
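One simple way to quantify that neighborhood overlap is Jaccard similarity over each customer's purchased-product set; a minimal sketch with made-up customers and product IDs (graph-based models learn far richer similarity than this, but the intuition is the same):

```python
# customer -> set of product IDs purchased (toy data).
purchases = {
    "C1": {"p1", "p2", "p3", "p4"},
    "C2": {"p2", "p3", "p4", "p5"},
    "C3": {"p9"},
}

def jaccard(a: str, b: str) -> float:
    """Overlap of two customers' product neighborhoods, in [0, 1]."""
    union = purchases[a] | purchases[b]
    return len(purchases[a] & purchases[b]) / len(union) if union else 0.0

print(jaccard("C1", "C2"))  # 3 shared of 5 distinct products -> 0.6
print(jaccard("C1", "C3"))  # disjoint neighborhoods -> 0.0
```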

Household and social influence

Customers connected through households, shared accounts, or social referral links influence each other. When connected customers start churning, the remaining connected customers are at higher risk. This signal propagates through the graph.

  • Best for: Businesses with household accounts, family plans, or shared-membership products.
  • Watch out for: Requires household or account-linkage data. Without explicit linkage tables, this signal is unavailable.

Getting started: what you need

Here is the practical checklist for moving from single-table CLV estimation to graph-based CLV prediction with personalized outreach:

  1. Audit your tables. At minimum you need customers and transactions. Each additional table adds predictive signal: products, categories, engagement logs, referrals, household linkages, support tickets.
  2. Define your prediction target. Is it 6-month LTV? 12-month LTV? Revenue? Gross margin? Number of purchases? Be specific. The model optimizes for exactly what you ask it to predict.
  3. Connect your data warehouse. KumoRFM reads from Snowflake, BigQuery, Redshift, Databricks, and other major warehouses. No ETL pipelines required.
  4. Write your PQL queries. Start with CLV prediction, then add timing, channel, and offer prediction as separate queries against the same tables.
  5. Validate against your existing approach. Run KumoRFM predictions alongside your current RFM or XGBoost scores. Compare accuracy on a holdout period. The lift from relational signals is typically visible within the first test.
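Step 5 can be as small as this: score the same holdout customers with your current approach and the candidate one, then compare error against realized 12-month value. All numbers below are invented for illustration:

```python
# Realized 12-month value on a holdout period (made-up figures).
actual_12m = {"C1": 2800.0, "C2": 400.0, "C3": 4100.0}
# Predictions from the incumbent and candidate approaches (also made up).
current_scores = {"C1": 1200.0, "C2": 950.0, "C3": 800.0}
candidate_scores = {"C1": 2840.0, "C2": 380.0, "C3": 4200.0}

def mae(pred: dict) -> float:
    """Mean absolute error against realized 12-month value."""
    return sum(abs(pred[c] - actual_12m[c]) for c in actual_12m) / len(actual_12m)

print(f"current MAE:   {mae(current_scores):,.0f}")
print(f"candidate MAE: {mae(candidate_scores):,.0f}")
```

In a real validation you would also check rank metrics (does the model order customers by value correctly?), since outreach budgets are usually allocated to the top deciles rather than to exact dollar estimates.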

Frequently asked questions

How do I predict customer lifetime value from our transaction data?

Start by deciding how much relational context you can use. If you only have a single transactions table, RFM analysis or a BG/NBD statistical model will give you a baseline CLV score based on recency, frequency, and monetary value. If you can join in product, category, and customer tables, XGBoost on engineered features will improve accuracy. If you also have referral data, household connections, or multi-category purchase history, a graph-based approach like KumoRFM will outperform flat-table methods because it reads the full customer-product-category-referral network. On the SAP SALT benchmark, KumoRFM achieves 91% accuracy vs 75% for PhD data scientists using XGBoost on manually engineered features.

How do I personalize outreach timing and channel for each customer with ML?

Personalizing outreach timing and channel is a multi-table prediction problem. You need to predict three things per customer: when their engagement probability peaks (timing), which channel they are most likely to respond on (email, push, in-app, SMS), and which offer maximizes their long-term value. Each prediction draws on different tables: engagement logs for timing, channel interaction history for channel preference, and purchase history plus product catalog for offer selection. Traditional approaches require building separate models for each prediction. KumoRFM handles all three as PQL queries against the same connected tables, automatically discovering cross-table patterns like 'customers who bought product X respond better to push notifications on Tuesday mornings.'

What is RFM analysis and when does it work for CLV prediction?

RFM analysis segments customers by three metrics: Recency (how recently they purchased), Frequency (how often they purchase), and Monetary value (how much they spend). Each customer gets an RFM score, and you use those segments to estimate future value. RFM works well for simple, single-product businesses with repeat purchases like subscription boxes or coffee shops. It breaks down for multi-category retailers, marketplaces, or any business where a customer's value depends on what they buy, who they referred, or how they interact across channels. RFM treats all purchases as interchangeable and ignores network effects entirely.

What is the BG/NBD model for customer lifetime value?

The BG/NBD (Beta-Geometric/Negative Binomial Distribution) model is a statistical approach that predicts how many future purchases a customer will make based on their transaction history. It models two processes: the rate at which a customer makes purchases and the probability that a customer has become inactive. Paired with the Gamma-Gamma model for monetary value, it produces a dollar-denominated CLV estimate. The BG/NBD model is a real step up from raw RFM because it handles irregular purchase timing and customer dropout. But it still only reads a single transaction table. It cannot incorporate product categories, referral networks, or cross-channel behavior.

Why does XGBoost underperform on CLV prediction compared to graph-based models?

XGBoost operates on a flat feature table: one row per customer, columns for engineered features like total spend, average order value, days since last purchase, and purchase frequency. These features capture the customer in isolation. But CLV is a network phenomenon. A customer who refers three high-value friends is worth more than their own purchases suggest. A customer who buys across five product categories has different retention dynamics than one who buys from one category. A customer in a household where others are churning is at higher risk. These patterns live in the connections between customers, products, and categories. XGBoost cannot read connections. On RelBench, KumoRFM zero-shot achieves 76.71 AUROC vs 62.44 for LightGBM with manually engineered features.

Can I predict the best channel and timing for each customer without a data science team?

With traditional ML, predicting optimal channel and timing per customer requires building separate models, each with its own feature engineering pipeline. That typically needs 2-3 data scientists working for weeks. With KumoRFM, a single analyst can write PQL queries like PREDICT next_channel FOR EACH customers.customer_id or PREDICT days_to_next_engagement FOR EACH customers.customer_id. The foundation model discovers cross-table patterns automatically, including timing patterns from engagement logs, channel preferences from interaction history, and product affinity from purchase data. No feature engineering, no separate pipelines.

How does graph-based ML improve CLV prediction accuracy?

Graph-based ML reads the network of relationships between customers, products, categories, referrals, and households. This captures four types of signal that flat-table models miss: (1) product affinity patterns across categories, showing which product combinations predict high LTV, (2) referral network effects, where customers who refer others have different value trajectories, (3) household and social influence, where a customer's behavior correlates with their connections, and (4) cohort trajectory patterns, where engagement curves of similar customers predict future behavior. On the SAP SALT enterprise benchmark, these relational signals account for the gap between 75% accuracy (XGBoost with expert features) and 91% accuracy (KumoRFM reading relational tables directly).

What data do I need to predict CLV with a graph-based approach?

At minimum, you need a customers table and a transactions table with timestamps. That alone will outperform RFM because the model learns temporal patterns automatically. Adding a products table and order-items table lets the model learn product affinity and category-level patterns. Adding referral data (who referred whom) lets the model capture network value. Adding engagement logs (email opens, app sessions, support tickets) lets the model predict timing and channel. Adding household or account linkage data lets the model capture social influence. KumoRFM connects all these tables automatically through foreign key relationships. You do not need to build a graph database or write graph queries. You point it at your relational tables and it constructs the graph internally.

See it in action

KumoRFM delivers predictions on relational data in seconds. No feature engineering, no ML pipelines. Try it free.