In 2024, a $200M SaaS company's churn model had 92% accuracy. Their retention team was celebrating. Then they lost 15% of their ARR in a single quarter.
What happened? Their model was 92% accurate at predicting that non-churners wouldn't churn. It caught exactly zero of the customers who actually left. The dataset was 95% non-churners, so a model that just output "no churn" for every single row would have scored 95%. Their supposedly smart model was actually performing worse than the dumbest possible baseline.
This is not an edge case. It is the default outcome when you build churn models the way most tutorials teach you. And it is just one of the many ways churn prediction goes wrong in practice.
This guide covers everything that actually matters: the algorithms worth considering, the metrics that do not lie to you, 10 concrete methods to improve accuracy (with real numbers on each), and the fundamental shift in data architecture that separates models stuck at 70% from models that break 80%. No hand-waving. No "it depends" without telling you what it depends on.
The 5 types of churn (and why the definition matters more than the model)
Before you write a single line of code, you need to answer one question: what does "churn" mean for your business? Get this wrong, and your model will be technically correct and practically useless.
types_of_churn
| churn_type | what_it_means | example | how_to_detect_it |
|---|---|---|---|
| Voluntary | Customer actively decides to leave | User hits the cancel button | Behavioral signals: declining usage, support complaints, competitor research |
| Involuntary | Customer leaves due to payment failure | Credit card expires, 3 dunning attempts fail | Billing data: failed charges, expired cards, declined transactions |
| Revenue (MRR churn) | Dollar value of lost recurring revenue | Lost $5K of $50K MRR = 10% revenue churn | Plan downgrades, seat reductions, usage-based billing declines |
| Logo (customer churn) | Headcount of customers who left | 10 of 100 customers canceled = 10% logo churn | Binary classification: will this customer leave yes/no |
| Silent | Customer stops engaging but never formally cancels | E-commerce customer hasn't purchased in 6 months | Recency/frequency thresholds (e.g., no activity in 90 days) |
Voluntary churn is the most common ML target because it has the richest behavioral signals and the highest potential for intervention. Silent churn is the hardest to catch.
Silent churn is the carbon monoxide of SaaS. By the time you detect it, the customer is already gone. They stopped logging in three months ago but their annual contract auto-renewed, so they don't show up in your cancellation data. Your churn rate looks fine. Your NPS is dropping. And next renewal cycle, they are gone for good.
For subscription businesses, churn is easy to define: the customer canceled. For non-contractual businesses (e-commerce, marketplaces, freemium), you have to draw a line. "No purchase in 90 days" is churn. "No login in 30 days" is churn. Pick the wrong threshold and your model either cries wolf (too short) or shows up too late to help (too long).
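For a non-contractual business, that threshold decision translates directly into a labeling rule. A minimal sketch in pandas, assuming a hypothetical order-history table (column and table names are illustrative, not from any specific schema):

```python
import pandas as pd

# Hypothetical order history for a non-contractual business.
orders = pd.DataFrame({
    "customer_id": ["A", "A", "B", "C"],
    "order_date": pd.to_datetime(
        ["2024-01-05", "2024-05-20", "2024-02-10", "2024-06-01"]
    ),
})

as_of = pd.Timestamp("2024-06-15")   # the date you are labeling from
threshold_days = 90                  # the line you chose to draw

# "Churned" = no purchase within the threshold window.
last_order = orders.groupby("customer_id")["order_date"].max()
days_since = (as_of - last_order).dt.days
labels = (days_since > threshold_days).rename("churned")
```

Changing `threshold_days` is the cheapest experiment you can run: re-label, re-train, and see which definition produces interventions that actually land in time.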
The 7 churn prediction algorithms, honestly compared
Every churn prediction tutorial starts with algorithms, so let's get this out of the way. Here's the truth: the algorithm matters less than you think. The difference between a well-tuned logistic regression and a well-tuned XGBoost on the same features is maybe 5-8 AUROC points. The difference between bad features and good features on the same algorithm is 15-25 points.
That said, you need to pick one. Here's the honest rundown.
churn_prediction_algorithms_compared
| algorithm | the_honest_take | typical_AUROC | when_to_use_it | when_to_skip_it |
|---|---|---|---|---|
| Logistic Regression | The Honda Civic of ML. Reliable, interpretable, gets you from A to B. Won't win any races. | 65-72% | Regulated industries, quick baselines, when you need to explain every coefficient to compliance | When you have strong non-linear interactions in your data |
| Decision Trees | Great for generating business rules your CS team can actually act on. Terrible as a standalone predictor. | 60-70% | When you need 'if X and Y then churn' rules for playbooks | Production scoring. Always use an ensemble instead. |
| Random Forest | The reliable mid-tier option. Hard to screw up, rarely the best, never embarrassing. | 70-78% | When you want something better than logistic regression without tuning 47 hyperparameters | When you have the time to tune XGBoost properly |
| XGBoost / LightGBM | The workhorse. If you can only pick one algorithm for a flat table, pick this. It wins Kaggle competitions for a reason. | 72-82% | Production churn models on flat feature tables. This is the default. | When interpretability is a hard requirement (use logistic regression instead) |
| Neural Networks (MLP/LSTM) | MLPs rarely beat XGBoost on tabular data. LSTMs can model engagement trajectories that aggregates miss. | 70-80% | When you have rich sequential data (clickstreams, session logs) and enough volume (100K+ users) | Small datasets. Tabular data without temporal sequences. |
| Survival Analysis | Answers 'when will they churn?' not just 'will they churn?' Underrated for contract businesses. | 65-75% | Subscription businesses with varied contract lengths, when timing matters for intervention | Non-contractual businesses where churn is binary |
| Graph Neural Networks | The new kid that actually delivers. Sees what other models can't: the relationships between entities. | 75-88% | Multi-table relational data, social/network churn, when flat-table models have plateaued | Single-table problems with no relational structure |
The takeaways: XGBoost is the current standard for flat tables, and GNNs achieve higher AUROC by reading relational data that flat-table models cannot access.
Notice the AUROC ranges overlap. A well-featured logistic regression can beat a poorly-featured XGBoost. The algorithm is not the magic. The features are.
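If you want to sanity-check the "features beat algorithms" claim on your own data, the comparison harness is small. A sketch on synthetic data (the real work is swapping in your feature table):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a churn feature table (~5% positive class).
X, y = make_classification(
    n_samples=5000, n_features=20, n_informative=8,
    weights=[0.95], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
gbm = GradientBoostingClassifier(random_state=42).fit(X_tr, y_tr)

auc_lr = roc_auc_score(y_te, lr.predict_proba(X_te)[:, 1])
auc_gbm = roc_auc_score(y_te, gbm.predict_proba(X_te)[:, 1])
```

Run this twice: once with your full feature set, once with demographics only. The gap between those two runs is almost always larger than the gap between the two algorithms.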
Metrics that actually matter (and the one that lies to your face)
Remember our 92%-accurate model from the opening? It was doing exactly what the accuracy metric rewarded: getting the majority class right. On imbalanced data, accuracy is a con artist. Here's what to use instead.
churn_model_evaluation_metrics
| metric | what_it_measures | the_analogy | when_to_use_it | watch_out_for |
|---|---|---|---|---|
| Accuracy | % of all predictions that were correct | Like grading a spam filter by counting how many non-spam emails it let through. Technically correct, completely useless. | Never as your primary metric on imbalanced data | A model predicting 'no churn' for everyone scores 95% on a 5% churn dataset |
| Precision | Of customers we flagged, how many actually churned? | A sniper rifle. When you fire, you hit. Low false alarm rate. | When retention offers cost real money ($500+ per customer) | You can get 100% precision by only flagging the single most obvious churner |
| Recall | Of customers who actually churned, how many did we catch? | A dragnet. You catch everything, even some fish you didn't want. | When losing a customer costs $50K+ in ARR and a false alarm costs a $5 email | You can get 100% recall by flagging literally everyone |
| F1 Score | Harmonic mean of precision and recall | The negotiator between the sniper and the dragnet. Balances both. | When you want one number that penalizes lopsided models | Doesn't account for the relative cost of false positives vs. false negatives |
| AUC-ROC | How well the model ranks churners above non-churners across all thresholds | Your model's GPA. 50 is an F (random guessing). 70 is a C+. 80 is a B+. 90+ is Dean's List. | Comparing models, reporting to leadership, general-purpose evaluation | Can look optimistic on severely imbalanced data (>97% non-churners) |
| PR-AUC | Precision-Recall tradeoff across all thresholds | Like AUC-ROC but focused only on the minority class. Ignores the easy 'not churn' predictions entirely. | Highly imbalanced data (>95% non-churners). The most honest metric. | Harder to interpret. A 'good' PR-AUC depends heavily on the base churn rate. |
| Lift at k% | How many more churners the model finds in the top k% vs. random | If your model has 5x lift at 10%, the top 10% of scores contains 5x more churners than a random 10% would. | When you have a fixed intervention budget (e.g., 'we can call 200 customers this month') | Only measures performance at one operating point, not across the full spectrum |
AUC-ROC is the standard comparison metric. PR-AUC is more honest for imbalanced data. Lift at k% maps most directly to business impact. Never use accuracy alone.
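The metrics that matter are a few lines of scikit-learn each; lift at k% you compute yourself. A sketch on synthetic scores (sklearn's `average_precision_score` is the standard PR-AUC estimate):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = (rng.random(2000) < 0.05).astype(int)      # ~5% churners
scores = 0.6 * y_true + rng.random(2000) * 0.8      # label signal plus noise

auroc = roc_auc_score(y_true, scores)
pr_auc = average_precision_score(y_true, scores)    # PR-AUC

def lift_at_k(y, s, k=0.10):
    """Churn rate in the top k% of scores divided by the base rate."""
    n_top = max(1, int(len(s) * k))
    top = np.argsort(s)[::-1][:n_top]
    return y[top].mean() / y.mean()

lift10 = lift_at_k(y_true, scores, 0.10)
```

Note how PR-AUC lands far below AUROC on the same scores: it is graded against the ~5% base rate, not against a coin flip, which is why it is the harder metric to flatter.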
Choosing the right metric for your business
The metric you optimize determines the model you get. Here's the decision framework:
which_metric_for_which_scenario
| your_situation | optimize_for | why | threshold_strategy |
|---|---|---|---|
| Retention offer costs $500+ per customer (dedicated CSM, big discount) | Precision | Each false positive wastes real money. Be sure before you spend. | High threshold (0.7-0.8). Only flag high-confidence churners. |
| Retention action costs ~$0 (email, in-app nudge, small coupon) | Recall | False positives cost nothing. Missing a real churner costs $50K ARR. | Low threshold (0.3-0.4). Cast a wide net. |
| Tiered interventions (different actions at different risk levels) | AUC-ROC + calibrated probabilities | You need accurate ranking across the full risk spectrum. | Multiple tiers: >0.8 = CSM call, 0.5-0.8 = targeted email, 0.3-0.5 = in-app nudge |
| Reporting model quality to leadership | AUC-ROC + lift at 10% | AUC-ROC for comparability. Lift for 'so what does this mean in dollars?' | Report both. AUC-ROC for the data team. Lift for the exec summary. |
| Severely imbalanced data (>97% non-churners) | PR-AUC | AUC-ROC will flatter your model. PR-AUC tells the truth. | Use PR-AUC for model selection. Report AUC-ROC alongside for context. |
There is no single best metric. The right choice depends on what a false positive and a false negative cost your business.
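You can make that choice explicit by pricing out both error types and picking the threshold that minimizes total expected cost. A sketch with made-up costs on synthetic scores:

```python
import numpy as np

def expected_cost(y_true, probs, threshold, fp_cost, fn_cost):
    """Total cost at a threshold: wasted offers plus missed churners."""
    flagged = probs >= threshold
    false_positives = np.sum(flagged & (y_true == 0))
    false_negatives = np.sum(~flagged & (y_true == 1))
    return false_positives * fp_cost + false_negatives * fn_cost

rng = np.random.default_rng(1)
y = (rng.random(1000) < 0.05).astype(int)
probs = np.clip(0.5 * y + rng.random(1000) * 0.6, 0, 1)  # fake calibrated scores

# Illustrative costs: a $500 retention offer vs. a $5,000 hit per missed churner.
thresholds = np.linspace(0.05, 0.95, 19)
costs = [expected_cost(y, probs, t, fp_cost=500, fn_cost=5000) for t in thresholds]
best_threshold = float(thresholds[int(np.argmin(costs))])
```

Swap in your real offer cost and your real cost-of-churn and the "high threshold vs. low threshold" debate resolves itself numerically. This only works on calibrated probabilities (see method 6).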
10 proven methods to improve churn prediction accuracy
These are ordered from quickest wins to the most transformative changes. Methods 1-9 work within the flat-table paradigm. Method 10 changes the paradigm entirely.
1. Use behavioral features, not demographics
Predicting churn from demographics (age, location, company size) is like predicting whether someone will quit their job based on their shoe size. There might be a weak correlation buried in there somewhere, but you are missing everything that matters.
Behavioral features are where the signal lives: logins_last_7d, features_used_last_30d, avg_session_duration_trend, days_since_last_key_action. These will outperform industry, company_size, and region in virtually every churn model.
Typical improvement: 5-10 AUROC points over demographics-only baselines. This is usually the single biggest jump you will see from a single change.
2. Add time windows (the most underrated technique)
A single aggregate like total_logins is a photograph. Time-windowed features are a movie. Compute every behavioral metric at 7-day, 30-day, and 90-day windows. Then compute the ratios between them.
If logins_7d / logins_30d > 0.5, the customer is accelerating. If logins_7d / logins_30d < 0.1, they are fading. That ratio contains more signal than either raw count alone.
Typical improvement: 3-5 AUROC points over static aggregates. Cheap to implement, high return. If you are not doing this, do this first.
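Here is what windowed counts and the trend ratio look like in pandas, assuming a hypothetical event-level login log:

```python
import pandas as pd

# Hypothetical login events: A logs in daily, B faded out two months ago.
events = pd.DataFrame({
    "customer_id": ["A"] * 10 + ["B"] * 6,
    "ts": pd.to_datetime(
        ["2024-06-%02d" % d for d in range(1, 11)]
        + ["2024-04-%02d" % d for d in [1, 5, 9, 13, 17, 21]]
    ),
})
as_of = pd.Timestamp("2024-06-10")

def window_count(df, days):
    cutoff = as_of - pd.Timedelta(days=days)
    recent = df[(df["ts"] > cutoff) & (df["ts"] <= as_of)]
    return recent.groupby("customer_id").size()

customers = events["customer_id"].unique()
feats = pd.DataFrame({
    "logins_7d": window_count(events, 7).reindex(customers, fill_value=0),
    "logins_30d": window_count(events, 30).reindex(customers, fill_value=0),
})

# The trend ratio: >0.5 accelerating, <0.1 fading (per the rule of thumb above).
# Denominator floor avoids divide-by-zero for fully inactive customers.
feats["trend_7_over_30"] = feats["logins_7d"] / feats["logins_30d"].clip(lower=1)
```

Add a 90-day column the same way, then build the 30/90 ratio too: the two ratios together distinguish "slow fade" from "sudden stop."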
3. Handle class imbalance (or your model will cheat)
Churn datasets are typically 3-10% positive class. Without intervention, your model will learn the path of least resistance: predict "no churn" for everyone and collect a 93% accuracy trophy.
class_imbalance_techniques
| technique | how_it_works | when_to_use_it | code_hint |
|---|---|---|---|
| Class weights | Penalizes the model more for missing churners | First thing to try. Always. | class_weight='balanced' (sklearn) or scale_pos_weight=19 (XGBoost, for 5% churn rate) |
| SMOTE | Generates synthetic minority samples by interpolating between existing churners | When class weights alone are not enough. 5-15% positive class. | from imblearn.over_sampling import SMOTE |
| ADASYN | Like SMOTE but focuses on harder-to-classify boundary regions | When SMOTE underperforms. Complex decision boundaries. | from imblearn.over_sampling import ADASYN |
| Undersampling | Randomly removes majority class samples | Very large datasets where training time is a bottleneck | from imblearn.under_sampling import RandomUnderSampler |
Start with class weights. They require zero data modification and work with any algorithm. Move to SMOTE only if weights are insufficient.
Typical improvement: 2-8 AUROC points, with the largest gains on severely imbalanced datasets (97%+ non-churners). The improvement shows up in recall and F1, not accuracy.
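Class weights in practice, showing the recall difference on a synthetic 95/5 dataset (logistic regression for brevity; the XGBoost equivalent is `scale_pos_weight` as noted in the table):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic 95/5 churn dataset.
X, y = make_classification(
    n_samples=4000, n_features=12, n_informative=6,
    weights=[0.95], random_state=7,
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=7)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# class_weight='balanced' penalizes a missed churner ~19x harder at 5% churn.
weighted = LogisticRegression(
    max_iter=1000, class_weight="balanced"
).fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))
```

Expect the weighted model's recall to jump and its precision to drop; that trade is the point. Only reach for SMOTE/ADASYN if this is still not enough.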
4. Engineer interaction features
Individual features capture signals in isolation. Interaction features capture how signals talk to each other. Three types consistently improve churn models:
- Ratios: support_tickets / total_orders (support burden per transaction), refunds / purchases (refund rate)
- Trends: usage_this_month / usage_last_month (MoM change), logins_7d / logins_30d * 4.3 (weekly pace relative to the monthly average; steady usage lands near 1)
- Deltas: current_plan_price - avg_price_paid_historically (catches recent upgrades and downgrades)
Typical improvement: 1-3 AUROC points. Diminishing returns after 10-15 good interaction features.
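A sketch of all three interaction types in pandas, with a denominator floor to guard against divide-by-zero (the column names are hypothetical):

```python
import pandas as pd

# Hypothetical per-customer aggregates.
df = pd.DataFrame({
    "support_tickets": [3, 0, 8],
    "total_orders": [30, 0, 10],
    "usage_this_month": [40, 5, 0],
    "usage_last_month": [50, 50, 20],
    "current_plan_price": [99, 49, 199],
    "avg_price_paid_historically": [99, 99, 149],
})

# Ratios: floor the denominator at 1 so zero-order customers don't explode.
df["ticket_rate"] = df["support_tickets"] / df["total_orders"].clip(lower=1)
# Trends: month-over-month change.
df["usage_mom"] = df["usage_this_month"] / df["usage_last_month"].clip(lower=1)
# Deltas: recent plan changes relative to history.
df["price_delta"] = df["current_plan_price"] - df["avg_price_paid_historically"]
```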
5. Stack your models
Ensemble methods are the closest thing ML has to a free lunch. Train XGBoost, LightGBM, and logistic regression independently. Feed their outputs into a meta-learner (usually logistic regression). Each model captures different patterns. The meta-learner figures out when to trust each one.
Typical improvement: 1-3 AUROC points over the best single model. Almost always worth it in production. The cost is inference latency, not accuracy.
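Scikit-learn's StackingClassifier handles the meta-learner plumbing, including out-of-fold base predictions. A sketch substituting random forest for the gradient-boosted base learner so it runs without xgboost installed; the structure is the same:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=3000, n_features=15, n_informative=6,
    weights=[0.9], random_state=3,
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=3)

# Base learners capture different patterns; the logistic meta-learner
# is trained on their out-of-fold predicted probabilities (cv=3).
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=3)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",
    cv=3,
)
stack.fit(X_tr, y_tr)
auc_stack = roc_auc_score(y_te, stack.predict_proba(X_te)[:, 1])
```

In production, swap the estimators list for your tuned XGBoost and LightGBM models.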
6. Calibrate your probabilities
Most models produce scores, not probabilities. An XGBoost output of 0.8 does not mean an 80% chance of churn; it only means this customer ranks above one who scored 0.6. If you use these scores for expected value calculations ("this customer has 70% churn probability and $10K annual value, so expected loss is $7K"), you will make bad decisions.
Apply Platt scaling or isotonic regression on a held-out validation set. Now your 0.8 actually means 80%.
Typical improvement: 0 AUROC points (calibration does not change ranking). But it can dramatically improve the quality of business decisions downstream.
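In scikit-learn, CalibratedClassifierCV wraps any estimator with isotonic regression or Platt scaling (method='sigmoid'). A sketch on synthetic data; the Brier score (mean squared error of predicted probabilities) is the standard way to check whether calibration helped:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=4000, n_features=12, n_informative=6,
    weights=[0.9], random_state=5,
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=5)

raw = GradientBoostingClassifier(random_state=5).fit(X_tr, y_tr)
# Isotonic regression fitted on internal cross-validation folds, so the
# calibration map is never learned on the same data the model trained on.
calibrated = CalibratedClassifierCV(
    GradientBoostingClassifier(random_state=5), method="isotonic", cv=3
).fit(X_tr, y_tr)

brier_raw = brier_score_loss(y_te, raw.predict_proba(X_te)[:, 1])
brier_cal = brier_score_loss(y_te, calibrated.predict_proba(X_te)[:, 1])
```

Rule of thumb: isotonic needs more data (roughly 1,000+ samples in the calibration folds); below that, prefer Platt scaling.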
7. Add RFM features (old-school, still deadly)
Recency, Frequency, Monetary. This framework predates machine learning by decades, and it still shows up in the top 10 feature importances of nearly every churn model. Days since last purchase. Transactions per month. Average order value. Simple, powerful, and often overlooked by teams chasing fancier features.
Typical improvement: 2-4 AUROC points if you did not already have these. Often zero incremental lift if you already have good behavioral features (RFM is a subset of behavioral).
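All three RFM features fall out of a single groupby over a transactions table (hypothetical schema):

```python
import pandas as pd

# Hypothetical transactions table.
tx = pd.DataFrame({
    "customer_id": ["A", "A", "A", "B"],
    "ts": pd.to_datetime(["2024-03-01", "2024-05-01", "2024-06-01", "2024-01-15"]),
    "amount": [120.0, 80.0, 100.0, 40.0],
})
as_of = pd.Timestamp("2024-06-15")

rfm = tx.groupby("customer_id").agg(
    recency_days=("ts", lambda s: (as_of - s.max()).days),  # Recency
    frequency=("ts", "size"),                               # Frequency
    monetary=("amount", "mean"),                            # Monetary
)
```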
8. Bring in support and NPS data
Support interactions are among the strongest churn predictors, and they are almost always in a different table from your user activity data. Ticket count in the last 30 days, average resolution time, CSAT scores, whether any ticket was escalated, NPS responses.
A customer who filed 3 tickets in the last week with an average CSAT of 2/5 has a dramatically higher churn probability than their login data alone suggests. The information is there. Most teams just don't join it in.
Typical improvement: 2-5 AUROC points. Higher if you are in a support-heavy business (B2B SaaS, telecom).
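The join itself is the easy part, which makes skipping it even less defensible. A sketch with hypothetical activity and ticket tables:

```python
import pandas as pd

# Hypothetical tables: product activity and last-30-day support tickets.
activity = pd.DataFrame({"customer_id": ["A", "B"], "logins_30d": [20, 18]})
tickets = pd.DataFrame({
    "customer_id": ["A", "A", "A"],
    "csat": [2, 1, 3],
    "escalated": [True, False, False],
})

support = tickets.groupby("customer_id").agg(
    tickets_30d=("csat", "size"),
    avg_csat=("csat", "mean"),
    any_escalation=("escalated", "any"),
)
# Left join: customers with no tickets keep their rows; counts default to 0.
features = activity.merge(support, on="customer_id", how="left")
features["tickets_30d"] = features["tickets_30d"].fillna(0).astype(int)
```

Customer A's login count alone looks healthy; the joined features tell the real story.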
9. Use cohort-based features
Raw numbers lie without context. A customer logging in 3 times per week sounds healthy until you learn that the average for their cohort (same industry, same plan, same tenure) is 12 times per week. Now that same customer is at the 25th percentile.
Compute percentile ranks within cohorts for your key behavioral features. login_frequency_pctile_in_cohort carries more signal than login_frequency alone.
Typical improvement: 1-3 AUROC points. The value is highest when your customer base is heterogeneous (different industries, plan tiers, use cases).
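Cohort percentiles are one groupby-rank in pandas:

```python
import pandas as pd

# Hypothetical customers in two cohorts.
df = pd.DataFrame({
    "customer_id": ["A", "B", "C", "D", "E", "F"],
    "cohort": ["smb", "smb", "smb", "ent", "ent", "ent"],
    "login_frequency": [3, 12, 9, 3, 2, 1],
})

# Same raw number, different meaning: 3 logins/week is the weakest in the
# smb cohort here but the strongest in the ent cohort.
df["login_freq_pctile_in_cohort"] = (
    df.groupby("cohort")["login_frequency"].rank(pct=True)
)
```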
10. Connect your tables (the paradigm shift)
Methods 1-9 are like trying to understand a movie by looking at screenshots. Method 10 is watching the whole movie. The plot - the connections between characters, the sequence of events, the subplots - only makes sense when you see the full picture.
Predicting churn from a flat feature table is like diagnosing a patient by only looking at their height and weight. You will catch the obvious cases, but miss everything that actually matters: their blood work, their family history, what medications their doctor prescribed, whether their spouse just got diagnosed with the same condition.
All the methods above work within a single flat table where each customer is one row. No matter how clever your feature engineering, that table cannot represent the relationships between customers, products, support interactions, and billing events as they exist in your database.
Relational and graph-based approaches remove this constraint. They read the connected table structure directly, which unlocks three categories of signal:
- Social churn: When a customer's peers churn, their own risk spikes. This signal lives in the connections between users and is completely invisible in a flat table.
- Product quality propagation: A batch of products with high return rates raises the churn risk of every customer who bought from that batch, even if their individual behavior looks fine.
- Multi-hop patterns: Customer contacted support about a product that other customers also complained about, and those customers churned. Three hops. Invisible in a flat table. Obvious in a graph.
Typical improvement: 15-26% relative AUROC gain over flat-table models on the same data. This is not incremental. This is a step function.
The relational advantage: why connected data changes everything
Traditional churn models look at each customer through a keyhole. Relational models knock down the wall.
Here is what that means concretely.
Social churn: the signal your flat table cannot see
Bob is a gym member. His visit frequency dropped 68% this month and he downgraded from premium to basic. A flat-table model sees these two data points and assigns moderate risk. Maybe 55% churn probability.
But in the graph, there is a signal the flat table literally cannot represent: 2 of Bob's 3 regular workout partners canceled their memberships in the last 2 weeks. And the instructor for Bob's favorite Tuesday class left the gym last month. 40% of that class's regulars have since churned.
Each signal alone is ambiguous. Together, in the graph, they converge to 82% churn probability. The social signal - his network is dissolving - is the strongest predictor. And it exists only in the relationships between records. No amount of feature engineering on a flat customer table will find it.
The backward window technique (focus on saveable customers)
One of the most common mistakes in churn prediction is including customers who have already effectively churned but have not formally canceled. Your model learns to detect inactivity (easy) instead of predicting future churn (hard). It inflates your metrics and deflates your impact.
The fix: filter predictions to only include customers who were active within a defined recent window. Only predict churn for customers you can still save.
PQL Query
PREDICT churn_30d FOR EACH customers.customer_id WHERE customers.last_active > now() - 60d
This query predicts 30-day churn only for customers active in the last 60 days. The backward window filter eliminates noise from already-gone customers and focuses the model on the population where intervention can actually change the outcome.
Output
| customer_id | churn_probability | top_driver | recommended_action |
|---|---|---|---|
| C-4501 | 0.82 | 2 of 3 connected users churned last week | CSM outreach with retention offer |
| C-4502 | 0.71 | Support ticket escalated, unresolved 5 days | Priority support resolution |
| C-4503 | 0.45 | Usage declined 40% month-over-month | Targeted re-engagement email |
| C-4504 | 0.12 | Expanding usage, added 2 team members | No action (healthy) |
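Outside of PQL, the same backward-window filter is a one-liner in pandas (hypothetical customers table):

```python
import pandas as pd

# Hypothetical customers table with a last-activity timestamp.
customers = pd.DataFrame({
    "customer_id": ["C-4501", "C-4502", "C-4509"],
    "last_active": pd.to_datetime(["2024-06-01", "2024-05-20", "2024-01-10"]),
})
now = pd.Timestamp("2024-06-15")

# Only score customers active in the last 60 days; the rest already left.
saveable = customers[customers["last_active"] > now - pd.Timedelta(days=60)]
```

Apply the same filter when building training labels, not just at scoring time, or the model still learns to detect zombies.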
The benchmark: flat features vs. relational approach
The RelBench benchmark provides standardized comparisons on real-world relational datasets. On the H&M retail churn prediction task:
relbench_hm_churn_benchmark
| approach | AUROC | feature_engineering_required | what_it_captures |
|---|---|---|---|
| LightGBM + manual features | 55.21% | Yes (extensive joins, aggregations, time windows) | Static aggregates from flattened tables |
| Relational approach | 69.88% | No (reads raw relational tables directly) | Multi-table patterns, temporal dynamics, entity relationships |
The relational approach achieves 69.88% vs 55.21%. That 14.67-point gap is not a better algorithm. It is richer data.
Let that number sink in. 14.67 percentage points. On the same underlying data. The only difference is that one approach flattened it into a single table and the other read the connected structure. No algorithm swap, no hyperparameter tuning, no ensemble trick comes close to that gap.
Flat-table churn model
- One row per customer with aggregated features
- Requires manual SQL joins and feature engineering
- Cannot see social churn or peer behavior signals
- Cannot capture multi-hop patterns (customer > product > other customers)
- Typical AUROC: 55-78% depending on feature quality
Relational churn model
- Reads all connected tables directly as a graph
- No manual feature engineering required
- Captures social churn and network effects natively
- Discovers multi-hop patterns through message passing
- Typical AUROC: 70-88% on relational data
Churn prediction tools: an honest comparison
The right tool depends on your team, your data, and your budget. Not everything needs a graph neural network. Sometimes a spreadsheet with last-login dates and a motivated CS manager will outperform a million-dollar ML platform. Here's the honest breakdown.
churn_prediction_tools_compared
| tool | type | price | best_for | honest_limitation |
|---|---|---|---|---|
| scikit-learn | Open-source library | Free | Prototyping, learning, baselines. Start here if you are building your first model. | You build and maintain everything yourself. No pipeline, no monitoring, no deployment. |
| XGBoost / LightGBM | Open-source library | Free | Production flat-table models. The accuracy standard for tabular data. | Still needs a feature table. Feature engineering is your problem. |
| H2O.ai | AutoML platform | Free (OSS) / Enterprise | Automated model selection when you have a feature table but limited ML headcount. | Does not automate feature engineering. You still build the flat table. |
| DataRobot | AutoML platform | Enterprise pricing | ML lifecycle management with governance. Large enterprises with compliance needs. | Expensive. Automates model selection, not the feature engineering that takes 80% of time. |
| Mixpanel / Amplitude | Product analytics | Free tier / $25+/mo | Behavioral tracking and cohort analysis. Excellent as a data source for churn features. | Not a prediction tool. You still need to build or buy the ML layer. |
| ChurnZero | Customer success platform | Enterprise pricing | CS workflows, health scoring, churn alerts. Built for CS teams, not data scientists. | Rule-based, not ML-based. Limited to the health score signals you configure manually. |
| Gainsight | Customer success platform | Enterprise pricing | Health scoring, playbooks, renewal management at scale. | Same limitation as ChurnZero: health scores are rule-based heuristics, not ML predictions. |
| Kumo.ai | Relational foundation model | Free tier / Enterprise | Multi-table predictions without feature engineering. Reads relational data natively. | Requires relational data. If your data is already a single clean table, XGBoost is simpler. |
The takeaway: Kumo.ai is the only tool in this list that reads relational data natively, eliminating feature engineering. But if your data is a single flat table, XGBoost is the pragmatic choice.
Picking the right tool for your situation
- No data science team, need churn alerts now: ChurnZero or Gainsight. Configure health scores based on login recency and support ticket volume. Not ML, but effective for the obvious cases.
- Data scientist available, flat feature table ready: XGBoost or LightGBM for maximum control. H2O or DataRobot if you want automated model selection on top.
- Multi-table data, want maximum accuracy without months of feature engineering: Kumo.ai reads your relational database directly and captures cross-table signals that flat-table tools miss.
- Starting from scratch with no event tracking: Mixpanel or Amplitude first to instrument behavioral data. You cannot predict churn from data you do not have.
The 8 deadly sins of churn modeling
These mistakes are everywhere. I have seen each one in production models at real companies. Some of them are silent killers that make your metrics look great while your model does nothing useful.
1. The Accuracy Trap
The model scores 95% accuracy. The team celebrates. The model catches zero churners. This is the most common mistake in churn prediction and we have already beaten it to death in this guide, so just remember: if your dataset is 95% non-churners, a model that prints "no churn" for every row gets 95%. Use AUC-ROC, F1, or PR-AUC.
2. The Zombie Problem
Your training data includes customers who stopped using the product 6 months ago but never formally canceled. They are zombies: dead but still on your customer list. The model learns to detect inactivity (trivially easy) and reports stellar metrics. In production, it flags the zombies your CS team already knows about and misses the at-risk customers who are still active. Fix: backward window filter. Only score customers active in the last 60 days.
3. Time Traveling
Temporal leakage. You used total_orders_in_churn_month as a feature, which means you used future data to predict the past. Your model looks incredible in your notebook and falls apart in production because it no longer has access to a time machine. Fix: always split by time, not randomly. Every feature must be computable at the prediction timestamp using only past data.
4. The Random Split Delusion
You split train/test randomly, so your January churn behavior leaks into predicting February churn through autocorrelation. The model memorizes temporal patterns it will not have access to in production. Fix: chronological split. Train on months 1-6, validate on month 7, test on month 8.
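The chronological split in that fix is trivially simple, which makes the random-split mistake even less excusable. A sketch assuming a labeled snapshot table with a month column:

```python
import pandas as pd

# One labeled snapshot per customer-month (hypothetical minimal table).
df = pd.DataFrame({
    "customer_id": range(8),
    "month": ["2024-01", "2024-02", "2024-03", "2024-04",
              "2024-05", "2024-06", "2024-07", "2024-08"],
    "churned": [0, 0, 1, 0, 0, 1, 0, 1],
})

# Chronological split: train on months 1-6, validate on 7, test on 8.
# Zero-padded month strings sort correctly as plain strings.
train = df[df["month"] <= "2024-06"]
valid = df[df["month"] == "2024-07"]
test = df[df["month"] == "2024-08"]
```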
5. The Uncalibrated Confidence
Your model says 0.8 churn probability. You tell the CFO this customer has an 80% chance of leaving and the expected revenue loss is $40K. But 0.8 is a rank score, not a probability. The actual churn rate for customers scoring 0.8 might be 45%. You just nearly doubled the expected loss estimate. Fix: Platt scaling or isotonic regression on a holdout set.
6. The Lab-Only Model
A perfect churn model with no intervention is a very expensive way to watch customers leave with better visibility. The model identifies who will churn. The intervention determines whether they stay. A/B test your retention actions as rigorously as you evaluate your model. The best churn programs iterate on both simultaneously.
7. The Fossil Feature Table
The feature table was built 18 months ago. The product has shipped 4 major features since then. None of them are tracked. The model is optimizing on stale signals while the actual churn drivers have shifted. Fix: review and refresh your feature set quarterly. Add features for every major product change.
8. The Flat-Table Ceiling
Your team has spent 6 months tuning XGBoost, trying 200 feature combinations, running grid searches, stacking ensembles. AUROC went from 72% to 76%. They are stuck. The problem is not the algorithm or the features. The problem is that the flat table cannot represent the signals that would push past 80%: social churn, multi-hop patterns, product-quality propagation. The next improvement requires a different architecture, not more tuning.