Netflix estimates their recommendation engine is worth $1B per year in retained subscribers. Amazon attributes 35% of revenue to recommendations. Spotify's Discover Weekly has 40M+ listeners. These companies didn't get there with collaborative filtering. They got there despite it.
The dirty secret of recommendations is that the approach most tutorials teach - user-item matrix factorization - has been the bottleneck for a decade. It cannot handle new users. It cannot handle new items. It cannot explain its reasoning. And it treats every interaction the same whether someone spent 3 seconds or 3 hours with a product.
Here's what actually works, what doesn't, and what the gap between tutorial-grade and production-grade recommendation systems looks like in practice.
Why recommendations are harder than they seem
Building a recommendation engine is like being a concierge at a hotel with 10 million rooms and 100 million guests, where half the guests have never stayed before and new rooms open every hour. You need to match the right guest to the right room instantly, with incomplete information, while the rooms and guests keep changing underneath you.
Four problems make this harder than most ML tasks.
The cold-start problem
A new user signs up. They have zero purchase history, zero ratings, zero browsing data. Your collaborative filtering model needs interaction data to work. It has none. So it defaults to recommending bestsellers, which is the same experience the user would get without your recommendation system at all.
The same problem hits new items. You add 500 products to your catalog today. None of them have been purchased or reviewed. Your model cannot recommend them because no user has interacted with them. They sit invisible until enough people stumble onto them organically, which is exactly the problem recommendations were supposed to solve.
The popularity bias
Most recommendation systems have a rich-get-richer problem. Popular items get recommended. Recommended items get more clicks. More clicks make them more popular. The top 1% of your catalog gets 50% of the recommendations. The long tail - where most of the unique value lives - stays buried.
This is not just an aesthetic concern. If your system only recommends items users would have found anyway, the incremental value of your recommendation engine is zero. You spent 6 months building a system that recommends Harry Potter to people who already know Harry Potter exists.
The filter bubble
A user buys three thrillers. Your model recommends more thrillers. They buy those. Your model recommends even more thrillers. Six months later, this user's entire experience is thrillers, and they quietly stop engaging because the platform feels stale. The model killed exploration to maximize short-term engagement, and the user churned.
The filter bubble is recommendation-induced churn. Your optimization metric (click-through rate) went up while the thing that actually matters (long-term retention) went down. You cannot see this in offline metrics. You can only see it in cohort-level retention analysis months later.
The scale problem
10 million users times 1 million items is 10 trillion possible pairs. You cannot score all of them. Even if scoring takes 1 microsecond per pair, that is 115 days of compute for a single recommendation refresh. Production systems solve this with a two-stage architecture: a fast retrieval stage that narrows millions of items to hundreds of candidates, followed by a precise ranking stage that orders those candidates. Getting this architecture right is often harder than getting the model right.
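The two-stage shape can be sketched in a few lines. This is a toy illustration, not a production retriever: `item_emb`, `user_emb`, and `expensive_rank` are hypothetical stand-ins, and the brute-force dot-product scan stands in for an approximate nearest-neighbor index (FAISS, ScaNN).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: assumed user/item embeddings (e.g. from a two-tower model).
n_items, dim = 10_000, 32
item_emb = rng.normal(size=(n_items, dim)).astype(np.float32)
user_emb = rng.normal(size=(dim,)).astype(np.float32)

# Stage 1 -- retrieval: a cheap dot-product scan narrows 10k items to 200
# candidates. In production this would be an ANN index; brute force here
# keeps the sketch self-contained.
scores = item_emb @ user_emb
candidates = np.argpartition(-scores, 200)[:200]

# Stage 2 -- ranking: an expensive model scores only the 200 candidates.
def expensive_rank(user, items):
    # Placeholder for a ranking model with cross features (hypothetical).
    return (item_emb[items] @ user) + 0.01 * item_emb[items].sum(axis=1)

ranked = candidates[np.argsort(-expensive_rank(user_emb, candidates))]
top_10 = ranked[:10]
print(top_10)
```

The key property: the expensive model only ever sees hundreds of items, so total latency stays bounded no matter how large the catalog grows.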
The 5 recommendation approaches, honestly compared
Every recommendation tutorial walks through these approaches like a menu. Here's what they do not tell you: the first three are historical artifacts that most production systems have moved past. They are worth understanding for context, not for implementation.
| approach | the_honest_take | handles_cold_start | handles_scale | best_for |
|---|---|---|---|---|
| Collaborative Filtering | The approach from 2006 that refuses to die. 'Users who liked X also liked Y.' Simple. Plateaus fast. | No. Completely breaks. | Struggles above 1M users | Quick baselines, dense interaction data (music, video) |
| Content-Based | Recommends similar items based on attributes. Works great until everything looks the same. | Partially. New items work if they have attributes. New users still break. | Moderate | Catalog-heavy domains with rich item metadata (articles, jobs) |
| Matrix Factorization | The math behind Netflix Prize. Elegant. Killed by cold-start. | No. New users and items have no learned embeddings. | Good with ALS | When you have dense, explicit ratings data and a stable catalog |
| Deep Learning (NCF, Two-Tower) | Handles scale. Handles features. Still treats each user-item pair independently. | Partially. Can incorporate content features for new items. | Yes. Built for it. | Large-scale systems with rich feature sets (e-commerce, ads) |
| Graph Neural Networks | Finally - a model that sees why you liked what you liked, not just that you liked it. | Yes. New items connect through attributes, categories, and brands. | Yes, with neighbor sampling | Multi-entity data with rich relationships (e-commerce, social, marketplaces) |
GNNs are the only approach that reasons over the full relationship structure. Each previous approach solves one limitation but introduces another.
The progression tells a story. Collaborative filtering said "similar users like similar items." Content-based said "similar items share similar attributes." Matrix factorization said "we can learn those similarities from data." Deep learning said "we can learn at scale." Graph neural networks said "we can learn at scale AND see the structure of why things are connected."
Each step solved a real problem. But only the last step stopped treating each user-item pair as an isolated data point and started seeing the full web of relationships.
Metrics that actually measure recommendation quality
The metrics for recommendation systems are more nuanced than for classification. You are not predicting yes/no. You are producing an ordered list, and the order matters as much as the contents. Precision@10 asks "of the 10 movies I recommended, how many did you watch?" NDCG asks "did I put the one you'd love most at the top of the list?"
| metric | what_it_measures | the_analogy | when_to_use_it | watch_out_for |
|---|---|---|---|---|
| Precision@K | Of top K recommendations, how many were relevant? | A chef serving a 5-course meal. Precision@5 asks how many courses you actually enjoyed. | When you have limited recommendation slots (email, homepage widget) | Ignores the order within the K items. Position 1 and position K count equally. |
| Recall@K | Of all relevant items, how many appeared in the top K? | A detective's case board. Recall@K asks what fraction of the suspects you have identified. | When users scroll or paginate through recommendations | Punishes users with many relevant items. Recall@10 when a user has hundreds of relevant items will always look low, no matter how good the ranking is. |
| MAP@K (Mean Average Precision) | Average of precision calculated at each relevant item's position | Grading a DJ's setlist. You get more credit for playing the bangers early, not buried at track 15. | The standard offline metric. Use this for model comparison. | Sensitive to the number of relevant items per user. Users with few interactions dominate. |
| NDCG (Normalized Discounted Cumulative Gain) | Rewards relevant items more when they appear higher in the list | A search engine result page. The best link at position 1 is worth far more than the same link at position 10. | When ranking order matters and you have graded relevance (not just binary) | Requires a relevance score per item, not just relevant/not relevant |
| Coverage | What percentage of your catalog gets recommended to at least one user? | A bookstore where only 3 shelves out of 50 get any foot traffic. | When you suspect popularity bias. Low coverage means your model ignores the long tail. | 100% coverage is not the goal. Recommending irrelevant items to boost coverage is worse. |
| Diversity | How different are the items within a single user's recommendation list? | A playlist that is 10 different songs versus 10 remixes of the same song. | When filter bubbles are a concern. Especially media, content, and e-commerce. | Diversity and relevance trade off. Maximizing diversity gives you random recommendations. |
MAP@K is the standard comparison metric. NDCG when order matters. Coverage and diversity to catch pathological systems that score well on accuracy but recommend the same 100 items to everyone.
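The four accuracy metrics above each fit in a few lines. Here is a reference sketch using binary relevance (graded relevance for NDCG); production evaluators also handle ties, missing items, and averaging across users.

```python
import math

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

def recall_at_k(recommended, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)

def average_precision_at_k(recommended, relevant, k):
    """Mean of precision@i over positions i where a relevant item appears.
    Averaging this across users gives MAP@K."""
    score, hits = 0.0, 0
    for i, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / i
    return score / min(len(relevant), k)

def ndcg_at_k(recommended, relevance, k):
    """DCG of the list, normalized by the DCG of the ideal ordering.
    `relevance` maps item -> graded relevance score."""
    dcg = sum(relevance.get(item, 0) / math.log2(i + 1)
              for i, item in enumerate(recommended[:k], start=1))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# A user watched B and D; we recommended [A, B, C, D, E].
recs, watched = ["A", "B", "C", "D", "E"], {"B", "D"}
print(precision_at_k(recs, watched, 5))   # 2 hits in 5 slots -> 0.4
print(recall_at_k(recs, watched, 5))      # both relevant items found -> 1.0
```

Note how the same recommendation list scores differently on each metric: precision penalizes the misses, recall rewards finding everything, and AP rewards putting the hits early.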
Why offline metrics lie (and what to do about it)
Here is the uncomfortable truth: a model can achieve the best MAP@K in your evaluation and still be useless in production. Offline evaluation only measures whether the model can predict what users did historically. It cannot measure whether users would have found those items anyway, whether the recommendations led to higher satisfaction, or whether the model is creating filter bubbles that hurt long-term retention.
The fix is A/B testing. Run your candidate model against your current system on live traffic and measure what matters: incremental revenue per user, long-term engagement, and catalog exploration breadth. If your new model has 20% better NDCG but zero incremental revenue lift, it is recommending items people would have bought regardless. That is not a recommendation system. That is a confirmation system.
8 proven methods to improve recommendation quality
These are ordered from tactical fixes to structural changes. Methods 1-7 optimize how your model handles user-item pairs. Method 8 changes what your model sees entirely.
1. Go hybrid (collaborative + content-based)
Pure collaborative filtering breaks on cold-start. Pure content-based filtering never surprises anyone because it only recommends similar items. Combine them. Use collaborative signals for users with rich interaction history and fall back to content-based for new users and new items.
The simplest hybrid: train a collaborative model and a content model independently, then blend their scores with a learned weight. The weight should vary by user: heavy on content for new users (sparse history), heavy on collaborative for established users (rich history).
Typical improvement: 10-15% lift in Recall@K over either approach alone. The biggest gains come from cold-start users who get real recommendations instead of bestseller defaults.
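A minimal version of that blend. The `saturation` constant, which controls how fast the collaborative weight ramps up with interaction history, is an assumption for illustration; tune it per domain.

```python
def blend_scores(collab_score, content_score, n_interactions, saturation=20):
    """Blend collaborative and content scores with a per-user weight.
    The collaborative weight grows with interaction count and saturates:
    new users lean on content, established users on collaborative signals."""
    w = min(n_interactions / saturation, 1.0)
    return w * collab_score + (1 - w) * content_score

# New user (2 interactions): mostly content-based.
print(blend_scores(collab_score=0.9, content_score=0.5, n_interactions=2))
# Established user (100 interactions): fully collaborative.
print(blend_scores(collab_score=0.9, content_score=0.5, n_interactions=100))
```

In practice you would learn the weighting function from data rather than hand-tuning it, but even this simple ramp beats a fixed 50/50 blend.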
2. Add implicit feedback (not just purchases)
Most recommendation systems are trained on purchases or ratings. This ignores 95% of user behavior. Clicks, time spent viewing, scroll depth, add-to-cart, save-for-later, repeat visits, and even search queries are all signals of interest. A user who spent 4 minutes reading a product description and then left is telling you something different from a user who bounced in 2 seconds.
Weight implicit signals by strength: purchase (1.0) > add-to-cart (0.7) > extended view (0.4) > click (0.2) > impression (0.05). The exact weights matter less than the hierarchy. Treating all interactions equally is the most common mistake.
Typical improvement: 15-25% lift in MAP@K over purchase-only training data. You already have this data. You are just not using it.
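As a sketch, the hierarchy can be encoded as a lookup table. The numbers below are the illustrative weights from above, not tuned values.

```python
# Illustrative weights -- the hierarchy matters more than the exact numbers.
EVENT_WEIGHTS = {
    "purchase": 1.0,
    "add_to_cart": 0.7,
    "extended_view": 0.4,   # e.g. dwell time above a threshold
    "click": 0.2,
    "impression": 0.05,
}

def interaction_strength(events):
    """Aggregate one user's events on one item into a single training weight.
    Capped at 1.0 so stacked weak events cannot exceed a purchase."""
    return min(sum(EVENT_WEIGHTS.get(e, 0.0) for e in events), 1.0)

# Two clicks plus an engaged view: real interest, but below a purchase.
print(interaction_strength(["click", "extended_view", "click"]))
```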
3. Handle cold-start with content features
When a new product launches, it has zero interactions but it has attributes: category, brand, price range, description text, images. Feed these into your model as side features. The model learns that users who bought Nike running shoes in the $120-150 range are likely interested in a new Nike running shoe at $135, even if no one has bought the new shoe yet.
For new users, use onboarding signals: what categories they browsed in their first session, what they searched for, what demographic bucket they fall into. Even 5 minutes of browsing behavior is enough to beat bestseller defaults.
Typical improvement: 30-50% lift in first-session recommendations vs. popularity-based defaults. The value compounds: better first-session recommendations lead to higher engagement, which generates more data, which improves future recommendations.
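A toy illustration of attribute-based scoring for a zero-interaction item. The four-dimensional attribute encoding and item names are invented for the example; a real system would use learned embeddings over category, brand, price, and text features.

```python
import math

def cosine(a, b):
    """Cosine similarity between two attribute vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical encoding: [is_nike, is_running, price / 200, is_sneaker]
new_shoe = [1.0, 1.0, 135 / 200, 1.0]   # brand-new item, zero interactions
catalog = {
    "nike_pegasus": [1.0, 1.0, 130 / 200, 1.0],
    "dress_shoe":   [0.0, 0.0, 180 / 200, 0.0],
}

# Score the new item against items a user already engaged with: the new shoe
# is recommendable from day one, purely through its attribute connections.
sims = {name: cosine(new_shoe, vec) for name, vec in catalog.items()}
best = max(sims, key=sims.get)
print(best)  # "nike_pegasus"
```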
4. Add contextual signals
The same user wants different things at different times. A grocery shopper at 7 AM on Monday is buying coffee and breakfast items, not planning a dinner party. A music listener at the gym wants high-energy tracks, not ambient piano. Time of day, day of week, device type, location, and season all carry signal.
The implementation is straightforward: add context features to your ranking model. hour_of_day, day_of_week, device_type, days_since_last_visit. Let the model learn that users on mobile at lunch browse differently than users on desktop at night.
Typical improvement: 5-10% lift in CTR. The gains are most visible in domains with strong temporal patterns: food, media, fashion (seasonal), and event-driven categories.
5. Re-rank for diversity
Your model's top 10 recommendations are 10 black dresses. The user was browsing dresses, so the model is technically correct. But showing 10 near-identical items is a wasted opportunity. Swap positions 3, 5, 7, and 9 with the highest-scoring items from different categories. You sacrifice a fraction of relevance and gain significantly in user experience.
The formal version is Maximal Marginal Relevance (MMR): at each position, pick the item that maximizes a blend of relevance score and dissimilarity from items already in the list. The lambda parameter controls the tradeoff. Start at 0.7 (70% relevance, 30% diversity) and tune from there.
Typical improvement: 0-5% change in CTR (can go either direction), but 15-30% improvement in catalog coverage and long-term user retention. Diversity pays off over weeks and months, not in single-session metrics.
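A compact MMR implementation, assuming a `similarity(a, b)` function in [0, 1]. The toy category-based similarity below is only for illustration; a real system would use embedding distance.

```python
def mmr_rerank(scores, similarity, k, lam=0.7):
    """Maximal Marginal Relevance re-ranking.
    scores: {item: relevance}. At each position, pick the item maximizing
    lam * relevance - (1 - lam) * (max similarity to items already chosen)."""
    selected, remaining = [], set(scores)
    while remaining and len(selected) < k:
        def mmr(item):
            max_sim = max((similarity(item, s) for s in selected), default=0.0)
            return lam * scores[item] - (1 - lam) * max_sim
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy catalog: items in the same category count as fully similar.
category = {"dress1": "dress", "dress2": "dress", "bag1": "bag"}
sim = lambda a, b: 1.0 if category[a] == category[b] else 0.0
scores = {"dress1": 0.9, "dress2": 0.85, "bag1": 0.6}
print(mmr_rerank(scores, sim, k=3))  # ['dress1', 'bag1', 'dress2']
```

Note what happened: the lower-scoring bag jumped ahead of the second dress because the diversity penalty outweighed the 0.05 relevance gap. That is exactly the tradeoff lambda controls.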
6. Use negative signals
Your model learns from what users clicked. It should also learn from what users rejected. A product that was shown, clicked, and immediately bounced is a negative signal. A product that was purchased and returned is a strong negative signal. A product that was shown repeatedly and never clicked is a weak negative signal.
Most systems ignore returns entirely. A returned item stays in the training data as a "purchase," teaching the model that this user liked the item. They didn't. They actively disliked it enough to go through the hassle of returning it. Flip that signal.
Typical improvement: 5-10% reduction in return rate for recommended items. The ROI is directly measurable in reduced reverse logistics costs.
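One way to encode this hierarchy of signals, assuming a hypothetical event schema with `action`, `returned`, and `dwell_s` fields. The specific thresholds and weights are illustrative.

```python
def label_interaction(event):
    """Convert a raw event dict into a signed training label.
    Assumed schema: {'action': str, 'returned': bool, 'dwell_s': float}."""
    if event.get("returned"):
        return -1.0            # purchase-then-return: strong negative
    if event["action"] == "purchase":
        return 1.0
    if event["action"] == "click" and event.get("dwell_s", 0) < 3:
        return -0.2            # click-and-bounce: mild negative
    if event["action"] == "view" and event.get("dwell_s", 0) >= 30:
        return 0.4             # engaged view: positive
    return 0.0                 # ambiguous events carry no signal

print(label_interaction({"action": "purchase", "returned": True}))   # -1.0
print(label_interaction({"action": "click", "dwell_s": 2}))          # -0.2
```

The critical line is the first one: checking `returned` before `purchase` is what flips the signal that most pipelines silently record as positive.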
7. A/B test everything (offline metrics lie)
You improved NDCG by 12% offline. Congratulations. Deploy it to 5% of traffic and measure incremental revenue. If the lift is zero, your model is better at predicting what users already intended to buy, not at influencing what they actually buy. Those are different skills.
The metric hierarchy for A/B tests: (1) incremental revenue per user, (2) conversion rate on recommendations, (3) catalog exploration breadth, (4) 30-day retention delta. CTR on recommendations is a vanity metric. A clickbait recommendation has high CTR and zero conversion. Optimize for value, not attention.
Typical improvement: Varies, but the insight is the improvement. Teams that A/B test rigorously ship fewer models but better ones. The average offline-to-online correlation in recommendation systems is shockingly low. Do not trust your notebook.
8. Connect your data graph
Methods 1-7 make your model smarter about user-item pairs. Method 8 makes your model smarter about the WORLD those pairs live in.
Collaborative filtering is like asking "what did people who look like you buy?" Graph-based recommendations ask "what did people who THINK like you buy?" The difference is that looking alike is surface-level. Thinking alike is structural. Two users might have completely different demographics and purchase histories, but if they navigate your catalog in similar patterns, engage with similar review content, and connect to similar product clusters, they think alike. You can only see that in the graph.
The graph connects everything: users to items through purchases, items to categories through taxonomy, categories to other items through shared attributes, items to reviews through sentiment, and reviews to other users through authorship. Every edge is a signal. Every path is a potential recommendation reason.
Typical improvement: 2-4x lift in MAP@K over flat-table approaches on the same data. This is not incremental. This is a category change.
The graph advantage: why connected data changes everything
Traditional recommendation models are like reading a book by looking at individual sentences in isolation. Graph-based models read the whole chapter. The meaning - the connections between characters, the narrative arc, the subtext - only emerges when you see how everything relates.
Multi-hop discovery
Here is a recommendation that collaborative filtering cannot make but a graph model can: User A purchased Product X. Product X was reviewed positively by User B. User B also purchased Product Y. Product Y is in the same category as Product Z, which has high ratings from users with similar browsing patterns to User A. Recommend Product Z to User A.
That is four hops through the graph. Each hop adds context. Each hop narrows the recommendation from "random item" to "item connected to this user through multiple independent paths of evidence." Collaborative filtering sees one hop: users who bought X also bought Y. The graph sees the full reasoning chain.
Cold-start, solved structurally
A new product has zero purchases. In collaborative filtering, it is invisible. In the graph, it connects to existing products through shared category, brand, price range, supplier, and attribute edges. A new Nike running shoe at $135 is immediately connected to every other Nike shoe, every other running shoe in that price range, and every user who has engaged with those connected items. The graph gives the new product a recommendation surface from day one, without waiting for interaction data to accumulate.
The benchmark: traditional methods vs. graph approach
The RelBench benchmark provides standardized comparisons on real-world relational datasets. On the recommendation task:
| approach | MAP@K (%) | feature_engineering_required | what_it_captures |
|---|---|---|---|
| LightGBM + manual features | 1.79 | Yes (extensive joins, aggregations, time windows) | Static aggregates from flattened tables |
| GraphSAGE | 1.85 | No (reads graph structure) | Local graph neighborhood patterns |
| KumoRFM | 7.29 | No (reads raw relational tables directly) | Multi-hop patterns, temporal dynamics, full entity relationships |
KumoRFM achieves 7.29 MAP@K vs. GraphSAGE at 1.85 and LightGBM at 1.79. A 4x improvement from seeing the full relational structure.
That 4x gap deserves attention. It is not a marginal improvement from a better loss function or a clever training trick. It is the difference between a model that sees user-item pairs and a model that sees the entire relational structure those pairs exist within. GraphSAGE reads graph structure but uses a relatively simple message passing scheme. KumoRFM reads the full relational data with temporal awareness and multi-hop reasoning. The data is the same. The depth of understanding is not.
PQL Query
PREDICT TOP 5 product_id FOR EACH users.user_id RANK BY purchase_probability
This query generates the top 5 product recommendations for each user, ranked by predicted purchase probability. The model reads the full relational graph - users, products, categories, reviews, transactions - and reasons through multi-hop paths to surface items the user is most likely to purchase.
Output
| user_id | rank | product_id | purchase_probability | recommendation_reason |
|---|---|---|---|---|
| U-2201 | 1 | P-8834 | 0.73 | Users with similar purchase graphs bought this; matches category affinity |
| U-2201 | 2 | P-5512 | 0.68 | Same brand cluster as recent purchases; high review similarity |
| U-2201 | 3 | P-9901 | 0.61 | Trending in connected user segment; seasonal relevance |
| U-3307 | 1 | P-1120 | 0.79 | Strong multi-hop path through shared category and reviewer network |
| U-3307 | 2 | P-4455 | 0.65 | Cold-start item connected via brand, price range, and supplier edges |
Traditional recommendation model
- Treats each user-item pair independently
- Requires manual feature engineering from multiple tables
- Cannot recommend items with zero interaction history
- Cannot capture multi-hop relationships (user > product > reviewer > product)
- Typical MAP@K: 1-2 on relational benchmarks
Graph-based recommendation model
- Reasons over the full relationship structure
- No manual feature engineering required
- Recommends new items from day one via attribute connections
- Discovers multi-hop paths through message passing
- Typical MAP@K: 5-7+ on relational benchmarks
Recommendation tools: an honest comparison
The right tool depends on your catalog size, your engineering team, and whether recommendations are your core product or a feature of your product. Some of these are libraries you integrate. Some are platforms you buy. Here's the honest breakdown.
| tool | type | price | best_for | honest_limitation |
|---|---|---|---|---|
| Collaborative Filtering (Surprise, Implicit) | Open-source library | Free | Prototyping and baselines. Start here to establish a floor to beat. | Plateaus fast. No cold-start handling. You build and maintain everything. |
| Amazon Personalize | Managed AWS service | $0.05/GB + inference | Teams already on AWS who want recommendations without building infrastructure. | Black box. Limited customization. Costs scale unpredictably with traffic. |
| Dynamic Yield (Mastercard) | Personalization platform | Enterprise pricing | E-commerce personalization across web, email, and ads. Strong A/B testing. | Primarily rule-based with ML overlay. Not a deep learning recommendation engine. |
| Bloomreach | Commerce experience platform | Enterprise pricing | Product discovery and search-driven recommendations for retail. | Tightly coupled to their search product. Less flexible for non-retail use cases. |
| Algolia Recommend | API-based recommendations | Usage-based pricing | Fast integration for teams that already use Algolia Search. Simple API. | Limited to item-to-item and trending. No deep personalization or graph reasoning. |
| Google Recommendations AI | Managed GCP service | Pay-per-prediction | Large-scale retail with Google Cloud infrastructure. Strong integration with BigQuery. | Requires significant GCP commitment. Pricing opaque at scale. |
| Kumo.ai | Relational foundation model | Free tier / Enterprise | Multi-table predictions without feature engineering. Graph-native recommendations. | Requires relational data. If your data is a single interaction log, simpler tools suffice. |
Kumo.ai is the only tool that reads multi-table relational data natively for recommendations. But if you need a quick integration with an existing search product, Algolia Recommend gets you live in days.
Picking the right tool for your situation
- Early-stage, proving the value of recs: Open-source collaborative filtering (Implicit library) to build a baseline in a week. Measure whether recommendations drive incremental revenue. If yes, invest in something better.
- Mid-stage, need production recs without a large ML team: Amazon Personalize or Google Recommendations AI. Managed infrastructure, reasonable accuracy, minimal ML expertise required.
- E-commerce with rich catalog and user data across multiple tables: Kumo.ai reads your relational database directly and captures cross-table signals (user-product-category-review-brand) that pairwise models miss.
- Already using Algolia for search: Algolia Recommend. The integration is trivial and you get item-to-item recommendations out of the box. Upgrade later if you need deeper personalization.
6 common recommendation mistakes (and what to do instead)
These mistakes are not theoretical. They are running in production at companies right now, quietly degrading user experience and leaving revenue on the table.
1. Optimizing for clicks, not value
Your model learns to maximize click-through rate. It discovers that sensational titles, deep discounts, and clickbait thumbnails get clicks. The recommendations look engaging. The conversion rate is terrible. Users click, browse for 3 seconds, and bounce. You optimized for attention, not intent.
Fix: optimize for downstream value. Train on purchases, not clicks. Or better: train on the weighted hierarchy from method 2, where purchase is 1.0, add-to-cart is 0.7, extended view is 0.4, and click is 0.2. Let the model learn that a click without follow-through is a weak signal.
2. The echo chamber
Your model recommends thrillers to thriller readers. They engage with thrillers. The model recommends more thrillers. The user's profile calcifies. Six months later they leave because the platform feels repetitive. You optimized for session engagement and destroyed long-term retention.
Fix: measure 90-day retention by recommendation diversity quartile. If users who received diverse recommendations retain better than users who received narrow recommendations, your model is creating echo chambers. Add diversity re-ranking or epsilon-greedy exploration.
3. Ignoring cold-start until launch day
The team builds a beautiful collaborative filtering model on historical data. It works great in the notebook. Launch day: 40% of the traffic is new users with zero history. The model returns bestsellers for all of them. The personalization project that took 6 months delivers zero personalization for nearly half your users.
Fix: design for cold-start from day one. Content-based fallbacks, onboarding flows that capture preferences, and graph-based models that connect new users through contextual signals. Cold-start is not an edge case. It is your largest user segment.
4. Training on implicit feedback without cleaning it
A user clicked on a product because it appeared in position 1 of search results, not because they were interested. Another user viewed a product page for 45 seconds because they were comparison shopping and decided against it. A third user "purchased" a product that they returned 3 days later. All three show up as positive signals in your training data. They are not.
Fix: weight interactions by signal strength and correct for position bias. Discount clicks in position 1 (they get clicked regardless of relevance). Treat returns as negative signals. Require minimum dwell time for views to count as positive.
5. Never measuring beyond accuracy
Your MAP@K is excellent. Your model recommends the same 200 items (out of 50,000) to 90% of users. Coverage: 0.4%. Diversity: near zero. Serendipity: zero. The model found a local optimum where popular items are always "correct" and the long tail is invisible. Your catalog is rotting.
Fix: add coverage, diversity, and novelty to your evaluation dashboard alongside accuracy metrics. Set minimum thresholds: "at least 15% of catalog recommended per week" or "no more than 30% of any user's recommendations from the same category." Treat these as hard constraints, not nice-to-haves.
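Both health metrics are cheap to compute from recommendation logs. A sketch, with a toy category-distance function standing in for embedding distance:

```python
from itertools import combinations

def coverage(all_recs, catalog_size):
    """Share of the catalog recommended to at least one user."""
    seen = set()
    for recs in all_recs:
        seen.update(recs)
    return len(seen) / catalog_size

def intra_list_diversity(recs, distance):
    """Mean pairwise distance within one user's list (higher = more diverse)."""
    pairs = list(combinations(recs, 2))
    if not pairs:
        return 0.0
    return sum(distance(a, b) for a, b in pairs) / len(pairs)

# Toy check: 3 users, a catalog of 10 items, distance = different category.
category = {1: "a", 2: "a", 3: "b", 4: "c"}
dist = lambda x, y: 0.0 if category[x] == category[y] else 1.0
all_recs = [[1, 2, 3], [1, 2, 4], [1, 3, 4]]
print(coverage(all_recs, catalog_size=10))    # 4 distinct items -> 0.4
print(intra_list_diversity([1, 2, 3], dist))  # 2 of 3 pairs differ -> 0.667
```

Wire these into the same dashboard as MAP@K and alert when they fall below your thresholds, the same way you would alert on accuracy regressions.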
6. Rebuilding the model from scratch for every use case
The product team wants homepage recommendations. Build a model. Now they want "you might also like" on product pages. Build another model. Now email recommendations. A third model. Cart recommendations. A fourth. Each model has its own feature pipeline, training job, serving infrastructure, and maintenance burden. Your ML team is drowning in recommendation variants.
Fix: invest in a unified representation layer. Learn user and item embeddings once, from the full interaction graph. Then use those embeddings as input to lightweight task-specific heads for each use case. One training pipeline, one embedding store, multiple recommendation surfaces. Graph-based approaches excel here because the learned embeddings capture general relational structure that transfers across recommendation tasks.