
Improving Recommendation Systems with LLMs and Graph Transformers
Why graph transformers outperform LLMs by 15x for personalization, and how combining both achieves the best results.
The Recommendation Problem
Recommendation systems drive revenue across e-commerce, streaming, advertising, and social media. The core task sounds simple: predict which items a user will interact with next. In practice, it requires capturing two distinct types of signal: what the items are (text descriptions, categories, attributes) and how users behave (purchase sequences, browsing patterns, co-purchase relationships).
Large language models excel at understanding item descriptions. They produce rich semantic embeddings that capture meaning, context, and nuance in product text. Graph neural networks excel at modeling behavior. They capture direct interactions (who bought what), indirect connections (users who buy similar products), and temporal patterns (how preferences shift over time).
The question this research addresses: can you combine LLM text understanding with graph-based behavioral modeling to get the best of both? The team tested this on H&M's real fashion recommendation dataset from Kaggle, containing customer profiles, transaction histories, and detailed article descriptions across three interconnected tables.
Why LLMs alone fall short
An LLM can encode product information (name, description, color, material) into a dense vector that captures semantic similarity. Cotton t-shirts cluster near other cotton t-shirts. Evening dresses cluster near formal wear. The embeddings are semantically rich.
But personalization requires more than item similarity. It requires understanding behavioral patterns: which customers buy which products, in what order, at what frequency, and how those patterns relate to other customers' behaviors. LLM embeddings encode what products are. They do not encode what customers do.
Encoder-Based LLMs for Embeddings
Encoder-based LLMs use a transformer architecture to convert text into continuous vector representations (embeddings). Unlike generative models that produce text token by token, encoder models process the full input at once and output a fixed-size vector that encapsulates semantic meaning and context-sensitive information.
Two specific encoder models were tested in this research:
- OpenAI text-embedding-3-large: a commercial embedding model producing high-dimensional vectors optimized for semantic similarity tasks.
- intfloat/e5-base-v2 (HuggingFace): an open-source sentence transformer that generates embeddings competitive with larger commercial models at lower computational cost.
For the product recommendation task, all product information (product name, detailed description, color, material) is concatenated into a single text string and fed through the encoder. The output is a dense vector that positions semantically similar products near each other in embedding space.
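The concatenation step is simple string assembly. A minimal sketch (the field names here are illustrative stand-ins, not the exact H&M schema):

```python
# Hedged sketch of the text-concatenation step. Field names like
# "prod_name" and "detail_desc" are illustrative, not the exact schema.
def article_to_text(article: dict) -> str:
    """Concatenate an article's text and categorical fields into one string."""
    fields = ["prod_name", "detail_desc", "colour", "material", "shape", "type"]
    return " ".join(str(article[f]) for f in fields if article.get(f))

article = {
    "prod_name": "Rain jacket",
    "detail_desc": "Lightweight waterproof running jacket with hood.",
    "colour": "navy",
    "material": "nylon",
    "shape": "regular fit",
    "type": "outerwear",
}
text = article_to_text(article)  # this single string is fed to the encoder
```

The encoder then maps this string to one dense vector per product.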
The LLM-only baseline approach
In the LLM-only approach, product embeddings are computed directly from item text. Customer embeddings are computed as the average of embeddings from all products that customer previously purchased. Recommendations are generated by finding the products whose embeddings are closest to the customer embedding.
This approach has a fundamental limitation: it reduces each customer to a centroid of their purchase history. A customer who bought both running shoes and a formal suit gets an embedding somewhere between athletic wear and formalwear, which may not meaningfully represent either interest. The averaging destroys temporal ordering, frequency information, and the distinction between different purchasing modes.
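The baseline can be sketched in a few lines of numpy. Random vectors stand in for real LLM embeddings; the structure (mean-of-purchases centroid, cosine nearest neighbors) is the point:

```python
import numpy as np

# Minimal sketch of the LLM-only baseline. Random unit vectors stand in
# for real LLM product embeddings.
rng = np.random.default_rng(0)
num_products, dim = 50, 8
product_emb = rng.normal(size=(num_products, dim))
product_emb /= np.linalg.norm(product_emb, axis=1, keepdims=True)

# Customer embedding = mean of previously purchased product embeddings.
purchases = [3, 17, 42]
customer_emb = product_emb[purchases].mean(axis=0)

# Recommend products whose embeddings are closest (cosine) to the centroid.
scores = product_emb @ customer_emb
scores[purchases] = -np.inf          # exclude already-purchased items
top12 = np.argsort(-scores)[:12]     # indices of the 12 recommendations
```

Note that `customer_emb` collapses the whole purchase history into one point, which is exactly the centroid limitation described above.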
The Graph Transformer Approach
Kumo's graph transformer takes a fundamentally different approach. Instead of encoding items and users independently, it models the entire recommendation problem as a heterogeneous temporal graph. Customers, transactions, and products become nodes. Purchases become edges with timestamps. The graph captures not just what was bought, but when, by whom, and in what context.
1. Build Graph: connect customers, transactions, and articles into a heterogeneous temporal graph with timestamped edges.
2. Encode Features: process node features (text via GloVe or LLM embeddings, categorical attributes, numerical values).
3. Message Passing: graph transformer layers propagate information across multi-hop neighborhoods, capturing behavioral patterns.
4. Predict: generate top-12 article recommendations per customer for the next 7 days.
Why graphs capture what LLMs miss
The graph structure enables multi-hop reasoning. If Customer A bought products that Customer B also bought, and Customer B recently purchased a new item, the graph can propagate that signal to Customer A as a recommendation. This collaborative filtering happens naturally through message passing across the graph, without any explicit feature engineering.
The temporal dimension adds another layer. The graph transformer understands that a customer who bought winter coats in November and running gear in March has seasonal preferences. It distinguishes between a customer who buys weekly and one who buys quarterly. These temporal patterns are encoded directly in the graph structure, not approximated through manual aggregations.
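The two-hop collaborative signal can be illustrated with a toy interaction matrix. This is only the intuition behind message passing; a real graph transformer learns these propagation weights rather than using raw matrix products:

```python
import numpy as np

# Toy illustration of the two-hop signal: A @ A.T @ A propagates
# "customers who overlap with me bought this" through the graph.
A = np.array([           # rows = customers A, B, C; cols = items 0..3
    [1, 1, 0, 0],        # Customer A bought items 0 and 1
    [1, 1, 0, 1],        # Customer B overlaps with A and also bought item 3
    [0, 0, 1, 0],        # Customer C has no overlap with A
], dtype=float)

scores = A @ A.T @ A     # two rounds of propagation over the bipartite graph
scores[A > 0] = -np.inf  # mask items each customer already purchased
rec_for_A = int(np.argmax(scores[0]))  # item 3, reached via Customer B
```

Customer C's purchase never reaches Customer A because there is no connecting path, which is the behavior the prose above describes.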
Kumo's baseline text handling
In the graph-only configuration (without LLM integration), Kumo uses GloVe embeddings for text columns. GloVe produces word-level representations that are averaged across all words in a text field. This captures basic semantic content but lacks the contextual understanding of modern transformer-based models. The phrase “lightweight waterproof running jacket” gets a simple average of its word vectors rather than a context-aware encoding of the full description.
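The averaging limitation is easy to demonstrate. In this sketch the 4-dimensional vectors are made up for illustration (real GloVe vectors are 50-300 dimensional and pretrained), but the order-invariance holds for any bag-of-words average:

```python
import numpy as np

# Sketch of GloVe-style bag-of-words averaging. The 4-d vectors are
# made-up placeholders; real GloVe vectors are pretrained and larger.
glove = {
    "lightweight": np.array([0.9, 0.1, 0.0, 0.2]),
    "waterproof":  np.array([0.1, 0.8, 0.1, 0.0]),
    "running":     np.array([0.0, 0.1, 0.9, 0.3]),
    "jacket":      np.array([0.2, 0.3, 0.1, 0.8]),
}

def glove_encode(text: str) -> np.ndarray:
    """Average word vectors; word order and context are discarded."""
    vecs = [glove[w] for w in text.lower().split() if w in glove]
    return np.mean(vecs, axis=0)

# Any permutation of the words yields the identical embedding:
a = glove_encode("lightweight waterproof running jacket")
b = glove_encode("jacket running waterproof lightweight")
```

A transformer encoder, by contrast, produces different representations for differently ordered or contextualized phrases.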
Combining LLMs with Graph Transformers
The core technical contribution of this research is the integration pipeline: using LLM-generated embeddings as input features to Kumo's graph transformer. Rather than treating text understanding and behavioral modeling as separate systems, the combined approach feeds rich semantic representations directly into the graph network.
How the integration works
Kumo supports a Sequence data type that can consume encoder-based LLM embeddings directly as feature inputs. The process works as follows:
- Text and categorical columns in the articles table (product name, detailed description, color, material, shape, type) are concatenated into a single string per product.
- The concatenated text is pre-encoded using an LLM encoder (either OpenAI text-embedding-3-large or HuggingFace e5-base-v2), producing a dense vector per product.
- These vectors are ingested by Kumo as Sequence-type features on the article nodes in the graph.
- The graph transformer then uses these semantically rich node features during message passing, combining LLM text understanding with graph-based relational reasoning.
1. Concatenate Text: merge product name, description, color, material, shape, and type into one string.
2. LLM Encode: pass through the OpenAI or HuggingFace encoder to produce dense embedding vectors.
3. Ingest as Sequence: load embeddings into Kumo as Sequence-type features on article nodes.
4. Graph Training: the graph transformer combines LLM features with relational structure during training.
The predictive query
The actual prediction task is expressed as a declarative query: predict the top 12 distinct articles each customer is likely to purchase in the next 7 days. This query drives the entire pipeline, from graph construction to training objective to evaluation. The graph transformer optimizes directly for this ranking task, using the LLM embeddings as enriched node features alongside all other available signals.
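Expressed in Kumo's predictive query language, such a query might look roughly like the sketch below; the exact syntax is defined by Kumo's documentation, so treat this as illustrative:

```
PREDICT LIST_DISTINCT(transactions.article_id, 0, 7, days)
RANK TOP 12
FOR EACH customers.customer_id
```

The declarative form is what lets the same pipeline swap encoders without touching the training objective.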
H&M Dataset and Experiment Design
All experiments use the H&M Personalized Fashion Recommendations dataset, a public Kaggle competition dataset for product recommendation based on real transaction history. The dataset contains three linked tables:
- Customers: customer profile data including demographic attributes and membership information.
- Transactions: purchase history with timestamps, linking customers to the articles they bought.
- Articles: product information including product name, detailed description, color, material, shape, and type (categorical and text attributes).
Four experimental configurations
LLM-Only
Baseline
- +Rich semantic product embeddings
- +No graph infrastructure needed
- −No behavioral signal
- −Customer = average of purchases
- −2x to 40x worse than graph approaches
Kumo-Only (GloVe)
Strong baseline
- +Full relational structure
- +Temporal behavior patterns
- +Multi-hop reasoning
- −GloVe lacks contextual text understanding
- −Word-level averaging loses phrase meaning
Kumo + HuggingFace
Better
- +Graph structure + transformer text
- +Open-source encoder (e5-base-v2)
- +4% improvement over graph-only
- −Smaller model, less semantic depth
Kumo + OpenAI
Best overall
- +Graph structure + strongest text encoder
- +4% to 11% improvement over graph-only
- +Best scores across all metrics
- −Requires API access for embeddings
The LLM-only baseline uses OpenAI text-embedding-3-large to compute product embeddings, then represents each customer as the average embedding of their previously purchased products. The Kumo-only baseline builds the full heterogeneous temporal graph and uses GloVe for text encoding. The two combined configurations replace GloVe with LLM encoders while keeping all other graph architecture components identical.
Evaluation metrics
All models are evaluated on the same task: recommend the top 12 articles each customer is most likely to purchase in the next 7 days. Four metrics capture different aspects of recommendation quality:
- MAP@12 (Mean Average Precision): measures ranking quality by rewarding correct items placed higher in the list.
- Precision@12: proportion of recommended items that were actually purchased.
- Recall@12: proportion of actual purchases that appeared in the top 12 recommendations.
- F1@12: harmonic mean of precision and recall, balancing both objectives.
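The four metrics have compact reference implementations. The sketch below computes them for a single customer; in the benchmark each is averaged over all customers (the @12 truncation follows the standard Kaggle-style definition):

```python
# Per-customer reference implementations; the benchmark averages these
# over all customers.
def precision_at_k(recommended, actual, k=12):
    """Fraction of the top-k recommendations that were actually purchased."""
    return len(set(recommended[:k]) & set(actual)) / k

def recall_at_k(recommended, actual, k=12):
    """Fraction of actual purchases that appear in the top-k list."""
    return len(set(recommended[:k]) & set(actual)) / len(actual) if actual else 0.0

def f1_at_k(recommended, actual, k=12):
    """Harmonic mean of precision@k and recall@k."""
    p = precision_at_k(recommended, actual, k)
    r = recall_at_k(recommended, actual, k)
    return 2 * p * r / (p + r) if p + r else 0.0

def average_precision_at_k(recommended, actual, k=12):
    """Rewards correct items placed higher in the list; MAP@k averages this."""
    score, hits = 0.0, 0
    for i, item in enumerate(recommended[:k]):
        if item in set(actual):
            hits += 1
            score += hits / (i + 1)
    return score / min(len(actual), k) if actual else 0.0

recs = ["a", "b", "c", "d"]
bought = ["a", "c"]
# precision@12 = 2/12, recall@12 = 1.0, AP@12 = (1/1 + 2/3) / 2
```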
Results: 15x Graph Advantage, 11% LLM Uplift
The benchmark results reveal two clear findings. First, graph-based approaches massively outperform LLM-only recommendations. Second, integrating LLM embeddings into the graph provides a consistent, meaningful improvement over graph-only baselines.
| Model | MAP@12 | Precision@12 | Recall@12 | F1@12 |
|---|---|---|---|---|
| LLM-Only (OpenAI) | 0.00190 | 0.00329 | 0.00119 | 0.0071 |
| Kumo-Only (GloVe) | 0.02856 | 0.01023 | 0.05234 | 0.01564 |
| Kumo + HuggingFace (e5-base-v2) | 0.02970 | 0.01099 | 0.05531 | 0.01673 |
| Kumo + OpenAI (text-embedding-3-large) | 0.02976 | 0.01139 | 0.05670 | 0.01730 |
Finding 1: Graphs outperform LLMs by 15x
The LLM-only approach achieves a MAP@12 of 0.00190. The Kumo graph-only approach achieves 0.02856. That is a 15x improvement in ranking quality. Across other metrics, the LLM-only approach performs 2x to 40x worse depending on the metric.
This is not a marginal difference. The LLM-only approach fails at personalization because it reduces each customer to an averaged centroid of their purchase history embeddings. It cannot capture collaborative filtering signals (what similar customers bought), temporal dynamics (seasonal preferences, purchasing frequency), or the complex multi-hop relationships that exist in the transaction graph.
Finding 2: LLM embeddings improve graph models by 4-11%
When LLM embeddings replace GloVe as input features to the graph transformer, performance improves consistently across all four metrics:
- MAP@12: 0.02856 to 0.02976 (Kumo + OpenAI), a 4.2% improvement.
- Precision@12: 0.01023 to 0.01139, an 11.3% improvement.
- Recall@12: 0.05234 to 0.05670, an 8.3% improvement.
- F1@12: 0.01564 to 0.01730, a 10.6% improvement.
The Kumo + OpenAI configuration consistently outperforms Kumo + HuggingFace, suggesting that stronger text encoders produce more useful features for the graph network. The HuggingFace e5-base-v2 model still delivers meaningful improvements (roughly 4-7% across metrics), making open-source encoders a viable option when commercial API access is constrained.
What the LLM embeddings contribute
The improvement from LLM integration is not about replacing the graph's behavioral modeling. It is about giving the graph richer item representations to work with. GloVe embeddings average word vectors without context. The phrase “slim fit stretch denim” is just the average of four word vectors. An LLM encoder produces a single vector that captures the full meaning of the phrase as a unit. When the graph transformer propagates these richer features across customer-product interactions, it can make finer distinctions between products, leading to more precise recommendations.
Practical Implications
This research establishes a clear hierarchy for recommendation system architectures. LLMs alone are insufficient. Graphs alone are strong. Graphs enhanced with LLM features are strongest. The practical takeaways for teams building recommendation systems:
1. Do not use LLMs as standalone recommendation engines
The 15x performance gap between LLM-only and graph-based approaches is not a marginal tuning issue. It reflects a fundamental architectural mismatch. LLM embeddings capture item semantics, not user behavior. Building recommendations on semantic similarity alone misses the collaborative and temporal signals that drive real purchasing decisions.
2. Invest in graph infrastructure first
The jump from LLM-only (MAP@12: 0.00190) to Kumo graph-only (MAP@12: 0.02856) delivers a 15x improvement. The jump from graph-only to graph + LLM delivers an additional 4-11%. The overwhelming majority of recommendation quality comes from properly modeling the relational structure. Teams should prioritize building heterogeneous temporal graphs over fine-tuning text encoders.
3. Layer LLM embeddings on top for incremental gains
Once the graph infrastructure is in place, replacing simple text encoders (GloVe) with modern LLM embeddings provides a consistent uplift with minimal architectural changes. Kumo's Sequence data type makes this a configuration change: swap the text encoder, re-train, and measure the improvement. No changes to graph construction, training objectives, or serving infrastructure are needed.
4. Stronger encoders produce better results
OpenAI text-embedding-3-large consistently outperforms HuggingFace e5-base-v2 when integrated with the graph. As text encoders continue to improve, the combined approach will benefit automatically. Teams can upgrade their text encoder independently of their graph architecture, treating it as a pluggable component.
LLM-Only
Not viable for personalization
- +Simple to implement
- +Good item understanding
- −15x worse than graph approaches
- −No behavioral modeling
- −No collaborative filtering
Graph-Only
Strong production baseline
- +15x better than LLM-only
- +Full relational modeling
- +Temporal awareness
- −GloVe text encoding is limited
Graph + LLM
Best available approach
- +4-11% improvement over graph-only
- +Best of both paradigms
- +Pluggable text encoder upgrades