
Hybrid Graph Neural Networks for Recommendations
How a single GNN captures both repeat and exploratory user behavior, replacing multi-stage pipelines with top-1% Kaggle performance and $100M+ production impact.
The Recommendation Problem
Recommendation systems power everything from e-commerce to food delivery to streaming platforms. The challenge: users behave in fundamentally different ways. Some are repeaters who buy the same items over and over (the same coffee beans every two weeks, the same restaurant every Friday). Others are explorers who constantly seek novelty (new cuisines, new brands, new categories).
Most users exhibit both behaviors depending on context. A customer might repeatedly order from the same grocery store while exploring new restaurants. Capturing both patterns within a single model is the core technical challenge.
Why traditional approaches struggle
Production recommendation systems at large companies typically involve complex multi-stage pipelines with significant engineering overhead. A typical setup includes multiple candidate generation steps (collaborative filtering, content-based retrieval, popularity-based fallbacks) followed by ranking model ensembles that blend the candidates.
These pipelines also face persistent problems: cold-start for new users with sparse history, limited data diversity, and the sheer complexity of maintaining dozens of models that each capture a narrow slice of user behavior.
Why Graph Neural Networks?
Recommendation data naturally forms a bipartite graph. Users and items are nodes. Interactions (purchases, clicks, orders) are edges connecting them. These edges often carry timestamps, prices, ratings, and other features. Multiple edges can exist between the same user-item pair, reflecting repeat purchases over time.
This makes recommendation a link prediction task: given the graph of past interactions, predict which new edges (future purchases) will form.
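The framing above can be sketched in a few lines of Python. The interaction log, item names, and time split below are hypothetical toy data for illustration, not the original dataset:

```python
from collections import defaultdict

# Hypothetical toy interaction log: (user, item, timestamp) triples.
interactions = [
    ("u1", "coffee_beans", 1), ("u1", "coffee_beans", 15),
    ("u1", "thai_takeout", 20), ("u2", "coffee_beans", 3),
    ("u2", "new_ramen_spot", 21),
]

# Bipartite graph: users and items are nodes, each interaction is a
# timestamped edge. Parallel edges (u1's repeat coffee purchase) are kept.
edges_by_user = defaultdict(list)
for user, item, ts in interactions:
    edges_by_user[user].append((item, ts))

# Link prediction: split edges by time, train on the past edges and
# try to predict which future edges will form.
split_ts = 18
train_edges = [(u, i) for u, i, t in interactions if t <= split_ts]
future_edges = {(u, i) for u, i, t in interactions if t > split_ts}

print(future_edges)  # the edges the model must predict
```

The essential point is that the training target is not a label column but the future structure of the graph itself.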
What GNNs capture that traditional models miss
GNNs operate directly on graph structure, learning from connectivity patterns rather than hand-engineered features. A GNN processing a user node aggregates information from neighboring item nodes (past purchases), which themselves aggregate information from other users who bought them. This multi-hop message passing captures collaborative signals automatically.
Traditional models require a feature engineer to manually encode these patterns: “number of purchases in category X,” “average time between orders,” “co-purchase frequency with item Y.” A GNN discovers these relationships from the raw graph without explicit feature engineering.
- Bipartite Graph: users and items as nodes, interactions as timestamped edges.
- Neighbor Sampling: a 1-hop subgraph centered on each user captures recent interactions.
- Message Passing: the GNN aggregates features from neighboring nodes across hops.
- Link Prediction: predict which user-item edges will form in the future.
The Hybrid GNN Architecture
The key innovation is a single GNN backbone with two distinct scoring mechanisms, one optimized for repeat behavior and one for exploratory behavior, unified by a learned user-specific repetition scalar.
How it works
The model starts by sampling a 1-hop neighborhood around each user node, capturing all previously interacted items along with their features (timestamps, prices, categories). A heterogeneous GNN then computes embeddings for both users and items within this subgraph.
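As a rough illustration of that embedding step, here is a minimal numpy sketch of one round of bipartite message passing with mean aggregation. The graph, embedding dimensions, and weight matrix are random stand-ins for learned parameters, not the paper's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy bipartite graph: which items each user has interacted with.
user_items = {"u1": ["i1", "i2"], "u2": ["i2", "i3"]}
dim = 4
item_emb = {i: rng.normal(size=dim) for i in ["i1", "i2", "i3"]}
user_emb = {u: rng.normal(size=dim) for u in user_items}

# One message-passing step: each user aggregates (mean) the embeddings
# of its neighboring items, then mixes the aggregate with its own
# embedding through a weight matrix (a random stand-in here).
W = rng.normal(size=(2 * dim, dim))

def update_user(u):
    neigh = np.mean([item_emb[i] for i in user_items[u]], axis=0)
    return np.tanh(np.concatenate([user_emb[u], neigh]) @ W)

updated = {u: update_user(u) for u in user_items}
```

A second round would update item embeddings from their users, which is how multi-hop collaborative signals propagate.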
From here, the architecture branches into three parallel components:
- Approach 1, Repeat Scoring: an MLP over GNN embeddings scores items inside the user's subgraph (previously seen items).
- Approach 2, Explore Scoring: an inner product between the user embedding and shallow item embeddings scores items outside the subgraph (new discoveries).
- Repetition Scalar: an MLP-predicted per-user weight that balances repeat vs. explore scores.
Approach 1 (repeat interactions): For items the user has previously interacted with, the model applies a multi-layer perceptron (MLP) over the GNN-computed embeddings. Because these items appear in the sampled subgraph, the GNN has rich contextual information: when the user last purchased the item, how frequently they buy it, and how it relates to their other purchases.
Approach 2 (exploratory interactions): For items outside the user's subgraph (items they have never interacted with), the model uses an inner product between the user's GNN embedding and shallow item embeddings. This is structurally similar to the standard two-tower approach used in production recommender systems, but here it is one component of a larger hybrid model.
Repetition scalar: An MLP predicts a per-user scalar that adjusts the balance between repeat and explore scores. Users who tend to repurchase the same items get a higher repetition weight. Explorers get a lower one. This scalar is not a global hyperparameter; it is learned individually for each user based on their interaction history.
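The three components can be sketched together in numpy. Everything below is a hypothetical stand-in (tiny random-weight MLPs, made-up dimensions) meant only to show how the two scores and the repetition scalar combine, not the architecture's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 8

def mlp(x, W1, W2):
    # Tiny two-layer MLP stand-in for the learned scorers.
    return np.maximum(x @ W1, 0.0) @ W2

# Illustrative "learned" parameters (random stand-ins).
W1, W2 = rng.normal(size=(2 * dim, dim)), rng.normal(size=(dim, 1))
A1, A2 = rng.normal(size=(dim, dim)), rng.normal(size=(dim, 1))

user = rng.normal(size=dim)             # GNN user embedding
seen_items = rng.normal(size=(3, dim))  # GNN embeddings, inside the subgraph
new_items = rng.normal(size=(5, dim))   # shallow embeddings, outside it

# Approach 1: MLP over [user; item] pairs for previously seen items.
pairs = np.concatenate([np.tile(user, (3, 1)), seen_items], axis=1)
repeat_scores = mlp(pairs, W1, W2).ravel()

# Approach 2: inner product with shallow embeddings for unseen items.
explore_scores = new_items @ user

# Repetition scalar: a per-user blend weight in (0, 1), predicted from
# the user embedding rather than set as a global hyperparameter.
alpha = 1.0 / (1.0 + np.exp(-(np.maximum(user @ A1, 0.0) @ A2)))

# Final ranking over all candidates: repeat scores up-weighted for
# repeaters, explore scores up-weighted for explorers.
final = np.concatenate([alpha * repeat_scores, (1 - alpha) * explore_scores])
top_k = np.argsort(-final)[:4]
```

The design choice worth noting is that `alpha` is a function of the user, so the same trained model ranks repeat candidates first for a habitual repurchaser and novel candidates first for an explorer.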
Kaggle H&M Benchmark Results
The hybrid GNN was validated on the Kaggle H&M Personalized Fashion Recommendations challenge, one of the largest public recommendation benchmarks. The dataset contains 1.4 million users, 106,000 items, and 31.7 million interactions spanning two years of purchase history.
The task: predict the top 12 items each user will purchase within the next 7 days, evaluated by Mean Average Precision at 12 (MAP@12).
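For reference, the metric is straightforward to implement. The sketch below follows the standard Kaggle MAP@K definition (mean of per-user average precision, truncated at k and normalized by `min(len(actual), k)`), which to my understanding matches the competition's scoring:

```python
def average_precision_at_k(actual, predicted, k=12):
    """AP@k: sum of precision at each hit position, normalized
    by min(len(actual), k). Duplicate predictions count once."""
    score, hits = 0.0, 0
    for rank, item in enumerate(predicted[:k], start=1):
        if item in actual and item not in predicted[:rank - 1]:
            hits += 1
            score += hits / rank
    return score / min(len(actual), k) if actual else 0.0

def map_at_k(actuals, predictions, k=12):
    return sum(average_precision_at_k(a, p, k)
               for a, p in zip(actuals, predictions)) / len(actuals)

# A user who buys 2 of the predicted items, ranked 1st and 4th:
ap = average_precision_at_k({"a", "b"}, ["a", "x", "y", "b"], k=12)
print(ap)  # (1/1 + 2/4) / 2 = 0.75
```

Because the normalizer is capped at k, a user with many future purchases cannot score above 1.0, and ranking a hit higher always helps.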
| Method | MAP@12 | vs. Hybrid GNN |
|---|---|---|
| Hybrid GNN | 0.031 | Baseline |
| Kaggle Top 10% | 0.024 | -23% |
| Kaggle Median | 0.021 | -32% |
The hybrid GNN achieved 47% better performance than the Kaggle median and placed in the top 1% of submissions across more than 3,000 competing teams.
What makes this result remarkable
Top Kaggle competitors typically spend weeks on feature engineering, building elaborate ensembles of dozens of models, and fine-tuning complex multi-stage pipelines. The hybrid GNN required zero feature engineering. Training and inference ran in approximately two hours on a single GPU. The raw interaction graph was the only input.
Ablation study: why both approaches matter
Isolating each scoring approach reveals why the hybrid design is necessary:
| Scoring Approach | MAP@12 | Hybrid GNN Advantage |
|---|---|---|
| Approach 1 only (MLP for seen items) | 0.023 | +35% |
| Approach 2 only (inner product, two-tower) | 0.015 | +107% |
| Full Hybrid GNN | 0.031 | Baseline |
Approach 2 alone (the standard two-tower architecture widely used in production) scores 0.015, meaning the hybrid model delivers 107% better performance than the industry-standard approach. Even Approach 1 alone, which is strong at 0.023, still trails the hybrid by 35%. Neither component is sufficient on its own.
Production Results: Food Delivery
Kaggle benchmarks prove methodological validity, but production deployments prove business value. The hybrid GNN was deployed at a major food delivery service for restaurant recommendations, covering over 600,000 restaurant options with a 7-day prediction window.
| Method | MAP@12 |
|---|---|
| Hybrid GNN | 0.32 |
| Approach 1 only (repeat scoring) | 0.31 |
| Approach 2 only (explore scoring) | 0.27 |
Food delivery shows a different pattern than fashion retail. Approach 1 (repeat scoring) performs nearly as well as the full hybrid at 0.31 vs. 0.32. This makes intuitive sense: people reorder from the same restaurants far more frequently than they repurchase the same clothing items. The repeat signal dominates in this domain.
Yet the hybrid still outperforms, because even in a repeat-heavy domain, the exploratory component captures the 10-20% of orders where users try something new. That marginal improvement translated to over $100 million in additional sales for the company.
Traditional Multi-Stage Pipeline
Status quo at most companies
- +Well-understood engineering patterns
- +Each stage can be independently optimized
- −Tens of millions in engineering cost
- −No cross-stage signal sharing
- −Months of feature engineering per iteration
- −Cold-start remains unsolved
Two-Tower (Approach 2 Only)
Industry standard baseline
- +Simple and scalable
- +Good at discovering new items
- −Ignores repeat purchase patterns entirely
- −0.015 MAP@12 on H&M (107% worse than hybrid)
- −Misses the dominant signal in repeat-heavy domains
Hybrid GNN
Single model, dual behavior
- +Top 1% Kaggle with zero feature engineering
- +Captures both repeat and explore behavior
- +Per-user repetition scalar adapts automatically
- +$100M+ incremental revenue in production
- +2 hours training on a single GPU
- −Requires graph-structured input data
- −GNN infrastructure less mature than tabular ML
Technical Advantages and Trade-offs
The hybrid GNN collapses what traditionally requires months of engineering into a single trainable architecture. Here is what that means in practice.
Zero feature engineering
The model operates on raw interaction data: user IDs, item IDs, timestamps, and any available edge or node features. There is no feature store, no aggregation logic, no time-window computations. The GNN learns which temporal patterns, co-occurrence signals, and feature combinations matter directly from the graph.
End-to-end optimization
All three components (repeat scoring, explore scoring, repetition scalar) share gradients during training. The repeat scorer learns what “likely to repurchase” looks like while the explore scorer simultaneously learns what “likely to try something new” looks like. The repetition scalar learns each user's balance between these modes. This joint optimization is impossible in a multi-stage pipeline where each model trains independently.
Scalability
The H&M benchmark (1.4M users, 106K items, 31.7M interactions) trained in roughly two hours on one GPU. Neighbor sampling (1-hop) keeps the per-user computation bounded regardless of total graph size. The food delivery deployment scaled to 600,000+ restaurants without architectural changes.
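The bounded-computation claim is easy to see in a sketch: sampling keeps only the most recent interactions per user, so a heavy user costs roughly the same as a light one at inference time. The log, cap, and recency policy below are illustrative assumptions, not the deployment's actual sampler:

```python
from collections import defaultdict

# Illustrative interaction log: one heavy user, one light user.
log = [("u1", f"item_{i}", i) for i in range(1000)]
log += [("u2", "item_a", 5), ("u2", "item_b", 9)]

by_user = defaultdict(list)
for user, item, ts in log:
    by_user[user].append((ts, item))

def sample_subgraph(user, max_neighbors=50):
    """1-hop sample: keep only the most recent interactions, so the
    per-user work is capped no matter how large the full graph grows."""
    recent = sorted(by_user[user], reverse=True)[:max_neighbors]
    return [item for _, item in recent]

print(len(sample_subgraph("u1")))  # 50, not 1000
```

With a fixed neighbor cap, per-user cost depends on `max_neighbors` and the embedding dimensions, not on the total number of users, items, or edges.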
Where it fits and where it does not
The hybrid approach is strongest in domains where users exhibit both repeat and exploratory behavior: e-commerce, food delivery, streaming, grocery. In domains with purely exploratory behavior (one-time purchases like real estate or automobiles), the repeat scoring component adds less value, though the GNN backbone still outperforms flat feature approaches.
Implications for Recommendation Teams
The hybrid GNN paper demonstrates a broader shift in how recommendation systems should be built. Three takeaways for teams evaluating this approach:
1. Repeat behavior is a first-class signal, not noise
Most recommendation research optimizes for novelty and serendipity. The ablation results show that repeat behavior is often the dominant signal. In the food delivery deployment, Approach 1 (repeat scoring alone) achieved 0.31 MAP@12 vs. 0.27 for the explore-only approach. Teams that filter out repeat purchases as “already known” are discarding their strongest signal.
2. Graph structure replaces feature engineering
The zero-feature-engineering result on H&M is not a convenience claim. It is a performance claim. The hybrid GNN with no engineered features outperformed Kaggle teams that spent weeks building feature pipelines. The graph structure itself, when processed by a GNN, contains more predictive information than hand-crafted aggregations.
3. Single-model architectures beat ensembles when the model is expressive enough
The hybrid GNN achieves top-1% performance as a single model. This matters for production systems where ensemble complexity creates maintenance burden, latency constraints, and debugging difficulty. A single model that captures both behavioral modes is operationally simpler and performs better than a collection of specialized models.
1. Raw Interaction Data: user-item edges with timestamps and features. No feature engineering required.
2. Graph Construction: build the bipartite graph with users, items, and interaction edges.
3. Hybrid GNN Training: a single model learns repeat scoring, explore scoring, and per-user blending.
4. Production Predictions: top-K recommendations in one forward pass; two hours of training on one GPU.