The business problem
Netflix estimates that its recommendation system is worth $1 billion per year in retained subscribers. YouTube serves over 700 million recommendations per day. For streaming and media platforms, recommendation quality directly determines engagement, retention, and revenue. A 0.1% improvement in click-through rate at YouTube's scale translates to millions in additional ad revenue.
Content-based filtering recommends similar content: if you watched a sci-fi movie, here are more sci-fi movies. This creates filter bubbles and misses serendipitous recommendations. Collaborative filtering adds cross-user signal but struggles with cold-start content and cannot leverage rich content metadata.
Why flat ML fails
- Filter bubbles: Content features (genre, tags) create narrow recommendation corridors. GNNs discover that documentary fans also enjoy historical dramas through cross-user paths.
- Shallow engagement signal: Click-through rate is a noisy signal. The graph encodes richer engagement: completion rate, re-watches, shares, and saves carry different weight.
- Content cold-start: New content has no interaction data. The graph connects new content to its creator, genre, and similar existing content, providing immediate relevance signals.
- Temporal dynamics: Content relevance decays and trends emerge rapidly. Graph-based models capture trending patterns through rapid edge accumulation on new content.
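The engagement-depth point above can be made concrete with a toy scoring function. The weights below are illustrative assumptions for this sketch, not values from the text:

```python
# Toy engagement score: deeper interactions get higher weight.
# The weight values here are illustrative assumptions.
ENGAGEMENT_WEIGHTS = {
    'click': 0.1,
    'watch': 1.0,    # scaled by completion percentage
    'rewatch': 1.5,
    'share': 2.0,
    'save': 2.5,
}

def engagement_score(events):
    """Sum weighted interactions for one (user, content) pair.

    events: list of (event_type, completion_pct) tuples, where
    completion_pct is only meaningful for 'watch' events.
    """
    score = 0.0
    for event_type, completion_pct in events:
        w = ENGAGEMENT_WEIGHTS[event_type]
        if event_type == 'watch':
            w *= completion_pct  # partial watches count less
        score += w
    return score

# A completed watch plus a share outweighs ten raw clicks.
print(engagement_score([('watch', 1.0), ('share', None)]))  # 3.0
```

This is the intuition a GNN learns implicitly when watched, liked, and shared are modeled as separate edge types rather than collapsed into a single click signal.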
The relational schema
Node types:
User (id, age_bucket, geo, platform)
Content (id, type, genre, duration, title_emb)
Creator (id, follower_count, genre_focus)
Tag (id, name, category)
Edge types:
User --[watched]--> Content (completion_pct, timestamp)
User --[liked]--> Content (timestamp)
User --[shared]--> Content (timestamp)
Content --[created_by]--> Creator
Content --[has_tag]--> Tag
Content --[similar_to]--> Content (cosine_sim)

Multiple interaction types capture engagement depth. Creator and tag nodes enable cold-start recommendations for new content.
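The schema can be written down as data before any modeling. A minimal sketch using plain Python dicts, with the node and edge types taken from the schema above (the validation helper is hypothetical):

```python
# Node types and their attributes, as listed in the schema.
NODE_TYPES = {
    'user': ['id', 'age_bucket', 'geo', 'platform'],
    'content': ['id', 'type', 'genre', 'duration', 'title_emb'],
    'creator': ['id', 'follower_count', 'genre_focus'],
    'tag': ['id', 'name', 'category'],
}

# (source, relation, target) triples with their edge attributes.
EDGE_TYPES = {
    ('user', 'watched', 'content'): ['completion_pct', 'timestamp'],
    ('user', 'liked', 'content'): ['timestamp'],
    ('user', 'shared', 'content'): ['timestamp'],
    ('content', 'created_by', 'creator'): [],
    ('content', 'has_tag', 'tag'): [],
    ('content', 'similar_to', 'content'): ['cosine_sim'],
}

def validate_schema():
    """Check that every edge endpoint is a declared node type."""
    for src, rel, dst in EDGE_TYPES:
        assert src in NODE_TYPES, f'unknown source type: {src}'
        assert dst in NODE_TYPES, f'unknown target type: {dst}'

validate_schema()
```

The same (source, relation, target) triples reappear as the keys of the HeteroConv layers in the model below, which is what makes the schema-to-architecture mapping mechanical.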
PyG architecture: heterogeneous SAGEConv
import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv, HeteroConv, Linear


class ContentRecGNN(torch.nn.Module):
    def __init__(self, hidden_dim=128):
        super().__init__()
        # Lazy linear layers (in_features=-1) project each node
        # type's raw features to a shared hidden dimension.
        self.user_lin = Linear(-1, hidden_dim)
        self.content_lin = Linear(-1, hidden_dim)
        self.creator_lin = Linear(-1, hidden_dim)
        self.tag_lin = Linear(-1, hidden_dim)
        # Reverse edges (e.g. added via T.ToUndirected()) are
        # required: without an edge type pointing INTO 'user',
        # user nodes receive no messages and HeteroConv drops
        # their embeddings after the first layer.
        self.conv1 = HeteroConv({
            ('user', 'watched', 'content'): SAGEConv(hidden_dim, hidden_dim),
            ('user', 'liked', 'content'): SAGEConv(hidden_dim, hidden_dim),
            ('user', 'shared', 'content'): SAGEConv(hidden_dim, hidden_dim),
            ('content', 'rev_watched', 'user'): SAGEConv(hidden_dim, hidden_dim),
            ('content', 'created_by', 'creator'): SAGEConv(hidden_dim, hidden_dim),
            ('content', 'has_tag', 'tag'): SAGEConv(hidden_dim, hidden_dim),
        }, aggr='sum')
        self.conv2 = HeteroConv({
            ('user', 'watched', 'content'): SAGEConv(hidden_dim, hidden_dim),
            ('user', 'liked', 'content'): SAGEConv(hidden_dim, hidden_dim),
            ('content', 'rev_watched', 'user'): SAGEConv(hidden_dim, hidden_dim),
            ('content', 'created_by', 'creator'): SAGEConv(hidden_dim, hidden_dim),
            ('content', 'has_tag', 'tag'): SAGEConv(hidden_dim, hidden_dim),
        }, aggr='sum')

    def encode(self, x_dict, edge_index_dict):
        x_dict = {
            'user': self.user_lin(x_dict['user']),
            'content': self.content_lin(x_dict['content']),
            'creator': self.creator_lin(x_dict['creator']),
            'tag': self.tag_lin(x_dict['tag']),
        }
        x_dict = {k: F.relu(v)
                  for k, v in self.conv1(x_dict, edge_index_dict).items()}
        x_dict = self.conv2(x_dict, edge_index_dict)
        return x_dict

    def predict(self, user_emb, content_emb):
        # Recommendation score is the dot product of the
        # user and content embeddings.
        return (user_emb * content_emb).sum(dim=-1)

Separate edge types for watched/liked/shared let the model weight engagement depth differently. Creator and tag connections enable cold-start recommendations.
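At serving time, the dot-product scorer reduces recommendation to a nearest-neighbor lookup over precomputed content embeddings. A minimal NumPy sketch of top-k retrieval (array shapes, k, and the random data are illustrative):

```python
import numpy as np

def top_k_contents(user_emb, content_embs, k=3):
    """Rank contents by dot product with the user embedding.

    user_emb: (d,) array; content_embs: (n, d) array.
    Returns indices of the k highest-scoring contents.
    """
    scores = content_embs @ user_emb  # (n,) dot products
    return np.argsort(-scores)[:k]   # indices, descending by score

# Illustrative random embeddings standing in for GNN output.
rng = np.random.default_rng(0)
user = rng.normal(size=8)
contents = rng.normal(size=(10, 8))
print(top_k_contents(user, contents))
```

In production this brute-force scan would typically be replaced by an approximate nearest-neighbor index, but the scoring function is identical to the model's predict method.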
Expected performance
- Content-based filtering: ~50 AUROC (near random)
- LightGBM (flat-table): 62.44 AUROC
- GNN (heterogeneous SAGEConv): 75.83 AUROC
- KumoRFM (zero-shot): 76.71 AUROC
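The gaps in the list above read as absolute AUROC lifts; a quick check of the arithmetic:

```python
# AUROC figures from the comparison above.
results = {
    'content_based': 50.0,
    'lightgbm': 62.44,
    'gnn': 75.83,
    'kumorfm': 76.71,
}

# Graph structure over the flat-table baseline, in AUROC points.
gnn_lift = results['gnn'] - results['lightgbm']
# Zero-shot KumoRFM over the trained GNN.
rfm_lift = results['kumorfm'] - results['gnn']

print(f'GNN vs LightGBM: +{gnn_lift:.2f} AUROC points')
print(f'KumoRFM vs GNN:  +{rfm_lift:.2f} AUROC points')
```

Most of the headroom comes from modeling the graph at all; the zero-shot model then edges past the hand-built GNN without any task-specific training.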
Or use KumoRFM in one line
PREDICT content_id FOR user
USING user, content, creator, interaction

One PQL query. KumoRFM handles content cold-start, engagement depth, and cross-genre discovery automatically.