
Content Recommendations: GNN on User-Content Interaction Graphs

Streaming platforms serve billions of recommendations daily. Content-based filtering traps users in filter bubbles. Here is how to build a GNN that combines content features with collaborative signals for cross-genre discovery.


TL;DR

  • Content recommendation is a link prediction problem on a user–content interaction graph. GNNs combine content features with collaborative signals for recommendations that transcend genre boundaries.
  • Heterogeneous SAGEConv models multiple interaction types (views, completions, likes, shares) with different weights, capturing engagement depth beyond simple clicks.
  • On RelBench benchmarks, GNNs achieve 75.83 AUROC vs. 62.44 for flat-table LightGBM. Cross-user collaborative signals drive the improvement.
  • Cold-start content gets recommendations via creator and genre connections. GraphSAGE is inductive: new content nodes get embeddings from features and graph context immediately.
  • KumoRFM generates content recommendations with one PQL query (76.71 AUROC zero-shot), handling cold-start, engagement depth, and diversity automatically.

The business problem

Netflix estimates that its recommendation system is worth $1 billion per year in retained subscribers. YouTube serves over 700 million recommendations per day. For streaming and media platforms, recommendation quality directly determines engagement, retention, and revenue. A 0.1% improvement in click-through rate at YouTube's scale translates to millions in additional ad revenue.

Content-based filtering recommends similar content: if you watched a sci-fi movie, here are more sci-fi movies. This creates filter bubbles and misses serendipitous recommendations. Collaborative filtering adds cross-user signal but struggles with cold-start content and cannot leverage rich content metadata.

Why flat ML fails

  • Filter bubbles: Content features (genre, tags) create narrow recommendation corridors. GNNs discover that documentary fans also enjoy historical dramas through cross-user paths.
  • Shallow engagement signal: Click-through rate is a noisy signal. The graph encodes richer engagement: completion rate, re-watches, shares, and saves carry different weight.
  • Content cold-start: New content has no interaction data. The graph connects new content to its creator, genre, and similar existing content, providing immediate relevance signals.
  • Temporal dynamics: Content relevance decays and trends emerge rapidly. Graph-based models capture trending patterns through rapid edge accumulation on new content.

The relational schema

schema.txt
Node types:
  User     (id, age_bucket, geo, platform)
  Content  (id, type, genre, duration, title_emb)
  Creator  (id, follower_count, genre_focus)
  Tag      (id, name, category)

Edge types:
  User    --[watched]-->     Content  (completion_pct, timestamp)
  User    --[liked]-->       Content  (timestamp)
  User    --[shared]-->      Content  (timestamp)
  Content --[created_by]-->  Creator
  Content --[has_tag]-->     Tag
  Content --[similar_to]-->  Content  (cosine_sim)

Multiple interaction types capture engagement depth. Creator and tag nodes enable cold-start recommendations for new content.

PyG architecture: heterogeneous SAGEConv

content_rec_model.py
import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv, HeteroConv, Linear

class ContentRecGNN(torch.nn.Module):
    def __init__(self, hidden_dim=128):
        super().__init__()
        self.user_lin = Linear(-1, hidden_dim)
        self.content_lin = Linear(-1, hidden_dim)
        self.creator_lin = Linear(-1, hidden_dim)
        self.tag_lin = Linear(-1, hidden_dim)

        self.conv1 = HeteroConv({
            ('user', 'watched', 'content'): SAGEConv(
                hidden_dim, hidden_dim),
            ('user', 'liked', 'content'): SAGEConv(
                hidden_dim, hidden_dim),
            ('user', 'shared', 'content'): SAGEConv(
                hidden_dim, hidden_dim),
            ('content', 'created_by', 'creator'): SAGEConv(
                hidden_dim, hidden_dim),
            ('content', 'has_tag', 'tag'): SAGEConv(
                hidden_dim, hidden_dim),
        }, aggr='sum')

        self.conv2 = HeteroConv({
            ('user', 'watched', 'content'): SAGEConv(
                hidden_dim, hidden_dim),
            ('user', 'liked', 'content'): SAGEConv(
                hidden_dim, hidden_dim),
            ('content', 'created_by', 'creator'): SAGEConv(
                hidden_dim, hidden_dim),
            ('content', 'has_tag', 'tag'): SAGEConv(
                hidden_dim, hidden_dim),
        }, aggr='sum')

    def encode(self, x_dict, edge_index_dict):
        x_dict['user'] = self.user_lin(x_dict['user'])
        x_dict['content'] = self.content_lin(x_dict['content'])
        x_dict['creator'] = self.creator_lin(x_dict['creator'])
        x_dict['tag'] = self.tag_lin(x_dict['tag'])

        # HeteroConv only returns embeddings for destination node types
        # ('content', 'creator', 'tag'), so merge each layer's output back
        # into x_dict to keep 'user' embeddings available for layer two.
        out1 = {k: F.relu(v) for k, v in
                self.conv1(x_dict, edge_index_dict).items()}
        x_dict = {**x_dict, **out1}
        x_dict = {**x_dict, **self.conv2(x_dict, edge_index_dict)}
        return x_dict

    def predict(self, user_emb, content_emb):
        return (user_emb * content_emb).sum(dim=-1)

Separate edge types for watched/liked/shared let the model weight engagement depth differently. Creator and tag connections enable cold-start recommendations.
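Training treats every observed interaction as a positive edge and samples random (user, content) pairs as negatives. A hedged sketch of the loss computation (the `link_pred_loss` helper and uniform negative sampling are illustrative choices, not from the original; real pipelines often use harder negatives):

```python
import torch
import torch.nn.functional as F

def link_pred_loss(user_emb, content_emb, pos_edge_index, num_content):
    """Binary cross-entropy over observed (positive) edges and uniformly
    sampled negative edges, scored by the same dot product as predict()."""
    src, dst = pos_edge_index
    pos_score = (user_emb[src] * content_emb[dst]).sum(dim=-1)

    # Negative sampling: same users paired with random content
    neg_dst = torch.randint(0, num_content, (src.size(0),))
    neg_score = (user_emb[src] * content_emb[neg_dst]).sum(dim=-1)

    scores = torch.cat([pos_score, neg_score])
    labels = torch.cat([torch.ones_like(pos_score),
                        torch.zeros_like(neg_score)])
    return F.binary_cross_entropy_with_logits(scores, labels)

# Usage with embeddings as produced by model.encode(...)
user_emb = torch.randn(100, 128)
content_emb = torch.randn(500, 128)
pos_edges = torch.stack([torch.randint(0, 100, (1_000,)),
                         torch.randint(0, 500, (1_000,))])
loss = link_pred_loss(user_emb, content_emb, pos_edges, num_content=500)
```

Backpropagating this loss trains the linear projections and both SAGEConv layers end to end.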

Expected performance

  • Content-based filtering: ~50 AUROC
  • LightGBM (flat-table): 62.44 AUROC
  • GNN (heterogeneous SAGEConv): 75.83 AUROC
  • KumoRFM (zero-shot): 76.71 AUROC

Or use KumoRFM in one line

KumoRFM PQL
PREDICT content_id FOR user
USING user, content, creator, interaction

One PQL query. KumoRFM handles content cold-start, engagement depth, and cross-genre discovery automatically.

Frequently asked questions

How do GNN recommendations differ from content-based filtering?

Content-based filtering recommends items similar to what a user has consumed, based on content features (genre, tags). GNNs combine content similarity with collaborative signal: they see that users who watched documentary A also watched drama B, even if A and B are dissimilar in content features. This enables cross-genre discovery that content-based filtering cannot achieve.

What interactions should be included in the user-content graph?

Views, completions, likes, shares, saves, and time spent. Each interaction type becomes a separate edge type, letting the model learn a different weight for each. A completion is a stronger signal than a view, and a share is stronger than a like. The heterogeneous graph captures engagement depth, not just clicks.
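When collapsing multiple interactions into a single training edge weight, deeper engagement should count more. A tiny sketch of one way to do this (the weight values and the `engagement_score` helper are assumptions for illustration, not from the original):

```python
# Illustrative per-interaction weights: deeper engagement counts more
WEIGHTS = {'view': 0.2, 'like': 0.5, 'save': 0.7, 'share': 1.0}

def engagement_score(events, completion_pct=0.0):
    """Combine all interactions for one (user, content) pair into a scalar.
    completion_pct in [0, 1] adds up to 1.0 on top of the event weights."""
    base = sum(WEIGHTS.get(e, 0.0) for e in events)
    return base + completion_pct

# A viewed, liked video watched 90% of the way through: 0.2 + 0.5 + 0.9
score = engagement_score(['view', 'like'], completion_pct=0.9)
```

The alternative shown in the model above, separate edge types per interaction, lets the GNN learn these weights instead of hand-tuning them.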

How do you handle the cold-start problem for new content?

New content nodes have content features (title embedding, genre, creator) but no interaction edges. GraphSAGE encodes these features through the creator and genre nodes, which are well-connected. A new video from a popular creator inherits relevance signal from the creator's existing audience graph.
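The inductive step can be illustrated with the mean-aggregator form of GraphSAGE: a brand-new content node's embedding is computed from its own features plus the mean of whatever neighbors it already has (its creator and tag nodes). All weights and features below are random placeholders for illustration:

```python
import torch

def sage_mean(x_self, x_neighbors, w_self, w_neigh):
    # GraphSAGE mean aggregator: transform the node's own features and the
    # mean of its neighbors' features, then combine with a nonlinearity
    agg = x_neighbors.mean(dim=0)
    return torch.relu(x_self @ w_self + agg @ w_neigh)

dim = 128
w_self, w_neigh = torch.randn(dim, dim), torch.randn(dim, dim)

new_content = torch.randn(dim)          # title_emb, genre, duration: known at upload
creator_and_tags = torch.randn(3, dim)  # creator node + two tag nodes, no user edges yet
emb = sage_mean(new_content, creator_and_tags, w_self, w_neigh)
```

Because the same learned weights apply to any node, no retraining is needed when new content arrives; it gets an embedding the moment it is connected to its creator and tags.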

How do you balance exploration vs exploitation in GNN recommendations?

Add diversity by sampling from the full embedding space rather than always returning the top-K nearest neighbors. Use epsilon-greedy: with probability epsilon, sample from a broader set of candidates based on graph distance. The GNN provides the exploitation signal; the sampling strategy provides exploration.
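A minimal epsilon-greedy slate sketch (the `recommend` helper and the pool size are illustrative assumptions): each slot is usually filled from the top-k by GNN score, but with probability epsilon it is filled from a broader candidate pool instead.

```python
import random

def recommend(scores, k=10, epsilon=0.1, pool=100):
    """Epsilon-greedy slate: mostly the top-k by predicted score (exploit),
    occasionally a random item from a broader candidate pool (explore)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    candidates = ranked[:pool]
    slate = []
    for item in candidates:
        if len(slate) == k:
            break
        if random.random() < epsilon:
            # Explore: swap in a random candidate not already on the slate
            item = random.choice([c for c in candidates if c not in slate])
        if item not in slate:
            slate.append(item)
    return slate

scores = {f'content_{i}': random.random() for i in range(200)}
slate = recommend(scores, k=5, epsilon=0.2)
```

With `epsilon=0` this reduces to plain top-k retrieval; raising epsilon trades a little short-term score for broader graph coverage.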

Can KumoRFM generate content recommendations?

Yes. KumoRFM takes your content database (users, content, interactions, creators) and generates personalized recommendations with one PQL query. It handles content cold-start, engagement depth, and cross-genre discovery automatically.

Learn more about graph ML

PyTorch Geometric is the open-source foundation for graph neural networks. Explore more layers, concepts, and production patterns.