The business problem
Netflix estimates that its recommendation system is worth $1 billion per year in retained subscribers. YouTube serves over 700 million recommendations per day. For streaming and media platforms, recommendation quality directly determines engagement, retention, and revenue. A 0.1% improvement in click-through rate at YouTube's scale translates to millions in additional ad revenue.
Content-based filtering recommends similar content: if you watched a sci-fi movie, here are more sci-fi movies. This creates filter bubbles and misses serendipitous recommendations. Collaborative filtering adds cross-user signal but struggles with cold-start content and cannot leverage rich content metadata.
Why flat ML fails
- Filter bubbles: Content features (genre, tags) create narrow recommendation corridors. GNNs discover that documentary fans also enjoy historical dramas through cross-user paths.
- Shallow engagement signal: Click-through rate is a noisy signal. The graph encodes richer engagement: completion rate, re-watches, shares, and saves carry different weight.
- Content cold-start: New content has no interaction data. The graph connects new content to its creator, genre, and similar existing content, providing immediate relevance signals.
- Temporal dynamics: Content relevance decays and trends emerge rapidly. Graph-based models capture trending patterns through rapid edge accumulation on new content.
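The engagement-depth point above can be made concrete with a toy scoring function. The weights below are illustrative assumptions for this sketch, not values from the text:

```python
# Toy engagement score: deeper interactions get higher weight.
# The weight values here are illustrative assumptions.
ENGAGEMENT_WEIGHTS = {
    'click': 0.1,
    'watch': 1.0,    # scaled by completion percentage
    'rewatch': 1.5,
    'share': 2.0,
    'save': 2.5,
}

def engagement_score(events):
    """Sum weighted interactions for one (user, content) pair.

    events: list of (event_type, completion_pct) tuples, where
    completion_pct is only meaningful for 'watch' events.
    """
    score = 0.0
    for event_type, completion_pct in events:
        w = ENGAGEMENT_WEIGHTS[event_type]
        if event_type == 'watch':
            w *= completion_pct  # partial watches count less
        score += w
    return score

# A completed watch plus a share outweighs ten raw clicks.
print(engagement_score([('watch', 1.0), ('share', None)]))  # 3.0
```

This is the intuition a GNN learns implicitly when watched, liked, and shared are modeled as separate edge types rather than collapsed into a single click signal.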
The relational schema
Node types:
User (id, age_bucket, geo, platform)
Content (id, type, genre, duration, title_emb)
Creator (id, follower_count, genre_focus)
Tag (id, name, category)
Edge types:
User --[watched]--> Content (completion_pct, timestamp)
User --[liked]--> Content (timestamp)
User --[shared]--> Content (timestamp)
Content --[created_by]--> Creator
Content --[has_tag]--> Tag
Content --[similar_to]--> Content (cosine_sim)

Multiple interaction types capture engagement depth. Creator and tag nodes enable cold-start recommendations for new content.
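The schema can be written down as data before any modeling. A minimal sketch using plain Python dicts, with the node and edge types taken from the schema above (the validation helper is hypothetical):

```python
# Node types and their attributes, as listed in the schema.
NODE_TYPES = {
    'user': ['id', 'age_bucket', 'geo', 'platform'],
    'content': ['id', 'type', 'genre', 'duration', 'title_emb'],
    'creator': ['id', 'follower_count', 'genre_focus'],
    'tag': ['id', 'name', 'category'],
}

# (source, relation, target) triples with their edge attributes.
EDGE_TYPES = {
    ('user', 'watched', 'content'): ['completion_pct', 'timestamp'],
    ('user', 'liked', 'content'): ['timestamp'],
    ('user', 'shared', 'content'): ['timestamp'],
    ('content', 'created_by', 'creator'): [],
    ('content', 'has_tag', 'tag'): [],
    ('content', 'similar_to', 'content'): ['cosine_sim'],
}

def validate_schema():
    """Check that every edge endpoint is a declared node type."""
    for src, rel, dst in EDGE_TYPES:
        assert src in NODE_TYPES, f'unknown source type: {src}'
        assert dst in NODE_TYPES, f'unknown target type: {dst}'

validate_schema()
```

The same (source, relation, target) triples reappear as the keys of the HeteroConv layers in the model below, which is what makes the schema-to-architecture mapping mechanical.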
PyG architecture: heterogeneous SAGEConv
import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv, HeteroConv, Linear


class ContentRecGNN(torch.nn.Module):
    def __init__(self, hidden_dim=128):
        super().__init__()
        # Lazy linear layers (in_features=-1) project each node
        # type's raw features to a shared hidden dimension.
        self.user_lin = Linear(-1, hidden_dim)
        self.content_lin = Linear(-1, hidden_dim)
        self.creator_lin = Linear(-1, hidden_dim)
        self.tag_lin = Linear(-1, hidden_dim)
        # Reverse edges (e.g. added via T.ToUndirected()) are
        # required: without an edge type pointing INTO 'user',
        # user nodes receive no messages and HeteroConv drops
        # their embeddings after the first layer.
        self.conv1 = HeteroConv({
            ('user', 'watched', 'content'): SAGEConv(hidden_dim, hidden_dim),
            ('user', 'liked', 'content'): SAGEConv(hidden_dim, hidden_dim),
            ('user', 'shared', 'content'): SAGEConv(hidden_dim, hidden_dim),
            ('content', 'rev_watched', 'user'): SAGEConv(hidden_dim, hidden_dim),
            ('content', 'created_by', 'creator'): SAGEConv(hidden_dim, hidden_dim),
            ('content', 'has_tag', 'tag'): SAGEConv(hidden_dim, hidden_dim),
        }, aggr='sum')
        self.conv2 = HeteroConv({
            ('user', 'watched', 'content'): SAGEConv(hidden_dim, hidden_dim),
            ('user', 'liked', 'content'): SAGEConv(hidden_dim, hidden_dim),
            ('content', 'rev_watched', 'user'): SAGEConv(hidden_dim, hidden_dim),
            ('content', 'created_by', 'creator'): SAGEConv(hidden_dim, hidden_dim),
            ('content', 'has_tag', 'tag'): SAGEConv(hidden_dim, hidden_dim),
        }, aggr='sum')

    def encode(self, x_dict, edge_index_dict):
        x_dict = {
            'user': self.user_lin(x_dict['user']),
            'content': self.content_lin(x_dict['content']),
            'creator': self.creator_lin(x_dict['creator']),
            'tag': self.tag_lin(x_dict['tag']),
        }
        x_dict = {k: F.relu(v)
                  for k, v in self.conv1(x_dict, edge_index_dict).items()}
        x_dict = self.conv2(x_dict, edge_index_dict)
        return x_dict

    def predict(self, user_emb, content_emb):
        # Recommendation score is the dot product of the
        # user and content embeddings.
        return (user_emb * content_emb).sum(dim=-1)

Separate edge types for watched/liked/shared let the model weight engagement depth differently. Creator and tag connections enable cold-start recommendations.
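At serving time, the dot-product scorer reduces recommendation to a nearest-neighbor lookup over precomputed content embeddings. A minimal NumPy sketch of top-k retrieval (array shapes, k, and the random data are illustrative):

```python
import numpy as np

def top_k_contents(user_emb, content_embs, k=3):
    """Rank contents by dot product with the user embedding.

    user_emb: (d,) array; content_embs: (n, d) array.
    Returns indices of the k highest-scoring contents.
    """
    scores = content_embs @ user_emb  # (n,) dot products
    return np.argsort(-scores)[:k]   # indices, descending by score

# Illustrative random embeddings standing in for GNN output.
rng = np.random.default_rng(0)
user = rng.normal(size=8)
contents = rng.normal(size=(10, 8))
print(top_k_contents(user, contents))
```

In production this brute-force scan would typically be replaced by an approximate nearest-neighbor index, but the scoring function is identical to the model's predict method.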
Expected performance
- Content-based filtering: ~50 AUROC (near random)
- LightGBM (flat-table): 62.44 AUROC
- GNN (heterogeneous SAGEConv): 75.83 AUROC
- KumoRFM (zero-shot): 76.71 AUROC
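The gaps in the list above read as absolute AUROC lifts; a quick check of the arithmetic:

```python
# AUROC figures from the comparison above.
results = {
    'content_based': 50.0,
    'lightgbm': 62.44,
    'gnn': 75.83,
    'kumorfm': 76.71,
}

# Graph structure over the flat-table baseline, in AUROC points.
gnn_lift = results['gnn'] - results['lightgbm']
# Zero-shot KumoRFM over the trained GNN.
rfm_lift = results['kumorfm'] - results['gnn']

print(f'GNN vs LightGBM: +{gnn_lift:.2f} AUROC points')
print(f'KumoRFM vs GNN:  +{rfm_lift:.2f} AUROC points')
```

Most of the headroom comes from modeling the graph at all; the zero-shot model then edges past the hand-built GNN without any task-specific training.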
Or use KumoRFM in one line
PREDICT content_id FOR user
USING user, content, creator, interaction

One PQL query. KumoRFM handles content cold-start, engagement depth, and cross-genre discovery automatically.