
Flickr: Testing GNNs on Noisy Social Graph Structure

Flickr is a social network of 89,250 images connected by shared metadata. Unlike citation or co-purchase graphs, Flickr's edges are noisy -- sharing a location does not mean sharing a category. This tests whether your GNN can extract signal from imperfect structure.


TL;DR

  • Flickr has 89,250 image nodes, 899,756 edges (shared location/gallery/comments), 500 features, and 7 classes -- a medium-scale benchmark between Planetoid and Reddit.
  • Edges are based on metadata similarity, not semantic similarity. This introduces noise that lowers accuracy across all GNN methods compared to cleaner benchmarks.
  • GCN achieves ~53%, GraphSAGE ~50%. The lower numbers reflect graph noise, not poor models; Flickr tests robustness to imperfect graph structure.
  • Real-world graphs are noisy. Flickr is one of the few benchmarks that exposes this challenge, making it valuable for evaluating production readiness.

89,250 nodes · 899,756 edges · 500 features · 7 classes

What Flickr contains

Flickr is a social network built from the Flickr image-sharing platform. Each of the 89,250 nodes represents an image. Edges connect images that share common metadata: the same geographic location, the same gallery, or comments from the same user. Node features are 500-dimensional vectors from image descriptions. The task is to classify images into 7 categories based on their content type.

The key characteristic of Flickr is edge noise. Two images taken at the same tourist landmark may depict completely different subjects (a landscape vs. a portrait). Unlike citations (which reflect topical similarity) or co-purchases (which reflect intent similarity), Flickr's edges carry weaker semantic signal.
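The metadata-based edge construction can be illustrated with a toy sketch. The image names, metadata fields, and `build_edges` helper below are hypothetical, chosen only to show how two images at the same landmark end up connected despite depicting different subjects:

```python
# Hypothetical sketch of how Flickr-style metadata edges arise: images are
# linked if they share a location or a gallery, regardless of subject matter.
from itertools import combinations

images = {
    "img_a": {"location": "brandenburg_gate", "gallery": "travel", "subject": "landscape"},
    "img_b": {"location": "brandenburg_gate", "gallery": "portraits", "subject": "portrait"},
    "img_c": {"location": "tiergarten", "gallery": "travel", "subject": "landscape"},
}

def build_edges(imgs):
    """Connect image pairs that share any metadata field (location or gallery)."""
    edges = []
    for a, b in combinations(imgs, 2):
        if (imgs[a]["location"] == imgs[b]["location"]
                or imgs[a]["gallery"] == imgs[b]["gallery"]):
            edges.append((a, b))
    return edges

edges = build_edges(images)
# img_a and img_b share a location but depict different subjects: a noisy edge.
print(edges)
```

The img_a/img_b edge is exactly the kind of structurally present but semantically weak connection that makes Flickr harder than citation graphs.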

Why Flickr matters

Most GNN benchmarks have clean graph structure: citations connect related papers, co-purchases connect related products. Real-world graphs are messier. Customer transaction graphs include routine purchases alongside meaningful ones. Social graphs include bot connections alongside genuine relationships. Flickr is one of the few benchmarks that captures this noise.

The practical lesson: accuracy on Flickr predicts production robustness better than accuracy on Cora. A model that handles Flickr's noisy edges well will likely handle the noise in enterprise graphs. Attention-based methods (GAT, TransformerConv) have an advantage here because they can learn to down-weight noisy edges.
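The down-weighting idea can be sketched without any GNN library. This is a toy illustration, not GAT's actual scoring function (GAT uses a learned linear attention over transformed features); here, attention scores come from a plain dot-product similarity, and all feature vectors are made up:

```python
import math

def attention_weights(center, neighbors):
    """Softmax over dot-product similarity between a center node's features
    and each neighbor's features -- the mechanism by which attention-based
    GNN layers can learn to down-weight uninformative edges."""
    scores = [sum(c * n for c, n in zip(center, feats)) for feats in neighbors]
    exp_scores = [math.exp(s) for s in scores]
    total = sum(exp_scores)
    return [e / total for e in exp_scores]

center = [1.0, 0.0, 1.0]   # features of the image being classified
clean  = [0.9, 0.1, 0.8]   # semantically similar neighbor
noisy  = [0.0, 1.0, 0.1]   # same-location neighbor with unrelated content
w_clean, w_noisy = attention_weights(center, [clean, noisy])

# The semantically similar neighbor receives most of the attention mass.
print(round(w_clean, 3), round(w_noisy, 3))
```

A fixed-aggregation layer like GCN would average both neighbors with weights determined only by node degree, letting the noisy neighbor contribute just as much as the clean one.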

Loading Flickr in PyG

load_flickr.py
from torch_geometric.datasets import Flickr

dataset = Flickr(root='/tmp/Flickr')
data = dataset[0]

print(f"Nodes: {data.num_nodes}")        # 89250
print(f"Edges: {data.num_edges}")        # 899756
print(f"Features: {data.num_features}")  # 500
print(f"Classes: {dataset.num_classes}") # 7

Flickr provides standard train/val/test masks. The download is a moderate ~200MB.
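The mask pattern works the same way whether masks are Python lists or PyTorch boolean tensors. The sketch below uses synthetic masks over 10 nodes (the real Flickr masks cover all 89,250 nodes); in PyG you would index tensors the same way, e.g. `data.y[data.train_mask]`:

```python
# Toy sketch of PyG's boolean-mask split pattern, using synthetic masks.
num_nodes = 10
train_mask = [i < 5 for i in range(num_nodes)]       # first 5 nodes train
val_mask   = [5 <= i < 7 for i in range(num_nodes)]  # next 2 validate
test_mask  = [i >= 7 for i in range(num_nodes)]      # last 3 test

# A well-formed split: every node belongs to exactly one partition.
assert all(t + v + s == 1 for t, v, s in zip(train_mask, val_mask, test_mask))

def masked(values, mask):
    """Select the entries of `values` where `mask` is True."""
    return [v for v, m in zip(values, mask) if m]

labels = list(range(num_nodes))
print(len(masked(labels, train_mask)), len(masked(labels, test_mask)))
```

During training, the loss is computed only on train-mask nodes while message passing still runs over the full graph, so test nodes contribute structure but never labels.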

Original Paper

GraphSAINT: Graph Sampling Based Inductive Learning Method

Hanqing Zeng, Hongkuan Zhou, Ajitesh Srivastava, Rajgopal Kannan, Viktor Prasanna. ICLR 2020.


Benchmark comparison (standard split)

Method       Accuracy   Year   Paper
GraphSAGE    ~50.1%     2017   Hamilton et al.
GCN          ~53.4%     2017   Kipf & Welling
GAT          ~54.2%     2018   Velickovic et al.
GraphSAINT   ~51.1%     2020   Zeng et al.
ClusterGCN   ~48.1%     2019   Chiang et al.

Which medium-scale social dataset should I use?

Flickr (89K nodes, 7 classes) has noisy metadata-based edges -- use it to test robustness when graph structure is unreliable. Reddit (232K nodes, 41 classes) has clean co-comment edges and is the standard scalability benchmark. Yelp (716K nodes, 100 multi-labels) tests multi-label classification at scale. If you need to evaluate noise robustness, Flickr is the right choice. If you need clean scalability testing, pick Reddit.

Common tasks and benchmarks

Node classification with the standard split. GraphSAGE achieves ~50.1%, GraphSAINT ~51.1%, GCN ~53.4%, GAT ~54.2%. These numbers are notably lower than Reddit (95%+) despite having fewer classes (7 vs 41), confirming that graph noise is the primary difficulty. Models with attention mechanisms consistently outperform fixed-aggregation models here.

Data source

The Flickr dataset was introduced in the GraphSAINT paper and is available from the GraphSAINT GitHub repository. PyG downloads the processed version automatically.

BibTeX citation

flickr.bib
@inproceedings{zeng2020graphsaint,
  title={GraphSAINT: Graph Sampling Based Inductive Learning Method},
  author={Zeng, Hanqing and Zhou, Hongkuan and Srivastava, Ajitesh and Kannan, Rajgopal and Prasanna, Viktor},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2020}
}

Cite Zeng et al. for the Flickr dataset and the GraphSAINT sampling method.

Example: content moderation on social platforms

Social platforms must classify user-generated content at scale. Images, posts, and comments need categorization for feed ranking, ad placement, and content moderation. The graph structure (who interacts with what) provides context that individual content features miss. A borderline image posted in a photography community is different from the same image shared in a meme group. Flickr's noisy graph structure mirrors this challenge.

From benchmark to production

Production social graphs have billions of nodes with highly noisy edges. Users follow accounts they never engage with. Bot networks create artificial connections. Content goes viral across unrelated communities. Handling this noise at scale requires models that selectively attend to informative connections while ignoring noise.

Frequently asked questions

What is the Flickr dataset in PyTorch Geometric?

Flickr is a social network of 89,250 images from the Flickr platform. Edges (899,756) connect images that share common properties (same geographic location, gallery, or user comments). Features are 500-dimensional vectors from image descriptions. The task is to classify images into 7 categories.

How does Flickr compare to Reddit for GNN benchmarking?

Flickr is smaller than Reddit (89K vs 232K nodes, 900K vs 114M edges) but larger than Planetoid datasets. It serves as a medium-scale benchmark: large enough to require thoughtful training strategies but small enough to run on a single GPU without sampling.

How do I load Flickr in PyTorch Geometric?

Use `from torch_geometric.datasets import Flickr; dataset = Flickr(root='/tmp/Flickr')`. The dataset provides a standard train/val/test split via masks.

What accuracy should I expect on Flickr?

GraphSAGE achieves ~50.1% accuracy. GCN reaches ~53.4% with full-batch training. These numbers are lower than Reddit because Flickr's image categories are more ambiguous and the graph structure is noisier. Accuracy above 55% is competitive.

Why are Flickr accuracies lower than Reddit?

Flickr's edges are based on shared metadata (location, gallery), which is a noisier signal than Reddit's co-comment structure. Two images from the same location may have completely different content. This noise tests whether GNNs can extract meaningful signal from imperfect graph structure.
