Amazon Photo: A Compact Co-Purchase Graph for Fast GNN Experiments

Amazon Photo is a dense co-purchase network of 7,650 photography products. Smaller than Amazon Computers but structurally similar, it is ideal for quick iteration on GNN architectures for product recommendation tasks.

TL;DR

  • Amazon Photo has 7,650 product nodes, 238,162 co-purchase edges, 745-dimensional review features, and 8 product categories.
  • Dense connectivity (avg degree ~31) provides rich neighborhood signal. Most GNN architectures perform well when given this much co-purchase context.
  • Its compact size enables rapid experimentation. Use it as a fast development dataset before scaling to Amazon Computers or OGB-Products.
  • Co-purchase patterns in Photo mirror real recommendation challenges: learning product similarity from buying behavior.

Nodes: 7,650 | Edges: 238,162 | Features: 745 | Classes: 8

What Amazon Photo contains

Amazon Photo is a co-purchase network from Amazon's Photography product category. Each of the 7,650 nodes is a product (cameras, lenses, tripods, accessories). The 238,162 edges connect products frequently purchased together. Node features are 745-dimensional bag-of-words vectors derived from product reviews. The 8 classes represent product subcategories.

Like Amazon Computers, this graph is dense. The average degree is ~31, meaning each product has co-purchase connections to roughly 30 other products. Photography accessories cluster tightly: a camera body connects to lenses, memory cards, bags, and tripods, creating a rich local neighborhood for GNN aggregation.
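The ~31 figure can be checked with a line of arithmetic. The sketch below assumes that the 238,162 edge count stores each undirected co-purchase link once per direction, which is how PyG typically represents undirected graphs. Under that assumption, stored edges divided by nodes gives the average number of neighbors per product.

```python
# Dataset statistics quoted in this article.
num_nodes = 7_650
num_stored_edges = 238_162  # assumed: each undirected link is stored in both directions

# With both directions stored, stored edges per node equals neighbors per node.
avg_degree = num_stored_edges / num_nodes

# Number of distinct product pairs under the same assumption.
num_undirected_pairs = num_stored_edges // 2

print(f"Average degree: {avg_degree:.1f}")
print(f"Distinct co-purchase pairs: {num_undirected_pairs}")
```

If the edge count instead listed each pair only once, the average degree would be twice this value, so it is worth confirming the convention before comparing numbers across sources.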

Why Amazon Photo matters

Amazon Photo serves as the quick-iteration counterpart to Amazon Computers. When developing a GNN architecture for product recommendation, you want a fast development cycle. Photo trains in seconds and captures the same co-purchase patterns as larger product graphs. Once your model works on Photo, scale to Computers (13K nodes) and then OGB-Products (2.4M nodes) for production validation.

The photography domain also illustrates a specific recommendation challenge: complementary products. Cameras and lenses are complements, not substitutes. A good recommendation model must learn that a user who bought a Canon camera should be recommended Canon-compatible lenses, not a competing Nikon camera. Co-purchase edges encode this complementarity directly.

Loading Amazon Photo in PyG

load_amazon_photo.py
from torch_geometric.datasets import Amazon

# Downloads the processed dataset into `root` on first use, then loads it from disk.
dataset = Amazon(root='/tmp/Amazon', name='Photo')
data = dataset[0]  # the dataset holds a single graph

print(f"Nodes: {data.num_nodes}")        # 7650
print(f"Edges: {data.num_edges}")        # 238162
print(f"Features: {data.num_features}")  # 745
print(f"Classes: {dataset.num_classes}") # 8

The same `Amazon` class loads Computers; only the `name` argument changes. The dataset ships without a standard split, so create random train/val/test partitions yourself.
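Because no split ships with the dataset, a reproducible random partition is usually the first thing to write. The following is a minimal sketch in plain Python, assuming a 60/20/20 split and a fixed seed; the helper name `random_split` and its parameters are illustrative, not part of PyG. PyG's `torch_geometric.transforms` module also offers a `RandomNodeSplit` transform that attaches masks to the data object, which may be more convenient in practice.

```python
import random


def random_split(num_nodes, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle node indices and cut them into train/val/test lists."""
    rng = random.Random(seed)  # local RNG so the split is reproducible
    indices = list(range(num_nodes))
    rng.shuffle(indices)

    num_train = round(train_frac * num_nodes)
    num_val = round(val_frac * num_nodes)

    train_idx = indices[:num_train]
    val_idx = indices[num_train:num_train + num_val]
    test_idx = indices[num_train + num_val:]  # the remainder goes to test
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = random_split(num_nodes=7_650)
print(len(train_idx), len(val_idx), len(test_idx))
```

The returned index lists can be converted to tensors and used to index node labels and model outputs. Changing `seed` yields a different partition, which is useful when averaging results over several random splits.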

Original Paper

Pitfalls of Graph Neural Network Evaluation

Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, Stephan Günnemann (2018). NeurIPS 2018 Workshop on Relational Representation Learning.

Benchmark comparison (random 60/20/20 splits)

Method           Accuracy   Year   Paper
MLP (no graph)   ~78.5%     --     Baseline
GCN              ~91.2%     2017   Kipf & Welling
GAT              ~91.7%     2018   Veličković et al.
GraphSAGE        ~91.0%     2017   Hamilton et al.
GCNII            ~92.4%     2020   Chen et al.

Which Amazon co-purchase dataset should I use?

Amazon Photo (7,650 nodes, 8 classes) is the smallest and fastest -- ideal for prototyping and quick iteration. Amazon Computers (13,752 nodes, 10 classes) has nearly double the nodes and two more classes. OGB-Products (2.4M nodes, 47 classes) is the production-scale option with a standardized split. Start with Photo, graduate to Computers, and validate at scale on OGB-Products.

Common tasks and benchmarks

Node classification is the standard task: predict each product's subcategory from co-purchase structure and review features. With no canonical split, researchers typically use random 60/20/20 or 10/10/80 partitions. Most GNN architectures achieve 90%+ accuracy on favorable splits, thanks to the dense graph structure and relatively few categories.

Link prediction is a natural alternative task: given a partial co-purchase graph, predict which product pairs will be bought together. This directly mirrors the recommendation use case and is often more interesting than node classification for practitioners building recommendation systems.
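Setting up link prediction means hiding some known co-purchase pairs and pairing them with product pairs that were never bought together. The sketch below shows one way to do this in plain Python on a hypothetical toy graph; the helper names `split_edges` and `sample_negatives` are illustrative. PyG's `torch_geometric.transforms` module includes a `RandomLinkSplit` transform that covers similar ground on real data objects.

```python
import random


def split_edges(edges, test_frac=0.2, seed=0):
    """Hold out a fraction of undirected edges as positive test pairs."""
    rng = random.Random(seed)
    # Canonicalise each pair so (u, v) and (v, u) count as one edge.
    unique = sorted({(min(u, v), max(u, v)) for u, v in edges if u != v})
    rng.shuffle(unique)
    num_test = max(1, round(test_frac * len(unique)))
    return unique[num_test:], unique[:num_test]  # (train, test)


def sample_negatives(num_nodes, known_edges, num_samples, seed=0):
    """Draw node pairs that never appear as co-purchases (rejection sampling).

    Assumes far more non-edges than requested samples, as in a sparse graph;
    otherwise the loop would not terminate.
    """
    rng = random.Random(seed)
    known = {(min(u, v), max(u, v)) for u, v in known_edges}
    negatives = set()
    while len(negatives) < num_samples:
        u, v = rng.randrange(num_nodes), rng.randrange(num_nodes)
        pair = (min(u, v), max(u, v))
        if u != v and pair not in known:
            negatives.add(pair)
    return sorted(negatives)


# Hypothetical toy graph: 6 products, 7 co-purchase links.
toy_edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
train_pos, test_pos = split_edges(toy_edges)
test_neg = sample_negatives(6, toy_edges, num_samples=len(test_pos))
print(len(train_pos), len(test_pos), len(test_neg))
```

A model is then trained on `train_pos` only and scored on how well it ranks `test_pos` above `test_neg`. Negatives are sampled against the full edge list so that held-out positives are never mislabeled as negatives.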

Data source

The Amazon co-purchase datasets were introduced by Shchur et al. (2018) and are derived from the Amazon product metadata from SNAP. PyG downloads the processed version automatically.

BibTeX citation

amazon_photo.bib
@article{shchur2018pitfalls,
  title={Pitfalls of Graph Neural Network Evaluation},
  author={Shchur, Oleksandr and Mumme, Maximilian and Bojchevski, Aleksandar and G{\"u}nnemann, Stephan},
  journal={arXiv preprint arXiv:1811.05868},
  year={2018}
}

@inproceedings{mcauley2015image,
  title={Image-Based Recommendations on Styles and Substitutes},
  author={McAuley, Julian and Targett, Christopher and Shi, Qinfeng and van den Hengel, Anton},
  booktitle={SIGIR},
  year={2015}
}

Cite Shchur et al. for the benchmark, McAuley et al. for the original Amazon data.

Example: photography accessory bundles

An e-commerce platform wants to suggest product bundles. The co-purchase graph reveals which products naturally go together: a camera body, a 50mm lens, a memory card, and a carrying case form a natural bundle. GNN aggregation discovers these clusters by learning that tightly connected product neighborhoods represent complementary sets. Amazon Photo's co-purchase edges offer a compact dataset for prototyping this kind of task.
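To make "aggregation" concrete, here is a minimal sketch of one round of mean neighbor aggregation in plain Python. It is a simplification: a real GNN layer adds learned weights and a nonlinearity, and the product names and two-dimensional features below are hypothetical. The point is only that tightly connected products end up with similar representations after averaging.

```python
def mean_aggregate(features, neighbors):
    """One round of mean aggregation: average each node with its neighbors."""
    updated = {}
    for node, own in features.items():
        group = [own] + [features[n] for n in neighbors.get(node, [])]
        dim = len(own)
        updated[node] = [sum(vec[i] for vec in group) / len(group) for i in range(dim)]
    return updated


# Hypothetical products with made-up 2-dimensional features.
features = {
    "camera_body": [1.0, 0.0],
    "50mm_lens": [0.0, 1.0],
    "memory_card": [0.0, 0.0],
    "printer_ink": [5.0, 5.0],
}
# The first three products are all bought together; printer_ink is isolated.
neighbors = {
    "camera_body": ["50mm_lens", "memory_card"],
    "50mm_lens": ["camera_body", "memory_card"],
    "memory_card": ["camera_body", "50mm_lens"],
    "printer_ink": [],
}

smoothed = mean_aggregate(features, neighbors)
for name, vec in smoothed.items():
    print(name, [round(x, 3) for x in vec])
```

After one round, the three bundled products share the same averaged vector while the isolated product keeps its own, which is the clustering effect a trained model can exploit.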

From benchmark to production

Production product graphs include user nodes (who bought what), temporal ordering (recent purchases matter more), and multiple interaction types (view, click, purchase, return). Amazon Photo captures only the product-to-product co-purchase layer. A production system must integrate all these signals into a unified heterogeneous graph.

Frequently asked questions

What is the Amazon Photo dataset?

Amazon Photo is a co-purchase network of 7,650 photography products from Amazon. Edges (238,162) connect products frequently bought together. Features are 745-dimensional bag-of-words vectors from reviews. Products belong to 8 categories.

How does Amazon Photo compare to Amazon Computers?

Amazon Photo is smaller (7,650 vs 13,752 nodes) with fewer edges (238,162 vs 491,722) and fewer categories (8 vs 10). Both are dense co-purchase graphs: dividing edges by nodes gives an average degree of ~31 for Photo and ~36 for Computers. Photo trains faster and suits quick experiments.

How do I load Amazon Photo in PyTorch Geometric?

Use `from torch_geometric.datasets import Amazon; dataset = Amazon(root='/tmp/Amazon', name='Photo')`. Same API as Amazon Computers, just change the name.

What tasks can I perform on Amazon Photo?

The standard task is node classification (predict product category). You can also use it for link prediction (predict co-purchase relationships), graph representation learning, and testing GNN-based recommendation approaches on a compact dataset.

Is Amazon Photo too small for meaningful experiments?

Amazon Photo is a good starting point for co-purchase experiments because it trains fast and captures the same structural patterns as larger product graphs. For production-relevant scale, graduate to OGB-Products (2.4M nodes) or industry-specific datasets.
