716,847 Nodes · 13.9M Edges · 300 Features · 100 Labels (multi-label)
What Yelp contains
Yelp is a business review network derived from the Yelp platform. Each of the 716,847 nodes represents a business. Edges (13,954,819) connect businesses that share reviewers or similar attributes. Node features are 300-dimensional vectors. Each business has 100 binary labels indicating which categories it belongs to (e.g., Restaurant, Italian, Delivery, Late Night).
The defining characteristic is multi-label classification. Unlike Cora (where each paper is exactly one category) or Reddit (where each post is exactly one subreddit), Yelp businesses can belong to many categories simultaneously. A single restaurant might be labeled as Italian, Delivery, Casual Dining, and Late Night all at once.
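The representational difference is easy to state in code. A minimal sketch (toy values; the category indices are hypothetical, not the dataset's actual label order):

```python
# Single-label (Cora-style): one class index per node.
cora_style_target = 3                  # exactly one category

# Multi-label (Yelp-style): a 100-dim binary (multi-hot) vector per node.
NUM_LABELS = 100
yelp_style_target = [0] * NUM_LABELS
for category in (7, 12, 55, 91):       # hypothetical indices standing in for
    yelp_style_target[category] = 1    # Italian, Delivery, Casual Dining, Late Night

assert sum(yelp_style_target) == 4     # one business, four simultaneous labels
```

The model's output layer must match this shape: 100 independent binary decisions per node rather than one softmax over mutually exclusive classes.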
Why Yelp matters
Most GNN benchmarks use single-label classification, which is a simplification of real business problems. In practice, entities almost always belong to multiple categories. A bank customer is simultaneously a high-net-worth individual, a mortgage holder, a credit card user, and a retirement saver. A product is simultaneously electronics, portable, rechargeable, and gift-appropriate. Multi-label prediction is the realistic task, and Yelp is one of the few benchmarks that tests it.
The combination of large scale (716K nodes, 14M edges) and multi-label complexity makes Yelp one of the most demanding standard benchmarks. Models must handle both computational scale and the richer output space of 100 simultaneous binary predictions per node.
Loading Yelp in PyG
```python
from torch_geometric.datasets import Yelp
from torch_geometric.loader import NeighborLoader

dataset = Yelp(root='/tmp/Yelp')
data = dataset[0]

print(f"Nodes: {data.num_nodes}")        # 716847
print(f"Edges: {data.num_edges}")        # 13954819
print(f"Labels shape: {data.y.shape}")   # [716847, 100]

# Mini-batch training with neighbor sampling
loader = NeighborLoader(
    data,
    num_neighbors=[15, 10],
    batch_size=2048,
    input_nodes=data.train_mask,
)
```

Note that `data.y` is an `[N, 100]` binary matrix: use `BCEWithLogitsLoss`, not `CrossEntropyLoss`.
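To see why `BCEWithLogitsLoss` is the right objective, here is a dependency-free sketch of binary cross-entropy over logits: each label contributes an independent binary term instead of competing in a softmax. The function name is ours; the numerically stable identity is the same one PyTorch's `BCEWithLogitsLoss` uses.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over independent labels.

    Stable form per element: max(x, 0) - x*y + log(1 + exp(-|x|)).
    """
    total = 0.0
    for x, y in zip(logits, targets):
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)

# One business scored over 4 (of 100) categories:
logits  = [2.0, -1.5, 0.3, -3.0]   # raw model outputs, one per label
targets = [1.0, 0.0, 1.0, 0.0]     # multi-hot ground truth

loss = bce_with_logits(logits, targets)
```

Because each label gets its own sigmoid, the model can assign high probability to several categories at once, which a softmax forbids.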
Original Paper
GraphSAINT: Graph Sampling Based Inductive Learning Method
Hanqing Zeng, Hongkuan Zhou, Ajitesh Srivastava, Rajgopal Kannan, Viktor Prasanna. ICLR 2020.
Benchmark comparison (standard split, micro-F1)
| Method | Micro-F1 | Year | Paper |
|---|---|---|---|
| GraphSAGE | ~0.634 | 2017 | Hamilton et al. |
| GAT | ~0.647 | 2018 | Veličković et al. |
| ClusterGCN | ~0.609 | 2019 | Chiang et al. |
| GraphSAINT | ~0.653 | 2020 | Zeng et al. |
Which large-scale social dataset should I use?
Flickr (89K nodes, single-label) tests noise robustness at medium scale. Reddit (232K nodes, 41 single-label classes) is the standard for scalable single-label classification. Yelp (716K nodes, 100 labels per node) is the only large-scale multi-label benchmark of the three; use it when your task involves overlapping categories. If your production problem has multi-label outputs (customer segments, risk flags, product tags), Yelp is the most relevant benchmark.
Common tasks and benchmarks
The task is multi-label node classification evaluated by micro-averaged F1 score. The dataset provides standard train/val/test masks. GraphSAGE achieves a micro-F1 around 0.63. More advanced methods (GAT with multi-label heads, graph transformers) can exceed 0.65. The key architectural change from single-label to multi-label is replacing the softmax output with independent sigmoid activations and binary cross-entropy loss.
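Micro-averaged F1 pools true positives, false positives, and false negatives across every node and every label before computing precision and recall. A minimal reference implementation on toy data (in practice you would run this on thresholded sigmoid outputs against `data.y`):

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP/FP/FN across all nodes and all labels."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1 and t == 0:
                fp += 1
            elif p == 0 and t == 1:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Two nodes, four labels (toy stand-in for the real [716847, 100] matrix):
y_true = [[1, 0, 1, 0], [0, 1, 1, 0]]
y_pred = [[1, 0, 0, 0], [0, 1, 1, 1]]
print(micro_f1(y_true, y_pred))  # 0.75 (tp=3, fp=1, fn=1)
```

Micro-averaging weights every label decision equally, so frequent categories dominate the score; macro-F1 would instead average per-label F1 and expose performance on rare categories.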
Data source
The Yelp graph dataset was introduced in the GraphSAINT paper and is available from the GraphSAINT GitHub repository. PyG downloads the processed version automatically.
BibTeX citation
```
@inproceedings{zeng2020graphsaint,
  title={GraphSAINT: Graph Sampling Based Inductive Learning Method},
  author={Zeng, Hanqing and Zhou, Hongkuan and Srivastava, Ajitesh and Kannan, Rajgopal and Prasanna, Viktor},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2020}
}
```

Cite Zeng et al. for the Yelp graph benchmark.
Example: multi-segment customer classification
A financial services company wants to segment customers for targeted marketing. Each customer belongs to multiple segments simultaneously: high-income, homeowner, frequent traveler, tech-savvy, retirement planner. The transaction and interaction graph provides context: a customer's connections to merchants, products, and other customers reveal their segment memberships. This is exactly Yelp's multi-label classification task mapped to a financial domain.
From benchmark to production
Production multi-label problems have additional complexity beyond what Yelp captures. Label hierarchies (Italian is a subset of Restaurant), label correlations (Late Night and Bar often co-occur), and label imbalance (rare categories have few positive examples) all affect model design. Production graphs also have temporal dynamics: a business's categories can change over time.
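Label correlations of the kind mentioned above can be measured directly from any multi-hot label matrix. A minimal sketch on toy data (the helper name and the category assignments are illustrative; in practice you would run this over `data.y`):

```python
def cooccurrence(y):
    """Count how often each pair of labels appears on the same node."""
    n_labels = len(y[0])
    counts = [[0] * n_labels for _ in range(n_labels)]
    for row in y:
        active = [j for j, v in enumerate(row) if v == 1]
        for a in active:
            for b in active:
                counts[a][b] += 1
    return counts

# Toy multi-hot labels; columns = (Bar, Late Night, Bakery)
y = [
    [1, 1, 0],   # a bar open late
    [1, 1, 0],   # another one: Bar and Late Night co-occur again
    [0, 0, 1],   # a bakery
]
co = cooccurrence(y)
print(co[0][1])  # 2: Bar and Late Night co-occur on two nodes
print(co[0][2])  # 0: Bar and Bakery never co-occur
```

High co-occurrence counts suggest the independent-sigmoid assumption is leaving signal on the table; architectures that model label dependencies can exploit it.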