
Yelp: Multi-Label Classification on a 716K-Node Business Graph

Yelp is a business review network of 716,847 nodes, each carrying 100 binary labels that can be active simultaneously. It is one of the few large-scale GNN benchmarks that tests multi-label classification -- a task that better reflects real business problems, where entities belong to multiple categories at once.


TL;DR

  • Yelp has 716,847 business nodes, 13,954,819 edges, 300 features, and 100 binary labels per node. It tests multi-label classification, not single-label.
  • Multi-label means each business can have multiple categories. This requires sigmoid outputs, binary cross-entropy loss, and F1 evaluation -- different from standard single-label benchmarks.
  • At 716K nodes, Yelp is roughly 3x larger than Reddit by node count. Neighbor sampling is required for GPU training.
  • Multi-label classification is closer to real business problems (a customer can be in multiple segments, a transaction can trigger multiple risk flags). Yelp is one of the few benchmarks that tests this.

Nodes: 716,847
Edges: 13.9M
Features: 300
Labels: 100 (multi-label)

What Yelp contains

Yelp is a business review network derived from the Yelp platform. Each of the 716,847 nodes represents a business. Edges (13,954,819) connect businesses that share reviewers or similar attributes. Node features are 300-dimensional vectors. Each business has 100 binary labels indicating which categories it belongs to (e.g., Restaurant, Italian, Delivery, Late Night).

The defining characteristic is multi-label classification. Unlike Cora (where each paper is exactly one category) or Reddit (where each post is exactly one subreddit), Yelp businesses can belong to many categories simultaneously. A single restaurant might be labeled as Italian, Delivery, Casual Dining, and Late Night all at once.

Why Yelp matters

Most GNN benchmarks use single-label classification, which is a simplification of real business problems. In practice, entities almost always belong to multiple categories. A bank customer is simultaneously a high-net-worth individual, a mortgage holder, a credit card user, and a retirement saver. A product is simultaneously electronics, portable, rechargeable, and gift-appropriate. Multi-label prediction is the realistic task, and Yelp is one of the few benchmarks that tests it.

The combination of large scale (716K nodes, 14M edges) and multi-label complexity makes Yelp one of the most demanding standard benchmarks. Models must handle both computational scale and the richer output space of 100 simultaneous binary predictions per node.
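The architectural change this implies can be sketched with a minimal multi-label output head (a sketch, not PyG's reference implementation; the linear layer stands in for any GNN backbone, and the batch size is illustrative):

```python
import torch
import torch.nn as nn

# Illustrative sizes matching Yelp: 300-dim features, 100 binary labels
num_features, num_labels = 300, 100

# Any GNN backbone could feed this head; a plain linear layer stands in here
head = nn.Linear(num_features, num_labels)

x = torch.randn(8, num_features)   # a mini-batch of 8 node embeddings
logits = head(x)                   # shape [8, 100]: one logit per label

# Multi-label: independent sigmoids + binary cross-entropy,
# NOT softmax + CrossEntropyLoss (which assumes exactly one class per node)
y = torch.randint(0, 2, (8, num_labels)).float()
loss = nn.BCEWithLogitsLoss()(logits, y)

probs = torch.sigmoid(logits)      # per-label probabilities in [0, 1]
pred = (probs > 0.5).float()       # threshold each label independently
```

The key point is the independence of the 100 outputs: each label gets its own sigmoid and its own binary decision, so a node can end up with zero, one, or many active categories.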

Loading Yelp in PyG

load_yelp.py
from torch_geometric.datasets import Yelp
from torch_geometric.loader import NeighborLoader

dataset = Yelp(root='/tmp/Yelp')
data = dataset[0]

print(f"Nodes: {data.num_nodes}")   # 716847
print(f"Edges: {data.num_edges}")   # 13954819
print(f"Labels shape: {data.y.shape}")  # [716847, 100]

# Mini-batch training with neighbor sampling
loader = NeighborLoader(
    data, num_neighbors=[15, 10],
    batch_size=2048, input_nodes=data.train_mask,
)

Note data.y shape: [N, 100] binary matrix. Use BCEWithLogitsLoss, not CrossEntropyLoss.
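A minimal training step over such batches might look like the following. This is a sketch with synthetic stand-in batches rather than the real NeighborLoader output (so it runs without downloading the dataset); the only Yelp-specific points are the [batch, 100] target matrix and BCEWithLogitsLoss:

```python
import torch
import torch.nn as nn

# Stand-in for mini-batches from NeighborLoader: each batch carries node
# features and a [batch_size, 100] binary label matrix, as in Yelp
def synthetic_batches(n_batches=3, batch_size=32, num_features=300, num_labels=100):
    for _ in range(n_batches):
        x = torch.randn(batch_size, num_features)
        y = torch.randint(0, 2, (batch_size, num_labels)).float()
        yield x, y

# Placeholder model; in practice this would be a GNN over the sampled subgraph
model = nn.Sequential(nn.Linear(300, 128), nn.ReLU(), nn.Linear(128, 100))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()  # multi-label loss; CrossEntropyLoss would be wrong here

losses = []
for x, y in synthetic_batches():
    opt.zero_grad()
    loss = loss_fn(model(x), y)   # logits vs [batch, 100] binary targets
    loss.backward()
    opt.step()
    losses.append(loss.item())
```

With the real loader, each batch is a subgraph object, so the forward pass would take `batch.x` and `batch.edge_index` and the loss would be computed on the seed nodes only.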

Original Paper

GraphSAINT: Graph Sampling Based Inductive Learning Method

Hanqing Zeng, Hongkuan Zhou, Ajitesh Srivastava, Rajgopal Kannan, Viktor Prasanna. ICLR 2020

Read paper →

Benchmark comparison (standard split, micro-F1)

Method     | Micro-F1 | Year | Paper
GraphSAGE  | ~0.634   | 2017 | Hamilton et al.
GraphSAINT | ~0.653   | 2020 | Zeng et al.
ClusterGCN | ~0.609   | 2019 | Chiang et al.
GAT        | ~0.647   | 2018 | Veličković et al.

Which large-scale social dataset should I use?

Flickr (89K nodes, single-label) tests noise robustness at medium scale. Reddit (232K nodes, 41 single-label classes) is the standard for scalable single-label classification. Yelp (716K nodes, 100 multi-labels) is one of the few large-scale multi-label benchmarks -- use it when your task involves overlapping categories. If your production problem has multi-label outputs (customer segments, risk flags, product tags), Yelp is the most relevant benchmark.

Common tasks and benchmarks

The task is multi-label node classification evaluated by micro-averaged F1 score. The dataset provides standard train/val/test masks. GraphSAGE achieves a micro-F1 around 0.63. More advanced methods (GAT with multi-label heads, graph transformers) can exceed 0.65. The key architectural change from single-label to multi-label is replacing the softmax output with independent sigmoid activations and binary cross-entropy loss.
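Micro-averaged F1 pools true positives, false positives, and false negatives across every (node, label) cell before computing precision and recall. A small pure-Python sketch of the metric (the toy labels below are illustrative):

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP/FP/FN over every (node, label) cell."""
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1 and t == 0:
                fp += 1
            elif p == 0 and t == 1:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Two nodes, three labels each (toy example)
y_true = [[1, 0, 1], [0, 1, 1]]
y_pred = [[1, 0, 0], [0, 1, 1]]
print(micro_f1(y_true, y_pred))  # ≈ 0.857 (= 6/7)
```

In practice `sklearn.metrics.f1_score(y_true, y_pred, average='micro')` computes the same quantity; the point of the sketch is that every label slot counts equally, so frequent labels dominate the score.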

Data source

The Yelp graph dataset was introduced in the GraphSAINT paper and is available from the GraphSAINT GitHub repository. PyG downloads the processed version automatically.

BibTeX citation

yelp.bib
@inproceedings{zeng2020graphsaint,
  title={GraphSAINT: Graph Sampling Based Inductive Learning Method},
  author={Zeng, Hanqing and Zhou, Hongkuan and Srivastava, Ajitesh and Kannan, Rajgopal and Prasanna, Viktor},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2020}
}

Cite Zeng et al. for the Yelp graph benchmark.

Example: multi-segment customer classification

A financial services company wants to segment customers for targeted marketing. Each customer belongs to multiple segments simultaneously: high-income, homeowner, frequent traveler, tech-savvy, retirement planner. The transaction and interaction graph provides context: a customer's connections to merchants, products, and other customers reveal their segment memberships. This is exactly Yelp's multi-label classification task mapped to a financial domain.

From benchmark to production

Production multi-label problems have additional complexity beyond what Yelp captures. Label hierarchies (Italian is a subset of Restaurant), label correlations (Late Night and Bar often co-occur), and label imbalance (rare categories have few positive examples) all affect model design. Production graphs also have temporal dynamics: a business's categories can change over time.
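One common mitigation for label imbalance is per-label positive weighting in the loss. A sketch using the `pos_weight` argument of BCEWithLogitsLoss (the frequency estimates below are random placeholders, not values measured on Yelp):

```python
import torch
import torch.nn as nn

num_labels = 100

# Suppose per-label positive frequencies were estimated from the training
# split; here random placeholders stand in (clamped to avoid division by zero)
pos_frequency = torch.rand(num_labels).clamp(min=0.01)

# Rare labels get a larger weight on their positive examples
pos_weight = (1 - pos_frequency) / pos_frequency

loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.randn(16, num_labels)
targets = torch.randint(0, 2, (16, num_labels)).float()
loss = loss_fn(logits, targets)
```

This keeps the loss per-label independent while pushing the model to not ignore categories with few positive examples; label hierarchies and correlations need modeling choices beyond the loss function.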

Frequently asked questions

What is the Yelp dataset in PyTorch Geometric?

Yelp is a business review network with 716,847 nodes (businesses) connected by 13,954,819 edges (shared reviewers or similar attributes). Each node has 300-dimensional features and 100 binary labels indicating business types. The task is multi-label classification.

What makes Yelp different from other GNN benchmarks?

Yelp is a multi-label dataset: each business can belong to multiple categories simultaneously (e.g., 'Restaurant', 'Italian', 'Delivery'). Most GNN benchmarks use single-label classification. Multi-label tasks require sigmoid outputs instead of softmax and use different loss functions (binary cross-entropy).

How do I load Yelp in PyTorch Geometric?

Use `from torch_geometric.datasets import Yelp; dataset = Yelp(root='/tmp/Yelp')`. The dataset is large (~1.5GB download) and requires neighbor sampling for GPU training due to its 716K nodes and 14M edges.

What metric should I use for Yelp?

Use micro-averaged F1 score, not accuracy. Multi-label classification requires F1 because accuracy is misleading when a node can have multiple correct labels. Micro-F1 gives equal weight to each label prediction across all nodes.

How large is the Yelp dataset compared to Reddit?

Yelp has 3x more nodes than Reddit (716K vs 232K) but fewer edges (14M vs 114M). This means Yelp is larger but sparser. The combination of large scale and multi-label classification makes it one of the most challenging standard GNN benchmarks.

Learn more about graph ML

PyTorch Geometric is the open-source foundation for graph neural networks. Explore more layers, concepts, and production patterns.