
Smarter Adaptive Graph Sampling
Why uniform neighbor sampling wastes compute on sparse nodes, how metapath-aware redistribution fixes it, and what the benchmark results show.
Why Graph Sampling Matters
Real-world graph datasets and relational databases can contain billions of nodes, far more than fit in GPU memory. Graph neural networks (GNNs) need neighborhood information to make predictions, but expanding full neighborhoods across multiple hops leads to exponential memory growth. Graph sampling solves this by selecting relevant subgraphs rather than processing complete graphs.
The dominant approach is GraphSAGE-style sampling, which samples a fixed number of neighbors at each hop level. You configure the sampler with a budget per edge type per hop. For example, in a retailer database:
- Hop 1: 10 transactions per customer, 1 location
- Hop 2: 1 article per transaction, 2 customers per location
This produces a subgraph of at most 24 nodes when every budget is filled. A more sophisticated six-hop configuration can generate nearly 3,000 nodes, capturing rich multi-hop relational patterns across transactions, articles, customers, and locations.
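The 24-node figure can be checked with a few lines of arithmetic. The budget layout below is illustrative (a plain dict, not a real sampler API): each hop maps a (source type, target type) edge to a per-node fanout, and the count assumes every node fills its budget.

```python
# Illustrative per-hop, per-edge-type budgets for the retailer example.
budgets = [
    {("customer", "transaction"): 10, ("customer", "location"): 1},  # hop 1
    {("transaction", "article"): 1, ("location", "customer"): 2},    # hop 2
]

def expected_nodes(budgets, root_type="customer"):
    """Upper bound on subgraph size, assuming every node fills its budget."""
    frontier = {root_type: 1}  # node count per type at the current hop
    total = 1                  # count the root itself
    for hop in budgets:
        nxt = {}
        for (src, dst), fanout in hop.items():
            produced = frontier.get(src, 0) * fanout
            nxt[dst] = nxt.get(dst, 0) + produced
            total += produced
        frontier = nxt
    return total

print(expected_nodes(budgets))  # 1 + (10 + 1) + (10 + 2) = 24
```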
The sampling configuration directly controls the tradeoff between computational cost and information quality. Sample too few neighbors and you lose critical signals. Sample too many and you blow through GPU memory. The question is: given a fixed compute budget, how do you allocate samples to maximize the information captured?
Where Uniform Sampling Breaks Down
GraphSAGE assigns a fixed neighbor budget per edge type per hop. If you configure “sample 10 transactions per customer,” every customer gets exactly 10 transactions sampled, regardless of whether they have 2 transactions or 200. This uniform approach has a fundamental flaw: real-world graphs have highly heterogeneous connectivity.
The distribution problem
In the H&M article purchase dataset, about 20% of customers have fewer than 5 articles in their purchase history, while 45% have 20 or more. When you configure the sampler to pull 10 transactions per customer, those sparse customers cannot fill their budget. A customer with only 1 historical purchase ends up with a sampled subgraph that is fewer than 15% of the intended size.
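The waste is easy to quantify. The degree list below is made up to mirror the skew described above (it is not the actual H&M distribution):

```python
# Hypothetical transactions-per-customer degrees mirroring the skew above.
degrees = [1, 2, 3, 8, 25, 40, 120, 200]
budget = 10  # uniform "10 transactions per customer"

filled = sum(min(d, budget) for d in degrees)
wasted = sum(max(budget - d, 0) for d in degrees)
total = budget * len(degrees)

print(wasted, "of", total, "hop-1 slots go unfilled")
```

Under uniform sampling those unfilled slots are simply lost; under adaptive sampling they become budget to redistribute.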
Where the wasted budget goes
The unfilled transaction slots do not simply disappear. The remaining graph structure gives disproportionate weight to less informative connections. For a customer with a single purchase, the location-based branch of the subgraph dominates, even though location data is typically far less predictive than purchase history for tasks like recommendation or churn prediction.
Why naive fixes fail
The obvious fix is to oversample in later hops to compensate for sparse nodes. This makes things worse. Oversampling in later hops expands the wrong subtrees. Instead of adding more purchase-related nodes, it inflates whichever subtree happens to have capacity, amplifying the imbalance rather than correcting it.
Adaptive Metapath-Aware Sampling
The core idea of adaptive sampling is straightforward: when planned neighbors cannot be sampled (because they do not exist), compensate by redistributing that budget to structurally equivalent paths through the graph. The algorithm operates in three steps:
1. Detect shortfall. At each hop, check whether the actual neighbor count falls below the configured budget for each edge type.
2. Redistribute by metapath. Allocate the unused budget to other nodes sharing the same metapath (the sequence of edge types from root to current position).
3. Fall back to children. If no equivalent metapath nodes are available, sample additional children of nodes with structurally equivalent paths.
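A minimal sketch of steps 1 and 2, assuming a simple list-and-dict graph representation (step 3, the fallback to children, is omitted for brevity; all names are illustrative, not a real sampler API):

```python
import random
from collections import defaultdict

def sample_hop(frontier, neighbors, budget):
    """One hop of shortfall detection plus metapath-aware redistribution.

    frontier:  list of (node, metapath) pairs from the previous hop
    neighbors: node -> list of neighbors for a single edge type
    budget:    planned neighbors per frontier node
    """
    sampled = []
    shortfall = defaultdict(int)  # metapath -> unfilled slots (step 1)
    spare = {}                    # (node, metapath) -> unsampled neighbors
    for node, path in frontier:
        nbrs = neighbors.get(node, [])
        take = random.sample(nbrs, min(budget, len(nbrs)))
        sampled.extend(take)
        if len(nbrs) < budget:
            shortfall[path] += budget - len(nbrs)
        else:
            spare[(node, path)] = [n for n in nbrs if n not in take]
    # Step 2: hand unfilled slots only to nodes sharing the same metapath.
    for path, extra in shortfall.items():
        for (node, p), pool in spare.items():
            if p != path or extra == 0:
                continue
            grab, spare[(node, p)] = pool[:extra], pool[extra:]
            sampled.extend(grab)
            extra -= len(grab)
    return sampled
```

A node with one neighbor against a budget of three creates two unfilled slots, which flow to a same-metapath sibling with spare neighbors, so the hop still yields its full planned width.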
What is a metapath?
A metapath is the sequence of edge types traversed from the root node to reach a particular position in the sampled subgraph. For example, “customer → transaction → article” is a different metapath than “customer → location → customer.” The adaptive sampler treats these paths as semantically distinct and only redistributes budget within or between paths of the same type.
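The definition can be made concrete: record each sampled node's parent and incoming edge type during sampling, then read the metapath off by walking back to the root. The bookkeeping dicts below are illustrative, not a real sampler API. Note that the article and the second customer both sit at hop 2 yet carry different metapaths, so budget never flows between them.

```python
def metapath_of(node, parent, edge_type):
    """Recover a node's metapath by walking parent links back to the root."""
    path = []
    while node in parent:
        path.append(edge_type[node])
        node = parent[node]
    return tuple(reversed(path))

# Hypothetical bookkeeping recorded while sampling around cust_1 (the root).
parent = {"txn_1": "cust_1", "article_1": "txn_1",
          "loc_1": "cust_1", "cust_2": "loc_1"}
edge_type = {"txn_1": "customer->transaction",
             "article_1": "transaction->article",
             "loc_1": "customer->location",
             "cust_2": "location->customer"}

print(metapath_of("article_1", parent, edge_type))
# ('customer->transaction', 'transaction->article')
print(metapath_of("cust_2", parent, edge_type))
# ('customer->location', 'location->customer')
```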
Why metapath awareness matters
Without metapath awareness, a naive redistribution would dump surplus budget into whatever edge types have available neighbors. This is exactly the problem with the “oversample later hops” approach. Metapath-aware redistribution ensures that purchase-related budget stays purchase-related, and location-related budget stays location-related.
How It Works in Practice
Two concrete scenarios illustrate how adaptive sampling handles sparse connectivity in the H&M retailer dataset.
Scenario 1: single-transaction customer
A customer has exactly 1 historical purchase. Under standard GraphSAGE with a budget of 10 transactions, 9 slots go unfilled. The sampled subgraph is tiny and dominated by location connections.
Under adaptive sampling, the algorithm detects the shortfall at hop 1 and redistributes the unused budget. Instead of sampling 10 different transactions (which do not exist), it samples 100 past buyers of that single item. These co-purchase signals from the item's metapath are far more informative than inflating the location subtree.
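The arithmetic behind the 100-buyer figure, assuming a planned hop-2 fanout of 10 past buyers per purchased article (the fanout is an assumption; the source states only the 100-buyer total):

```python
planned_txns = 10     # hop-1 budget for transactions
buyers_per_txn = 10   # assumed hop-2 fanout per purchased article
actual_txns = 1       # the customer's real purchase history

planned_buyers = planned_txns * buyers_per_txn  # 100 planned co-buyers
# The entire hop-2 budget concentrates on the one article that exists:
buyers_for_single_item = planned_buyers // actual_txns
print(buyers_for_single_item)  # 100
```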
Scenario 2: limited purchase history (2 items)
A customer has purchased 2 items (one bought by 5 other customers, the other by 45). The total of 50 co-purchasers falls short of the planned 100. The adaptive sampler responds by doubling the transaction samples in subsequent hops for both item subtrees, ensuring the subgraph captures deeper purchase patterns even when breadth is limited.
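Scenario 2 in numbers, as a sketch of the bookkeeping: the depth multiplier falls out of the ratio between planned and available breadth.

```python
planned_breadth = 100  # planned co-purchasers across both item subtrees
available = [5, 45]    # co-purchasers actually reachable per item
found = sum(available) # 50: breadth falls short by half

# Compensate in depth: scale later-hop transaction budgets by the shortfall.
depth_multiplier = planned_breadth / found
print(depth_multiplier)  # 2.0 -> double the transaction samples downstream
```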
Scaling up first-hop budgets
Because adaptive redistribution handles sparse nodes gracefully, you can safely increase first-hop neighbor counts far beyond what uniform sampling supports. Configurations like 1,000 transactions and 200 locations at hop 1 become practical. Customers with rich histories get massive subgraphs. Customers with sparse histories get appropriately rebalanced subgraphs through adaptive redistribution. Neither case wastes compute or produces degenerate subgraph structures.
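Why large hop-1 budgets stay safe: unfilled slots are redistributed rather than wasted. A toy accounting with illustrative numbers:

```python
# Hypothetical aggressive hop-1 budget.
hop1_budget = {"transactions": 1000, "locations": 200}

def hop1_accounting(n_txns, n_locs):
    """Slots a customer fills directly vs. slots freed for redistribution."""
    used = (min(n_txns, hop1_budget["transactions"])
            + min(n_locs, hop1_budget["locations"]))
    freed = sum(hop1_budget.values()) - used
    return used, freed

print(hop1_accounting(3, 1))     # sparse customer frees most of the budget
print(hop1_accounting(800, 50))  # rich customer consumes it directly
```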
Benchmark Results
The adaptive sampler was evaluated against standard GraphSAGE sampling across 9 link prediction tasks from the RelBench benchmark. All tasks use MAP@k (mean average precision at k) as the evaluation metric. Higher is better.
| Dataset | Task | GraphSAGE | Adaptive | Improvement |
|---|---|---|---|---|
| rel-amazon | user-item-purchase | 0.014 | 0.021 | +50.0% |
| rel-amazon | user-item-rate | 0.017 | 0.023 | +35.3% |
| rel-amazon | user-item-review | 0.013 | 0.019 | +46.2% |
| rel-avito | user-ad-visit | 0.031 | 0.036 | +16.1% |
| rel-hm | user-item-purchase | 0.030 | 0.032 | +6.7% |
| rel-stack | post-post-related | 0.120 | 0.110 | -8.3% |
| rel-stack | post-post-comment | 0.125 | 0.129 | +3.2% |
| rel-trial | condition-sponsor-run | 0.093 | 0.111 | +19.4% |
| rel-trial | site-sponsor-run | 0.244 | 0.251 | +2.9% |
Headline numbers
- Average improvement: 19% relative gain across all 9 tasks
- Amazon datasets: up to 50% improvement on user-item-purchase, the largest gain in the benchmark
- Clinical trials: 19.4% improvement on condition-sponsor-run, showing generalization beyond e-commerce
- One regression: rel-stack post-post-related drops 8.3%, the only task where adaptive sampling underperforms
Why Amazon benefits most
Amazon's product catalog creates extreme heterogeneity in node connectivity. Some users have thousands of purchases while others have just a handful. This is exactly the scenario where uniform sampling wastes the most budget, and where adaptive redistribution recovers the most information.
Performance across customer segments
On the H&M dataset, performance improvements appear across virtually all customer segments, including those with approximately 10 transactions where the fixed configuration should theoretically perform optimally. This suggests adaptive sampling provides benefits beyond just handling sparse nodes: the redistribution mechanism discovers useful structural patterns even when the base budget is not constrained.
When Adaptive Sampling Does Not Help
Adaptive sampling is not universally beneficial. Three conditions must hold for the method to provide meaningful gains:
- At least two sequenced primary-to-foreign key connections. The redistribution mechanism operates across multi-hop paths. If the graph has only single-hop relationships, there is nothing to redistribute.
- At least three hops in the sampling configuration. With only one or two hops, the sampler does not have enough depth to detect shortfalls and compensate in subsequent layers.
- Increased neighbor sampling must demonstrably improve accuracy. For tasks where most predictive information exists in base table features (rather than graph structure), adaptive sampling adds computational overhead without improving predictions.
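The three conditions can be encoded as a quick applicability check (a heuristic sketch, not part of any shipped API):

```python
def adaptive_sampling_worthwhile(pk_fk_chain_length: int,
                                 sampling_hops: int,
                                 more_neighbors_help: bool) -> bool:
    """All three conditions from the list above must hold."""
    return (pk_fk_chain_length >= 2   # sequenced PK-to-FK connections
            and sampling_hops >= 3    # enough depth to compensate
            and more_neighbors_help)  # graph structure carries signal

# Stack Overflow-style case: shallow and content-dominated, so not worthwhile.
print(adaptive_sampling_worthwhile(1, 2, False))  # False
```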
The rel-stack post-post-related result illustrates this. Stack Overflow's post-to-post relationships are relatively uniform in connectivity, and the predictive signal comes primarily from post content rather than deep graph structure. Adaptive sampling adds complexity without finding better structural patterns, resulting in an 8.3% decrease.
Key Takeaways
Adaptive metapath-aware sampling addresses a structural weakness in how GNNs handle real-world relational data. The core contributions and practical implications:
- Uniform sampling systematically underserves sparse nodes. In the H&M dataset, 20% of customers produce subgraphs smaller than 15% of the intended size. This is not a rare edge case; it is a structural property of heterogeneous graphs.
- Metapath-aware redistribution preserves semantic balance. By redistributing budget within equivalent metapaths, the algorithm avoids inflating irrelevant subtrees while ensuring every node gets a complete, informative subgraph.
- 19% average relative improvement on RelBench link prediction, with gains of up to 50% on Amazon datasets and improvements across customer segments, not only on sparse nodes.
- Enables aggressive first-hop configurations. Budgets of 1,000+ transactions at hop 1 become practical because adaptive redistribution gracefully handles the resulting variance in subgraph sizes.
- Know when to use it. The method requires multi-hop heterogeneous graphs with at least three sampling hops. On uniform, shallow graphs, standard GraphSAGE sampling remains appropriate.
For production GNN systems operating on relational databases, adaptive sampling is a low-effort, high-impact improvement. It requires no changes to model architecture, no additional training, and no hyperparameter tuning beyond the sampling configuration you already maintain. The sampler simply uses its budget more intelligently.
Try KumoRFM on your own data
Zero-shot predictions are free. Fine-tuning is available with a trial.