A random walk on a graph is a sequence of nodes generated by starting at a node and repeatedly moving to a uniformly random neighbor, capturing the local and semi-local structure of the graph through the statistical properties of the walk. Random walks are foundational to graph ML. They power node embedding methods (DeepWalk, node2vec), subgraph sampling strategies (GraphSAINT), positional encodings for GNNs, and classical algorithms like PageRank. Despite their simplicity, random walks encode rich structural information that more complex methods build upon.
Why it matters for enterprise data
Random walks on enterprise relational graphs traverse the foreign-key structure naturally. A walk starting at a customer node might follow: customer → order → product → order → another customer. This path captures the relationship between two customers who bought the same product, exactly the collaborative filtering signal that powers recommendations.
The power of random walks is that they explore multi-hop paths automatically, without explicit path enumeration. Because each logical relationship in a relational graph typically passes through an intermediate row (customer → order → product is two edges), a 10-step walk covers roughly five entity-level relationships, discovering connections that would require deeply nested SQL joins to enumerate manually.
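The multi-hop traversal described above can be sketched in a few lines of plain Python; the toy graph and node IDs here are invented for illustration:

```python
import random

# Toy relational graph as an adjacency list (hypothetical IDs):
# customers c1, c2; orders o1, o2; products p1, p2
graph = {
    "c1": ["o1"], "c2": ["o2"],
    "o1": ["c1", "p1", "p2"], "o2": ["c2", "p1"],
    "p1": ["o1", "o2"], "p2": ["o1"],
}

def uniform_random_walk(graph, start, walk_length, rng=random):
    """Start at `start` and repeatedly hop to a uniformly random neighbor."""
    walk = [start]
    for _ in range(walk_length):
        neighbors = graph[walk[-1]]
        if not neighbors:  # dead end: stop early
            break
        walk.append(rng.choice(neighbors))
    return walk

random.seed(0)
walk = uniform_random_walk(graph, "c1", 10)
print(walk)  # e.g. ['c1', 'o1', ...] -- alternates entity types as it wanders
```

Each step crosses one foreign-key edge, so the walk strings together customer, order, and product rows without any join being written.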
How random walks work
import torch
from torch_cluster import random_walk  # random_walk lives in torch_cluster, which PyG uses internally

# `data` is a torch_geometric.data.Data object; data.edge_index has shape [2, num_edges]
# Generate random walks starting from nodes 0, 1, 2
start_nodes = torch.tensor([0, 1, 2])
walk_length = 10

# Each row is a walk: [start, step1, step2, ..., step10]
walks = random_walk(
    data.edge_index[0], data.edge_index[1],
    start=start_nodes,
    walk_length=walk_length,
)

# walks.shape = [3, 11] (3 walks, 11 nodes each including the start)
# Example walk from customer node 0:
# [0, 42, 108, 42, 0, 15, 203, 15, 0, 42, 108]
# Node 0 (customer) -> Node 42 (order) -> Node 108 (product)
# -> back to 42 -> back to 0 -> ...
# The walk naturally traverses relational structure
Random walks traverse the relational graph automatically. A walk from a customer visits its orders and products, revealing purchasing patterns.
Concrete example: product co-occurrence from walks
Consider 100,000 random walks of length 20 starting from customer nodes in an e-commerce graph:
- Products that frequently co-occur within the same walk are related (co-purchased, same category, complementary)
- Customers that co-occur within the same walk (connected through shared products) are likely similar
- The co-occurrence frequency is a measure of relatedness: products appearing in the same walk 500 times are more related than those appearing 5 times
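The co-occurrence counting described above can be sketched directly. The graph below is a small invented stand-in (direct customer-product edges in place of the customer → order → product chain), not real e-commerce data:

```python
import random
from collections import Counter
from itertools import combinations

# Toy graph: customers <-> products, standing in for the full relational chain
graph = {
    "c1": ["p1", "p2"], "c2": ["p1", "p3"], "c3": ["p3"],
    "p1": ["c1", "c2"], "p2": ["c1"], "p3": ["c2", "c3"],
}

def walk_from(node, length, rng):
    walk = [node]
    for _ in range(length):
        walk.append(rng.choice(graph[walk[-1]]))
    return walk

rng = random.Random(42)
cooccur = Counter()
for _ in range(1000):  # many short walks from each customer node
    for start in ("c1", "c2", "c3"):
        walk = walk_from(start, 6, rng)
        products = {n for n in walk if n.startswith("p")}
        for a, b in combinations(sorted(products), 2):
            cooccur[(a, b)] += 1  # products seen in the same walk

# Products reachable through shared customers co-occur most often
print(cooccur.most_common(3))
```

Products linked by a common customer (p1 and p2 via c1) rack up far higher counts than products that can only meet through a long detour (p2 and p3), which is exactly the relatedness signal the bullets describe.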
This is the foundation of DeepWalk and node2vec: treat walks as sentences, nodes as words, and train a Word2Vec-style model on the co-occurrence patterns. The result is node embeddings where similar nodes are close in embedding space.
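A minimal skip-gram with negative sampling over walks looks like the following. This is an illustrative sketch: the random "walks" tensor and all hyperparameters are fabricated placeholders, and real pipelines typically use gensim's Word2Vec or torch_geometric.nn.Node2Vec instead:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_nodes, dim, window = 50, 16, 2
# Placeholder for real walks: a [num_walks, walk_len] tensor of node IDs
walks = torch.randint(num_nodes, (200, 11))

emb = torch.nn.Embedding(num_nodes, dim)  # "word" (node) vectors
ctx = torch.nn.Embedding(num_nodes, dim)  # "context" vectors
opt = torch.optim.Adam(list(emb.parameters()) + list(ctx.parameters()), lr=0.01)

# Build (center, context) pairs: nodes within `window` positions of each other,
# exactly as Word2Vec pairs words within a sentence window
centers, contexts = [], []
L = walks.size(1)
for i in range(L):
    for j in range(max(0, i - window), min(L, i + window + 1)):
        if i != j:
            centers.append(walks[:, i])
            contexts.append(walks[:, j])
centers = torch.cat(centers)
contexts = torch.cat(contexts)

for _ in range(5):  # skip-gram with negative sampling
    negs = torch.randint(num_nodes, contexts.shape)  # random negative nodes
    pos_score = (emb(centers) * ctx(contexts)).sum(-1)
    neg_score = (emb(centers) * ctx(negs)).sum(-1)
    loss = F.binary_cross_entropy_with_logits(pos_score, torch.ones_like(pos_score)) \
         + F.binary_cross_entropy_with_logits(neg_score, torch.zeros_like(neg_score))
    opt.zero_grad(); loss.backward(); opt.step()

embeddings = emb.weight.detach()  # [num_nodes, dim] node embeddings
```

Training pushes co-occurring (center, context) pairs together and random negatives apart, so nodes that share walk neighborhoods end up close in embedding space.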
Types of random walks
- Uniform: Each neighbor has equal transition probability. Simple, captures general structure.
- Biased (node2vec): Transition probabilities depend on the previous node. The return parameter p and in-out parameter q bias walks toward BFS-like (local) or DFS-like (exploratory) behavior.
- Weighted: Transition probabilities proportional to edge weights. A transaction edge with amount $10,000 gets higher probability than one with $10.
- Temporal: Only follow edges with timestamps after the current step's time. Respects temporal ordering in enterprise data.
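The node2vec bias can be made concrete: given the previous node t and current node v, each neighbor x of v is weighted 1/p if x = t (return), 1 if x is also a neighbor of t (stay close), and 1/q otherwise (move away). A small sketch on an invented graph:

```python
import random

graph = {
    "a": ["b", "c"],
    "b": ["a", "c", "d"],
    "c": ["a", "b"],
    "d": ["b"],
}

def node2vec_step(prev, curr, p, q, rng=random):
    """Pick the next node with node2vec's second-order bias."""
    weights = []
    for x in graph[curr]:
        if x == prev:              # returning to the previous node
            weights.append(1.0 / p)
        elif prev in graph[x]:     # x is also a neighbor of prev (distance 1)
            weights.append(1.0)
        else:                      # moving farther from prev (distance 2)
            weights.append(1.0 / q)
    return rng.choices(graph[curr], weights=weights, k=1)[0]

# High p discourages backtracking; low q encourages outward (DFS-like) moves
random.seed(1)
walk = ["a", "b"]
for _ in range(8):
    walk.append(node2vec_step(walk[-2], walk[-1], p=4.0, q=0.5))
print(walk)
```

Setting p=4.0 and q=0.5 here makes the walk wander outward; swapping to p=0.5, q=4.0 would keep it oscillating near its start.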
Limitations and what comes next
- No feature utilization: Standard random walks only use graph topology. Node features are ignored during walk generation. GNN-based approaches incorporate features directly in message passing.
- Transductive: Walk-based embeddings (DeepWalk, node2vec) exist only for nodes present during training; new nodes require re-running walks and re-training.
- Walk length vs. coverage trade-off: Short walks capture local structure but miss long-range patterns. Long walks capture more but become noisy as they wander far from the start node.