Node2Vec is an unsupervised graph embedding method that learns dense vector representations for nodes by performing biased random walks on the graph and training a skip-gram model to predict co-occurring nodes within each walk. Introduced by Grover and Leskovec in 2016, it extends DeepWalk with two hyperparameters (p, q) that control whether walks explore broadly (capturing structural roles) or stay local (capturing community membership). Node2Vec embeddings are useful for clustering, visualization, and as pre-trained features for downstream classifiers.
Why it matters for enterprise data
Node2Vec provides a fast, unsupervised way to extract graph structure into vectors that standard ML models can consume. For enterprise teams without deep GNN expertise, Node2Vec offers a path from relational data to predictions:
- Build graph from relational database (foreign keys = edges)
- Run Node2Vec to get embeddings per entity
- Feed embeddings as features to XGBoost, LightGBM, or logistic regression
- Predict churn, fraud, or recommendations using standard ML pipelines
This “graph feature engineering” approach captures relational patterns without requiring a full GNN pipeline. It is a practical first step for teams exploring graph ML.
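The first step of that pipeline, turning foreign-key relationships into an edge list, can be sketched with nothing but the standard library (the table contents and names below are invented for illustration):

```python
# Hypothetical "orders" table: each row is a (customer, merchant)
# foreign-key pair pulled from a relational database
orders = [
    ("cust_1", "merch_a"),
    ("cust_1", "merch_b"),
    ("cust_2", "merch_a"),
]

# Assign a contiguous integer id to every distinct entity
node_ids = {}
def node_id(name):
    return node_ids.setdefault(name, len(node_ids))

# One edge per foreign-key reference, in a shared node-id space
edges = [(node_id(c), node_id(m)) for c, m in orders]
print(edges)
```

The resulting `edges` list is what a graph library would consume as its edge index (for PyG, transposed into a 2×E tensor), with customers and merchants sharing one id space because the graph is bipartite.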
How Node2Vec works
Step 1: Generate biased random walks
Starting from every node, generate multiple walks of fixed length (e.g., 20 steps). The bias parameters p and q determine transition probabilities:
- p (return parameter): Controls probability of returning to the previous node. Low p = more likely to backtrack = BFS-like local exploration.
- q (in-out parameter): Controls probability of moving to nodes further from the start. Low q = more likely to explore outward = DFS-like exploration.
Step 2: Train skip-gram model
Treat each walk as a “sentence” and each node as a “word.” Train a skip-gram model to predict context nodes (within a window) from a center node. Nodes that frequently co-occur in walks get similar embeddings.
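The objective can be sketched in plain NumPy as skip-gram with negative sampling over one toy walk (shapes and the walk itself are illustrative; this is not a library implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
num_nodes, dim = 6, 8
emb = rng.normal(scale=0.1, size=(num_nodes, dim))  # "center" embeddings
ctx = rng.normal(scale=0.1, size=(num_nodes, dim))  # "context" embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(center, context, negatives):
    # Reward the true (center, context) pair...
    pos = -np.log(sigmoid(emb[center] @ ctx[context]))
    # ...and penalize randomly drawn negative nodes
    neg = -np.log(sigmoid(-(emb[center] @ ctx[negatives].T))).sum()
    return pos + neg

walk = [0, 2, 3, 2, 4]  # one toy random walk, treated as a "sentence"
window = 2
loss, pairs = 0.0, 0
for i, center in enumerate(walk):
    for j in range(max(0, i - window), min(len(walk), i + window + 1)):
        if j == i:
            continue
        negatives = rng.integers(0, num_nodes, size=2)
        loss += sgns_loss(center, walk[j], negatives)
        pairs += 1
print(loss / pairs)  # average loss over all (center, context) pairs
```

Gradient descent on this loss pulls the embeddings of frequently co-occurring nodes together, which is exactly what libraries like PyG automate below.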
import torch
from torch_geometric.nn import Node2Vec

# Initialize Node2Vec
model = Node2Vec(
    data.edge_index,
    embedding_dim=128,        # 128-dim embeddings
    walk_length=20,           # 20-step walks
    context_size=10,          # skip-gram window
    walks_per_node=10,        # 10 walks per node
    num_negative_samples=1,   # 1 negative per positive
    p=1.0,                    # return parameter
    q=0.5,                    # in-out parameter
    num_nodes=data.num_nodes,
)

# Training: the loader yields (positive walk, negative sample) batches
loader = model.loader(batch_size=128, shuffle=True, num_workers=4)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
model.train()
for epoch in range(100):
    for pos_rw, neg_rw in loader:
        optimizer.zero_grad()
        loss = model.loss(pos_rw, neg_rw)
        loss.backward()
        optimizer.step()

# Get embeddings for all nodes
embeddings = model()  # shape: [num_nodes, 128]

# Use embeddings for downstream tasks
from sklearn.cluster import KMeans
clusters = KMeans(n_clusters=10).fit_predict(embeddings.detach().numpy())

PyG's Node2Vec handles walk generation and skip-gram training. The resulting embeddings can be clustered, visualized, or fed to any standard ML model.
Concrete example: customer similarity from transaction graphs
A bank runs Node2Vec on a customer-merchant bipartite graph:
- 500,000 customer nodes, 50,000 merchant nodes
- 10M transaction edges
- p=1, q=0.5 (moderate outward exploration)
- 128-dimensional embeddings, 20-step walks, 10 walks per node
After training, customers with similar spending patterns (same merchants, same categories) have similar embeddings. K-means clustering on embeddings reveals 15 natural customer segments: luxury shoppers, grocery-focused, travel-heavy, etc. These segments emerge from transaction topology alone, without any customer features or labels.
Limitations and what comes next
- Transductive only: Fixed embedding table. New customers get no embedding until the entire model is retrained. Production systems require inductive approaches.
- No feature usage: Ignores node attributes. A customer's age, income, and account type are invisible to Node2Vec. GNNs incorporate these features directly.
- p, q sensitivity: Optimal p and q values vary by task and graph. Grid search over p, q is necessary, adding to development time.
For supervised enterprise tasks, GNNs (GCNConv, GAT, SAGEConv) outperform Node2Vec because they use features and are optimized end-to-end for the prediction target. KumoRFM takes this further with a pre-trained graph transformer that generalizes across relational databases.