Graph generation is the task of creating new, realistic graphs by learning the distribution of existing graphs and sampling from it. Unlike discriminative tasks (classification, regression) that predict properties of existing graphs, generative models produce entirely new graphs with plausible structure and features. This is the graph equivalent of image generation (DALL-E, Stable Diffusion), but with the added challenge that graphs have variable size, no spatial ordering, and discrete topology.
Why it matters for enterprise data
Enterprise graph generation addresses three practical needs:
- Synthetic data for ML development: Generate realistic customer-transaction graphs that preserve statistical properties without containing real customer data. Useful for model development, testing, and third-party collaboration under privacy constraints.
- Scenario simulation: Generate plausible fraud ring topologies, supply chain disruption patterns, or market crash cascades to stress-test detection and response systems.
- Data augmentation: Rare patterns (fraud rings, network failures) have too few training examples. Generate additional realistic examples to address class imbalance and improve model robustness.
Three approaches to graph generation
Variational Graph Autoencoders (VGAE)
Encode the graph's nodes into low-dimensional latent vectors with a GNN encoder, then decode edges from pairwise similarities between latents. The latent space is regularized toward a Gaussian prior, so it is smooth: nearby latent vectors decode to similar graphs, and freshly sampled latents decode to plausible ones.
Autoregressive generation
Generate graphs sequentially: add one node at a time, decide which edges to connect to existing nodes, assign features. Each step is conditioned on the graph built so far. Allows fine-grained control but is sequential and slow.
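The loop above can be sketched in a few lines. This is a toy, untrained version: `EdgeScorer` stands in for a learned model that, given the new node's state and the states of existing nodes, outputs per-edge probabilities; a real autoregressive generator would condition the node state on the graph built so far.

```python
import torch

class EdgeScorer(torch.nn.Module):
    """Stand-in for a learned scorer: probability of an edge between a
    newly added node and each existing node, from their embeddings."""
    def __init__(self, dim):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(2 * dim, dim), torch.nn.ReLU(),
            torch.nn.Linear(dim, 1))

    def forward(self, new_emb, existing_emb):
        # Pair the new node's embedding with every existing node's.
        pairs = torch.cat(
            [new_emb.expand(existing_emb.size(0), -1), existing_emb], dim=-1)
        return torch.sigmoid(self.mlp(pairs)).squeeze(-1)

def generate(num_nodes, dim=8, seed=0):
    torch.manual_seed(seed)
    scorer = EdgeScorer(dim)       # untrained here; learned in practice
    embeddings, edges = [], []
    for i in range(num_nodes):
        emb = torch.randn(dim)     # stand-in for a conditioned node state
        if embeddings:             # decide edges to every existing node
            probs = scorer(emb, torch.stack(embeddings))
            for j, p in enumerate(probs):
                if torch.bernoulli(p):
                    edges.append((j, i))
        embeddings.append(emb)
    return edges

edges = generate(10)
```

Note how every edge decision at step i can only point to nodes 0..i-1: each step sees just the partial graph, which is what makes this family slow (O(n) sequential steps) but easy to constrain.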
Diffusion-based generation
Start from a random graph (noise) and iteratively denoise: refine node features, add/remove edges, and sharpen structure over many steps. Produces high-quality samples but requires many denoising steps.
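The denoising loop can be sketched with a toy stand-in for the learned model. Everything here is schematic: `toy_denoiser` replaces a trained GNN (it just pulls edge probabilities toward a target density of 0.1), and the blending schedule is illustrative rather than a real diffusion parameterization.

```python
import numpy as np

def denoise_step(probs, t, T, model_pred):
    # Blend noisy edge probabilities with the model's prediction; the
    # prediction's weight grows as t -> 0 (schematic schedule).
    alpha = 1 - t / T
    return alpha * model_pred + (1 - alpha) * probs

def toy_denoiser(probs):
    # Stand-in for a learned GNN denoiser. A real model would predict
    # per-edge probabilities from the current noisy graph.
    return np.full_like(probs, 0.1)

rng = np.random.default_rng(0)
n, T = 20, 50
probs = rng.uniform(size=(n, n))
probs = (probs + probs.T) / 2                   # symmetric noise
for t in range(T, 0, -1):                       # many denoising steps
    probs = denoise_step(probs, t, T, toy_denoiser(probs))

u = rng.uniform(size=(n, n))
upper = np.triu(u, 1) < np.triu(probs, 1)       # sample upper triangle
adj = (upper | upper.T).astype(int)             # undirected, no self-loops
```

The cost structure is visible even in the toy: one model call per denoising step, times T steps, which is why diffusion samples are high quality but slow to draw.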
```python
import torch
from torch_geometric.nn import GCNConv, VGAE

class Encoder(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim, latent_dim):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv_mu = GCNConv(hidden_dim, latent_dim)
        self.conv_logstd = GCNConv(hidden_dim, latent_dim)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index).relu()
        return self.conv_mu(x, edge_index), self.conv_logstd(x, edge_index)

# VGAE: encode graph -> latent space, decode via inner product.
# `data` is assumed to be a torch_geometric.data.Data object.
model = VGAE(Encoder(data.num_features, 32, 16))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(200):
    model.train()
    optimizer.zero_grad()
    z = model.encode(data.x, data.edge_index)
    loss = model.recon_loss(z, data.edge_index) + model.kl_loss() / data.num_nodes
    loss.backward()
    optimizer.step()

# Generate: sample from the latent space, decode to edges
model.eval()
with torch.no_grad():
    z = torch.randn(100, 16)         # 100 nodes, 16-dim latent
    adj = torch.sigmoid(z @ z.T)     # inner-product decoding
    edges = (adj > 0.5).nonzero().T  # threshold to get binary edges
```

VGAE encodes graphs to a smooth latent space. Sampling latent vectors and decoding via the inner product generates new graph topologies.
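A practical post-processing detail: the inner-product decoder is symmetric and scores every node highly against itself, so raw samples contain self-loops and each edge twice. A minimal cleanup sketch (variable names match the generation snippet above):

```python
import torch

torch.manual_seed(0)
z = torch.randn(100, 16)                    # sampled latent vectors
adj = torch.sigmoid(z @ z.T)                # symmetric inner-product scores
adj.fill_diagonal_(0)                       # drop self-loops
triu = torch.triu(adj, diagonal=1) > 0.5    # keep each pair once
src, dst = triu.nonzero(as_tuple=True)
edge_index = torch.cat([torch.stack([src, dst]),
                        torch.stack([dst, src])], dim=1)  # undirected
```

The result is a PyG-style `edge_index` with no self-loops and both directions of each undirected edge, ready to wrap in a `Data` object.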
Concrete example: synthetic transaction graphs for testing
A financial institution needs realistic transaction graphs for testing a new fraud detection system but cannot use real customer data due to privacy regulations:
- Train a VGAE on anonymized graph statistics from the real transaction network
- Generate 1,000 synthetic subgraphs that match the real network's degree distribution, clustering coefficient, and transaction amount distributions
- Inject known fraud patterns into a subset of generated subgraphs
- Use the synthetic dataset to develop and test the fraud detection model
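Step two of this workflow implies a validation check: do the generated subgraphs actually match the real network's statistics? A minimal sketch of that comparison using only adjacency matrices (the random graphs here are placeholders for real and synthetic samples):

```python
import numpy as np

def degree_hist(adj, max_deg=10):
    # Normalized degree distribution, capped at max_deg.
    deg = adj.sum(axis=1).astype(int)
    return np.bincount(np.clip(deg, 0, max_deg), minlength=max_deg + 1) / len(deg)

def global_clustering(adj):
    # Triangles over connected triplets, via powers of the adjacency matrix.
    triangles = np.trace(np.linalg.matrix_power(adj, 3))
    deg = adj.sum(axis=1)
    triplets = (deg * (deg - 1)).sum()
    return triangles / triplets if triplets > 0 else 0.0

def random_graph(rng, n=50, p=0.1):        # placeholder for a real sample
    a = (rng.uniform(size=(n, n)) < p).astype(int)
    a = np.triu(a, 1)
    return a + a.T                          # undirected, no self-loops

rng = np.random.default_rng(0)
real, synth = random_graph(rng), random_graph(rng)
l1_gap = np.abs(degree_hist(real) - degree_hist(synth)).sum()
cc_gap = abs(global_clustering(real) - global_clustering(synth))
```

In practice these gaps would be averaged over many real and synthetic subgraphs, and transaction-amount distributions would be compared the same way on edge features.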
Limitations and what comes next
- Scale: Current methods work well for graphs with 10-100 nodes (molecules, small networks). Generating enterprise graphs with millions of nodes requires hierarchical or compositional approaches.
- Validity: Generated graphs must satisfy domain constraints (valid molecular valency, realistic transaction amounts, plausible degree distributions). Incorporating hard constraints into generation is an active research area.
- Evaluation: Measuring the quality of generated graphs is harder than for images. Common metrics (MMD on graph statistics, FID on embeddings) capture some aspects but miss others.
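To make the MMD metric mentioned above concrete: it compares two samples of graph-statistic vectors (e.g. degree histograms, one per graph) under a kernel. A small sketch with a Gaussian kernel and synthetic placeholder statistics:

```python
import numpy as np

def gaussian_mmd(x, y, sigma=1.0):
    """Biased squared-MMD estimate between two samples of statistic
    vectors, using a Gaussian kernel."""
    def k(a, b):
        d = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

rng = np.random.default_rng(0)
real_stats = rng.normal(0.0, 1.0, size=(30, 5))  # placeholder statistics
same = rng.normal(0.0, 1.0, size=(30, 5))        # same distribution
shifted = rng.normal(2.0, 1.0, size=(30, 5))     # different distribution
# gaussian_mmd(real_stats, shifted) should exceed
# gaussian_mmd(real_stats, same)
```

MMD near zero means the generated statistics are distributionally close to the real ones, but it only sees the statistics you feed it, which is exactly the limitation noted above.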