Graph generation is the task of creating new, realistic graphs by learning the distribution of existing graphs and sampling from it. Unlike discriminative tasks (classification, regression) that predict properties of existing graphs, generative models produce entirely new graphs with plausible structure and features. This is the graph equivalent of image generation (DALL-E, Stable Diffusion), but with the added challenge that graphs have variable size, no spatial ordering, and discrete topology.
Why it matters for enterprise data
Enterprise graph generation addresses three practical needs:
- Synthetic data for ML development: Generate realistic customer-transaction graphs that preserve statistical properties without containing real customer data. Useful for model development, testing, and third-party collaboration under privacy constraints.
- Scenario simulation: Generate plausible fraud ring topologies, supply chain disruption patterns, or market crash cascades to stress-test detection and response systems.
- Data augmentation: Rare patterns (fraud rings, network failures) have too few training examples. Generate additional realistic examples to address class imbalance and improve model robustness.
Three approaches to graph generation
Variational Graph Autoencoders (VGAE)
Encode the graph's nodes into low-dimensional latent vectors with a GNN encoder, then decode edges from pairwise similarities between latents. The latent space is regularized toward a Gaussian prior, so it is smooth: nearby latent vectors decode to similar graphs, and freshly sampled latents decode to plausible ones.
Autoregressive generation
Generate graphs sequentially: add one node at a time, decide which edges to connect to existing nodes, assign features. Each step is conditioned on the graph built so far. Allows fine-grained control but is sequential and slow.
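The loop above can be sketched in a few lines. This is a toy, untrained version: `EdgeScorer` stands in for a learned model that, given the new node's state and the states of existing nodes, outputs per-edge probabilities; a real autoregressive generator would condition the node state on the graph built so far.

```python
import torch

class EdgeScorer(torch.nn.Module):
    """Stand-in for a learned scorer: probability of an edge between a
    newly added node and each existing node, from their embeddings."""
    def __init__(self, dim):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(2 * dim, dim), torch.nn.ReLU(),
            torch.nn.Linear(dim, 1))

    def forward(self, new_emb, existing_emb):
        # Pair the new node's embedding with every existing node's.
        pairs = torch.cat(
            [new_emb.expand(existing_emb.size(0), -1), existing_emb], dim=-1)
        return torch.sigmoid(self.mlp(pairs)).squeeze(-1)

def generate(num_nodes, dim=8, seed=0):
    torch.manual_seed(seed)
    scorer = EdgeScorer(dim)       # untrained here; learned in practice
    embeddings, edges = [], []
    for i in range(num_nodes):
        emb = torch.randn(dim)     # stand-in for a conditioned node state
        if embeddings:             # decide edges to every existing node
            probs = scorer(emb, torch.stack(embeddings))
            for j, p in enumerate(probs):
                if torch.bernoulli(p):
                    edges.append((j, i))
        embeddings.append(emb)
    return edges

edges = generate(10)
```

Note how every edge decision at step i can only point to nodes 0..i-1: each step sees just the partial graph, which is what makes this family slow (O(n) sequential steps) but easy to constrain.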
Diffusion-based generation
Start from a random graph (noise) and iteratively denoise: refine node features, add/remove edges, and sharpen structure over many steps. Produces high-quality samples but requires many denoising steps.
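The denoising loop can be sketched with a toy stand-in for the learned model. Everything here is schematic: `toy_denoiser` replaces a trained GNN (it just pulls edge probabilities toward a target density of 0.1), and the blending schedule is illustrative rather than a real diffusion parameterization.

```python
import numpy as np

def denoise_step(probs, t, T, model_pred):
    # Blend noisy edge probabilities with the model's prediction; the
    # prediction's weight grows as t -> 0 (schematic schedule).
    alpha = 1 - t / T
    return alpha * model_pred + (1 - alpha) * probs

def toy_denoiser(probs):
    # Stand-in for a learned GNN denoiser. A real model would predict
    # per-edge probabilities from the current noisy graph.
    return np.full_like(probs, 0.1)

rng = np.random.default_rng(0)
n, T = 20, 50
probs = rng.uniform(size=(n, n))
probs = (probs + probs.T) / 2                   # symmetric noise
for t in range(T, 0, -1):                       # many denoising steps
    probs = denoise_step(probs, t, T, toy_denoiser(probs))

u = rng.uniform(size=(n, n))
upper = np.triu(u, 1) < np.triu(probs, 1)       # sample upper triangle
adj = (upper | upper.T).astype(int)             # undirected, no self-loops
```

The cost structure is visible even in the toy: one model call per denoising step, times T steps, which is why diffusion samples are high quality but slow to draw.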
```python
import torch
from torch_geometric.nn import GCNConv, VGAE

class Encoder(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim, latent_dim):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv_mu = GCNConv(hidden_dim, latent_dim)
        self.conv_logstd = GCNConv(hidden_dim, latent_dim)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index).relu()
        return self.conv_mu(x, edge_index), self.conv_logstd(x, edge_index)

# VGAE: encode graph -> latent space, decode via inner product.
# `data` is assumed to be a torch_geometric.data.Data object.
model = VGAE(Encoder(data.num_features, 32, 16))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(200):
    model.train()
    optimizer.zero_grad()
    z = model.encode(data.x, data.edge_index)
    loss = model.recon_loss(z, data.edge_index) + model.kl_loss() / data.num_nodes
    loss.backward()
    optimizer.step()

# Generate: sample from the latent space, decode to edges
model.eval()
with torch.no_grad():
    z = torch.randn(100, 16)         # 100 nodes, 16-dim latent
    adj = torch.sigmoid(z @ z.T)     # inner-product decoding
    edges = (adj > 0.5).nonzero().T  # threshold to get binary edges
```

VGAE encodes graphs to a smooth latent space. Sampling latent vectors and decoding via the inner product generates new graph topologies.
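A practical post-processing detail: the inner-product decoder is symmetric and scores every node highly against itself, so raw samples contain self-loops and each edge twice. A minimal cleanup sketch (variable names match the generation snippet above):

```python
import torch

torch.manual_seed(0)
z = torch.randn(100, 16)                    # sampled latent vectors
adj = torch.sigmoid(z @ z.T)                # symmetric inner-product scores
adj.fill_diagonal_(0)                       # drop self-loops
triu = torch.triu(adj, diagonal=1) > 0.5    # keep each pair once
src, dst = triu.nonzero(as_tuple=True)
edge_index = torch.cat([torch.stack([src, dst]),
                        torch.stack([dst, src])], dim=1)  # undirected
```

The result is a PyG-style `edge_index` with no self-loops and both directions of each undirected edge, ready to wrap in a `Data` object.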
Concrete example: synthetic transaction graphs for testing
A financial institution needs realistic transaction graphs for testing a new fraud detection system but cannot use real customer data due to privacy regulations:
- Train a VGAE on anonymized graph statistics from the real transaction network
- Generate 1,000 synthetic subgraphs that match the real network's degree distribution, clustering coefficient, and transaction amount distributions
- Inject known fraud patterns into a subset of generated subgraphs
- Use the synthetic dataset to develop and test the fraud detection model
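Step two of this workflow implies a validation check: do the generated subgraphs actually match the real network's statistics? A minimal sketch of that comparison using only adjacency matrices (the random graphs here are placeholders for real and synthetic samples):

```python
import numpy as np

def degree_hist(adj, max_deg=10):
    # Normalized degree distribution, capped at max_deg.
    deg = adj.sum(axis=1).astype(int)
    return np.bincount(np.clip(deg, 0, max_deg), minlength=max_deg + 1) / len(deg)

def global_clustering(adj):
    # Triangles over connected triplets, via powers of the adjacency matrix.
    triangles = np.trace(np.linalg.matrix_power(adj, 3))
    deg = adj.sum(axis=1)
    triplets = (deg * (deg - 1)).sum()
    return triangles / triplets if triplets > 0 else 0.0

def random_graph(rng, n=50, p=0.1):        # placeholder for a real sample
    a = (rng.uniform(size=(n, n)) < p).astype(int)
    a = np.triu(a, 1)
    return a + a.T                          # undirected, no self-loops

rng = np.random.default_rng(0)
real, synth = random_graph(rng), random_graph(rng)
l1_gap = np.abs(degree_hist(real) - degree_hist(synth)).sum()
cc_gap = abs(global_clustering(real) - global_clustering(synth))
```

In practice these gaps would be averaged over many real and synthetic subgraphs, and transaction-amount distributions would be compared the same way on edge features.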
Limitations and what comes next
- Scale: Current methods work well for graphs with 10-100 nodes (molecules, small networks). Generating enterprise graphs with millions of nodes requires hierarchical or compositional approaches.
- Validity: Generated graphs must satisfy domain constraints (valid molecular valency, realistic transaction amounts, plausible degree distributions). Incorporating hard constraints into generation is an active research area.
- Evaluation: Measuring the quality of generated graphs is harder than for images. Common metrics (MMD on graph statistics, FID on embeddings) capture some aspects but miss others.
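To make the MMD metric mentioned above concrete: it compares two samples of graph-statistic vectors (e.g. degree histograms, one per graph) under a kernel. A small sketch with a Gaussian kernel and synthetic placeholder statistics:

```python
import numpy as np

def gaussian_mmd(x, y, sigma=1.0):
    """Biased squared-MMD estimate between two samples of statistic
    vectors, using a Gaussian kernel."""
    def k(a, b):
        d = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

rng = np.random.default_rng(0)
real_stats = rng.normal(0.0, 1.0, size=(30, 5))  # placeholder statistics
same = rng.normal(0.0, 1.0, size=(30, 5))        # same distribution
shifted = rng.normal(2.0, 1.0, size=(30, 5))     # different distribution
# gaussian_mmd(real_stats, shifted) should exceed
# gaussian_mmd(real_stats, same)
```

MMD near zero means the generated statistics are distributionally close to the real ones, but it only sees the statistics you feed it, which is exactly the limitation noted above.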