
Graph Generation: Creating New Graphs with Learned Distributions

Graph generation models learn the distribution of a dataset of graphs and sample new ones from it. This enables molecular design, synthetic data creation, and network simulation without exposing real data.


TL;DR

  • Graph generation creates new graphs that are statistically similar to a training set. The model learns over graph topology and features simultaneously.
  • Three approaches: VAEs (encode to latent space, decode back), autoregressive (generate node-by-node), and diffusion (iteratively denoise from random noise).
  • Enterprise applications: synthetic data generation (realistic transaction graphs without privacy risk), scenario simulation (stress-test fraud detectors), data augmentation (more examples of rare patterns).
  • Current methods work best for small graphs (10-100 nodes). Enterprise-scale generation requires subgraph-level generation stitched into larger networks.
  • In PyG: use VGAE (Variational Graph Autoencoder) for latent-space graph generation. Encode with GCNConv, decode with inner product to reconstruct edges.

Graph generation is the task of creating new, realistic graphs by learning the distribution of existing graphs and sampling from it. Unlike discriminative tasks (classification, regression) that predict properties of existing graphs, generative models produce entirely new graphs with plausible structure and features. This is the graph equivalent of image generation (DALL-E, Stable Diffusion) but with the added challenge that graphs have variable size, no spatial ordering, and discrete topology.

Why it matters for enterprise data

Enterprise graph generation addresses three practical needs:

  • Synthetic data for ML development: Generate realistic customer-transaction graphs that preserve statistical properties without containing real customer data. Useful for model development, testing, and third-party collaboration under privacy constraints.
  • Scenario simulation: Generate plausible fraud ring topologies, supply chain disruption patterns, or market crash cascades to stress-test detection and response systems.
  • Data augmentation: Rare patterns (fraud rings, network failures) have too few training examples. Generate additional realistic examples to address class imbalance and improve model robustness.

Three approaches to graph generation

Variational Graph Autoencoders (VGAE)

Encode a graph into a low-dimensional latent vector using a GNN encoder, then decode back to a graph. The latent space is regularized to be smooth (Gaussian prior), so sampling from it produces valid graphs.

Autoregressive generation

Generate graphs sequentially: add one node at a time, decide which edges to connect to existing nodes, assign features. Each step is conditioned on the graph built so far. Allows fine-grained control but is sequential and slow.
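The sequential loop can be sketched in a few lines. This is a toy illustration: `edge_prob` stands in for a learned network that would be conditioned on the partially built graph, and all names here are hypothetical, not from any library.

```python
import random

def generate_autoregressive(edge_prob, max_nodes, seed=0):
    """Toy node-by-node generation loop. In a real model, edge_prob(i, j)
    would be a trained network conditioned on the graph built so far;
    here it is a plain function (an assumption for illustration)."""
    rng = random.Random(seed)
    edges = []
    for i in range(1, max_nodes):          # add one node at a time
        for j in range(i):                 # decide edges to existing nodes
            if rng.random() < edge_prob(i, j):
                edges.append((j, i))
    return edges

# With probability 1 every candidate edge is added: a complete graph
complete = generate_autoregressive(lambda i, j: 1.0, max_nodes=8)
```

Because each node only connects backward to already-generated nodes, every partial graph is valid, which is what makes constraint checking at each step possible.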

Diffusion-based generation

Start from a random graph (noise) and iteratively denoise: refine node features, add/remove edges, and sharpen structure over many steps. Produces high-quality samples but requires many denoising steps.
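The denoising loop has this overall shape. A real diffusion model replaces `denoise_step` with a trained GNN that predicts clean edges from the noisy graph; the version below just pushes each edge probability toward 0 or 1 and is purely a structural sketch.

```python
import random

def denoise_step(adj_probs, step_size):
    """Stand-in denoiser: a trained model would predict clean edges here.
    This toy version nudges each edge probability toward 0 or 1."""
    def sharpen(p):
        return min(1.0, p + step_size) if p >= 0.5 else max(0.0, p - step_size)
    return [[sharpen(p) for p in row] for row in adj_probs]

rng = random.Random(0)
n, steps = 6, 10
noise = [[rng.random() for _ in range(n)] for _ in range(n)]
# Symmetrize the noise so the result is an undirected graph
adj = [[(noise[i][j] + noise[j][i]) / 2 for j in range(n)] for i in range(n)]

for _ in range(steps):                     # iterative refinement
    adj = denoise_step(adj, step_size=0.1)

edges = [(i, j) for i in range(n) for j in range(i + 1, n) if adj[i][j] > 0.5]
```

The many-step structure is exactly why diffusion produces high-quality samples but is slower than a single VAE decode.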

graph_vae.py
import torch
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GCNConv, VGAE

class Encoder(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim, latent_dim):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv_mu = GCNConv(hidden_dim, latent_dim)
        self.conv_logstd = GCNConv(hidden_dim, latent_dim)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index).relu()
        return self.conv_mu(x, edge_index), self.conv_logstd(x, edge_index)

# Any graph dataset works; Cora is used here as a stand-in example
data = Planetoid(root='data', name='Cora')[0]

# VGAE: encode graph -> latent space, decode via inner product
model = VGAE(Encoder(data.num_features, 32, 16))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(200):
    model.train()
    optimizer.zero_grad()  # clear gradients from the previous step
    z = model.encode(data.x, data.edge_index)
    loss = model.recon_loss(z, data.edge_index) + model.kl_loss() / data.num_nodes
    loss.backward()
    optimizer.step()

# Generate: sample from latent space, decode to edges
model.eval()
with torch.no_grad():
    z = torch.randn(100, 16)  # 100 nodes, 16-dim latent
    adj = torch.sigmoid(z @ z.T)  # inner product decoding
    # Threshold to get binary edges
    edges = (adj > 0.5).nonzero().T

VGAE encodes graphs to a smooth latent space. Sampling latent vectors and decoding via inner product generates new graph topologies.

Concrete example: synthetic transaction graphs for testing

A financial institution needs realistic transaction graphs for testing a new fraud detection system but cannot use real customer data due to privacy regulations:

  • Train a VGAE on anonymized graph statistics from the real transaction network
  • Generate 1,000 synthetic subgraphs that match the real network's degree distribution, clustering coefficient, and transaction amount distributions
  • Inject known fraud patterns into a subset of generated subgraphs
  • Use the synthetic dataset to develop and test the fraud detection model
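The fraud-injection step from the list above can be sketched as overlaying a dense clique on a generated subgraph. Fraud rings typically show up as unusually dense connectivity, which is what the clique mimics; the function and variable names here are illustrative, not from PyG.

```python
import itertools

def inject_fraud_ring(edges, ring_nodes):
    """Overlay a fully connected clique among ring_nodes onto a generated
    subgraph, mimicking the dense connectivity of a fraud ring.
    Names are hypothetical, for illustration only."""
    ring = set(itertools.combinations(sorted(ring_nodes), 2))
    return sorted(set(edges) | ring)

# A small generated subgraph plus an injected 3-node ring
synthetic = [(0, 1), (1, 2), (2, 3)]
labeled = inject_fraud_ring(synthetic, [4, 5, 6])
```

Subgraphs modified this way carry known positive labels, giving the fraud detector ground truth to train and evaluate against.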

Limitations and what comes next

  1. Scale: Current methods work well for graphs with 10-100 nodes (molecules, small networks). Generating enterprise graphs with millions of nodes requires hierarchical or compositional approaches.
  2. Validity: Generated graphs must satisfy domain constraints (valid molecular valency, realistic transaction amounts, plausible degree distributions). Incorporating hard constraints into generation is an active research area.
  3. Evaluation: Measuring the quality of generated graphs is harder than for images. Common metrics (MMD on graph statistics, FID on embeddings) capture some aspects but miss others.
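As a concrete (and deliberately simplified) version of comparing graph statistics, the sketch below measures the gap between degree histograms of two graphs. Real MMD-based evaluation applies a kernel over several statistics; a plain L1 gap is used here as a stand-in, and all names are hypothetical.

```python
from collections import Counter

def degree_histogram(edges, n):
    """Normalized degree histogram of an undirected edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    counts = Counter(deg[i] for i in range(n))
    return [counts[k] / n for k in range(max(counts) + 1)]

def histogram_gap(h1, h2):
    """L1 gap between two histograms -- a simple stand-in for MMD
    over graph statistics (the real metric uses a kernel)."""
    m = max(len(h1), len(h2))
    h1 = h1 + [0.0] * (m - len(h1))
    h2 = h2 + [0.0] * (m - len(h2))
    return sum(abs(a - b) for a, b in zip(h1, h2))

real = degree_histogram([(0, 1), (1, 2), (2, 0)], n=3)   # triangle
synth = degree_histogram([(0, 1), (1, 2)], n=3)          # path
gap = histogram_gap(real, synth)
```

A gap of zero means the two graphs have identical degree distributions, but as the section notes, matching one statistic does not imply matching the full graph distribution.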

Frequently asked questions

What is graph generation?

Graph generation is the task of creating new graphs that are statistically similar to a training set of graphs. The model learns the distribution over graph structures (nodes, edges, features) and samples new graphs from that distribution. Applications include drug discovery (generating novel molecular structures), synthetic data creation (generating realistic transaction networks for testing), and network design (generating optimal network topologies).

What are the main approaches to graph generation?

Three main approaches: (1) Variational Autoencoders (VAEs): encode graphs into latent space, decode back to graphs. (2) Autoregressive models: generate graphs node-by-node or edge-by-edge in sequence. (3) Diffusion models: start from noise and iteratively denoise to produce graphs. Each has trade-offs in quality, diversity, and computational cost.

How does graph generation differ from image generation?

Graphs have variable size (different numbers of nodes and edges), no spatial ordering, and discrete structure (edges exist or do not). Images have fixed dimensions, spatial ordering, and continuous pixel values. These differences make graph generation harder: the model must decide both graph topology (which edges exist) and node/edge features simultaneously.

How does graph generation apply to enterprise data?

Synthetic data generation: create realistic customer-transaction graphs for ML model development without exposing real customer data. Scenario simulation: generate plausible fraud ring topologies to stress-test detection systems. Data augmentation: generate additional training examples for rare graph patterns (e.g., fraud networks) to address class imbalance.

Can graph generation create enterprise-scale graphs?

Current graph generation methods work best for small-to-medium graphs (10-100 nodes). Generating enterprise-scale graphs with millions of nodes remains a research challenge. For synthetic enterprise data, a practical approach is to generate realistic local subgraphs (customer neighborhoods) and stitch them together, rather than generating the entire graph at once.
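The stitch-subgraphs approach amounts to offsetting each subgraph's node ids into a shared id space and then adding bridge edges between them. A minimal sketch, with entirely hypothetical names:

```python
def stitch_subgraphs(subgraphs, bridges):
    """Combine generated subgraphs into one graph by offsetting node ids,
    then add bridge edges between them. Names are illustrative.
    subgraphs: list of (num_nodes, edge_list)
    bridges: pairs of (subgraph_index, local_node_id)"""
    offset, merged, offsets = 0, [], []
    for num_nodes, edges in subgraphs:
        offsets.append(offset)
        merged += [(u + offset, v + offset) for u, v in edges]
        offset += num_nodes
    for (a, u), (b, v) in bridges:
        merged.append((offsets[a] + u, offsets[b] + v))
    return merged, offset

# Two 3-node subgraphs joined by one bridge edge
graph, total = stitch_subgraphs(
    [(3, [(0, 1), (1, 2)]), (3, [(0, 2)])],
    bridges=[((0, 2), (1, 0))],
)
```

The open research question is choosing the bridge edges so that global statistics (degree distribution, community structure) stay realistic, not just the local neighborhoods.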

Learn more about graph ML

PyTorch Geometric is the open-source foundation for graph neural networks. Explore more layers, concepts, and production patterns.