
Graph Autoencoders: Encode Structure, Decode Edges

A graph autoencoder compresses graph structure into compact latent vectors, then reconstructs the adjacency matrix from those vectors. This unsupervised approach produces node embeddings and enables link prediction without any task labels.


TL;DR

  • A graph autoencoder (GAE) encodes nodes into latent vectors using a GNN, then decodes by predicting edges from vector similarity (dot product). Training reconstructs the adjacency matrix.
  • VGAE (variational) extends GAE by learning a distribution over embeddings instead of point estimates. KL divergence regularization produces a smoother, more useful latent space.
  • Primary uses: link prediction (predict missing edges), unsupervised node embedding (for clustering, visualization, downstream tasks), and graph generation (VGAE can sample new graphs).
  • PyG provides GAE and VGAE classes. Wrap any GNN encoder, call encode() for embeddings and recon_loss() for training. No labels required.
  • Enterprise applications: predicting missing supplier relationships, discovering potential collaborations, finding unlinked related entities in knowledge graphs.

A graph autoencoder encodes graph structure into a latent space, then decodes to reconstruct the original edges. The encoder is a GNN that maps each node to a low-dimensional vector. The decoder predicts whether an edge exists between two nodes based on the similarity of their vectors. Nodes that are connected in the original graph should have similar embeddings. This is unsupervised: no task labels are needed.

Graph autoencoders serve two purposes: producing high-quality node embeddings for downstream tasks, and performing link prediction to discover missing connections. Both matter for enterprise applications, where the observed graph is almost always incomplete.

Architecture

A graph autoencoder has two components:

  • Encoder: a GNN (typically 2-layer GCN) that maps each node's features and neighborhood to a latent vector z_i.
  • Decoder: predicts edge probability as sigmoid(z_i · z_j). Nodes with similar embeddings are predicted to be connected.
graph_autoencoder.py
import torch
from torch_geometric.nn import GCNConv, GAE

class GCNEncoder(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, out_dim)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index).relu()
        return self.conv2(x, edge_index)

# Wrap encoder in GAE
encoder = GCNEncoder(16, 32, 16)
model = GAE(encoder)

# Encode: produce node embeddings
# (data is any Data object with x and edge_index, e.g. from a PyG dataset)
z = model.encode(data.x, data.edge_index)
# z.shape: [num_nodes, 16]

# Decode: reconstruct edges
# model.decoder(z, edge_index) -> edge probabilities

# Loss: binary cross-entropy on edge reconstruction
loss = model.recon_loss(z, data.edge_index)

GAE wraps any GNN encoder. The decoder (inner product) and loss (reconstruction) are built in.

Variational graph autoencoder (VGAE)

VGAE extends GAE by making the encoder probabilistic. Instead of producing a single embedding z_i, the encoder produces a mean vector mu_i and a log standard deviation vector, which together define a Gaussian distribution per node. During training, the embedding is sampled from this distribution via the reparameterization trick, which keeps the sampling step differentiable.

variational_graph_autoencoder.py
from torch_geometric.nn import VGAE

class VGCNEncoder(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv_mu = GCNConv(hidden_dim, out_dim)       # mean
        self.conv_logstd = GCNConv(hidden_dim, out_dim)   # log standard deviation

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index).relu()
        return self.conv_mu(x, edge_index), self.conv_logstd(x, edge_index)

model = VGAE(VGCNEncoder(16, 32, 16))
z = model.encode(data.x, data.edge_index)

# VGAE loss = reconstruction + KL divergence
loss = model.recon_loss(z, data.edge_index) + (1 / data.num_nodes) * model.kl_loss()

VGAE produces distributions instead of point embeddings. KL loss regularizes the latent space.
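Under the hood, the sampling and the KL term have simple closed forms. A pure-PyTorch sketch, where mu and logstd stand in for the encoder's two outputs (this mirrors the expression PyG's kl_loss computes):

```python
import torch

# Stand-in encoder outputs for 5 nodes with 4-dim embeddings
mu = torch.zeros(5, 4)
logstd = torch.zeros(5, 4)  # log-std of 0, i.e. unit standard deviation

# Reparameterization trick: z = mu + eps * std keeps the sample differentiable
eps = torch.randn_like(mu)
z = mu + eps * logstd.exp()

# Closed-form KL divergence from N(mu, std^2) to the standard normal prior
kl = -0.5 * torch.mean(
    torch.sum(1 + 2 * logstd - mu.pow(2) - logstd.exp().pow(2), dim=1))
# With mu = 0 and std = 1 the posterior equals the prior, so KL is 0
```

The KL term pulls every node's posterior toward the standard normal prior, which is what keeps the latent space smooth and sampleable.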

Enterprise example: supplier relationship discovery

A manufacturing company has a known supplier graph: companies connected by existing supply relationships. But the graph is incomplete: many potential supplier relationships are unknown.

  • Nodes: 50,000 companies with features (industry, size, location, capabilities)
  • Known edges: 200,000 existing supply relationships
  • Goal: discover missing supplier relationships for supply chain diversification

Train a VGAE on the existing supplier graph. The encoder learns company embeddings that capture both features and graph position. Decode all possible pairs: company pairs with high predicted edge probability but no existing edge are candidate new supplier relationships. Ranked by score, the top candidates are the most structurally compatible companies that do not yet have a direct relationship.
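The scoring step can be sketched with plain tensor operations on the trained embeddings. Here z and the known edge list are tiny stand-ins; on 50,000 companies, scoring all pairs would be batched or restricted to plausible candidates rather than materialized as one dense matrix:

```python
import torch

torch.manual_seed(0)
num_nodes = 8
z = torch.randn(num_nodes, 4)                 # stand-in for trained embeddings
known = torch.tensor([[0, 1, 2], [1, 2, 3]])  # stand-in known supply edges

# Inner-product decoder over all pairs
probs = torch.sigmoid(z @ z.t())

# Exclude self-pairs and already-known edges from the candidate set
candidate = torch.ones(num_nodes, num_nodes, dtype=torch.bool)
candidate.fill_diagonal_(False)
candidate[known[0], known[1]] = False

# Top-scoring non-edges are the suggested new relationships
scored = probs.masked_fill(~candidate, float('-inf'))
top = torch.topk(scored.flatten(), k=5)
rows = top.indices // num_nodes
cols = top.indices % num_nodes
pairs = list(zip(rows.tolist(), cols.tolist()))
```

Each pair in `pairs` is a company pair the model considers structurally compatible but that has no existing edge.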

Link prediction evaluation

To evaluate a graph autoencoder for link prediction:

  1. Split edges into train (85%), validation (5%), and test (10%)
  2. Train the model on the training edges only
  3. Score held-out positive edges and an equal number of negative (non-existent) edges
  4. Compute AUC and Average Precision
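Step 4 can be made concrete with a small sketch: AUC is the probability that a randomly chosen positive edge scores higher than a randomly chosen negative one. The decoder scores below are made-up illustrations:

```python
import torch

# Made-up decoder scores for held-out positive and sampled negative edges
pos_scores = torch.tensor([0.9, 0.8, 0.4])
neg_scores = torch.tensor([0.7, 0.3, 0.2])

# Pairwise comparison: fraction of (pos, neg) pairs ranked correctly
auc = (pos_scores.unsqueeze(1) > neg_scores.unsqueeze(0)).float().mean()
# 8 of 9 pairs are ordered correctly, so AUC is 8/9
```

In practice, PyG's GAE.test(z, pos_edge_index, neg_edge_index) returns AUC and Average Precision directly from held-out edge sets.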

On the Cora citation network, VGAE achieves roughly 91% AUC for link prediction. Results on enterprise graphs vary with feature quality and graph density; richer node features generally help.

Frequently asked questions

What is a graph autoencoder?

A graph autoencoder (GAE) encodes graph structure into low-dimensional latent vectors (one per node), then reconstructs the adjacency matrix from these vectors. The encoder is a GNN that produces node embeddings. The decoder predicts edges by computing similarity (typically dot product) between node embeddings. Training minimizes reconstruction error.

What is the difference between GAE and VGAE?

GAE produces deterministic embeddings. VGAE (Variational Graph Autoencoder) produces a distribution (mean and variance) for each node's embedding and samples from it. VGAE adds a KL divergence regularization term that encourages smooth, well-organized latent spaces. VGAE generally produces better embeddings for downstream tasks.

What are graph autoencoders used for?

Three main uses: (1) Link prediction: the decoder predicts whether edges exist between nodes, finding missing connections. (2) Node embeddings: the encoder produces unsupervised node representations for clustering, visualization, or downstream classifiers. (3) Graph generation: the variational version can sample new graphs from the learned distribution.

How does the decoder work?

The simplest decoder is the inner product: for nodes i and j, the predicted edge probability is sigmoid(z_i dot z_j), where z_i and z_j are the encoder outputs. Nodes with similar embeddings are predicted to be connected. More complex decoders use MLPs or bilinear forms for richer scoring.

How do I train a graph autoencoder in PyG?

PyG provides GAE and VGAE classes in torch_geometric.nn. Create a GNN encoder, wrap it in GAE(encoder) or VGAE(encoder), then call model.encode() for embeddings and model.recon_loss() for the reconstruction loss. For VGAE, add model.kl_loss() as regularization.

Learn more about graph ML

PyTorch Geometric is the open-source foundation for graph neural networks. Explore more layers, concepts, and production patterns.