
Graph Autoencoders: Encode Structure, Decode Edges

A graph autoencoder compresses graph structure into compact latent vectors, then reconstructs the adjacency matrix from those vectors. This unsupervised approach produces node embeddings and enables link prediction without any task labels.


TL;DR

  • A graph autoencoder (GAE) encodes nodes into latent vectors using a GNN, then decodes by predicting edges from vector similarity (dot product). Training reconstructs the adjacency matrix.
  • VGAE (variational) extends GAE by learning a distribution over embeddings instead of point estimates. KL divergence regularization produces a smoother, more useful latent space.
  • Primary uses: link prediction (predict missing edges), unsupervised node embedding (for clustering, visualization, downstream tasks), and graph generation (VGAE can sample new graphs).
  • PyG provides GAE and VGAE classes. Wrap any GNN encoder, call encode() for embeddings and recon_loss() for training. No labels required.
  • Enterprise applications: predicting missing supplier relationships, discovering potential collaborations, finding unlinked related entities in knowledge graphs.

A graph autoencoder encodes graph structure into a latent space, then decodes to reconstruct the original edges. The encoder is a GNN that maps each node to a low-dimensional vector. The decoder predicts whether an edge exists between two nodes based on the similarity of their vectors. Nodes that are connected in the original graph should have similar embeddings. This is unsupervised: no task labels are needed.

Graph autoencoders serve two purposes: producing high-quality node embeddings for downstream tasks, and performing link prediction to discover missing connections. Both matter for enterprise applications, where the observed graph is almost always incomplete.

Architecture

A graph autoencoder has two components:

  • Encoder: a GNN (typically 2-layer GCN) that maps each node's features and neighborhood to a latent vector z_i.
  • Decoder: predicts edge probability as sigmoid(z_i · z_j). Nodes with similar embeddings are predicted to be connected.
graph_autoencoder.py
import torch
from torch_geometric.nn import GCNConv, GAE

class GCNEncoder(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, out_dim)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index).relu()
        return self.conv2(x, edge_index)

# Wrap encoder in GAE
encoder = GCNEncoder(16, 32, 16)
model = GAE(encoder)

# Encode: produce node embeddings
# (data is any Data object with x and edge_index, e.g. from a PyG dataset)
z = model.encode(data.x, data.edge_index)
# z.shape: [num_nodes, 16]

# Decode: reconstruct edges
# model.decoder(z, edge_index) -> edge probabilities

# Loss: binary cross-entropy on edge reconstruction
loss = model.recon_loss(z, data.edge_index)

GAE wraps any GNN encoder. The decoder (inner product) and loss (reconstruction) are built in.

Variational graph autoencoder (VGAE)

VGAE extends GAE by making the encoder probabilistic. Instead of producing a single embedding z_i, the encoder produces a mean vector mu_i and a log standard deviation vector, which together define a Gaussian distribution per node. During training, the embedding is sampled from this distribution via the reparameterization trick, which keeps the sampling step differentiable.

variational_graph_autoencoder.py
from torch_geometric.nn import VGAE

class VGCNEncoder(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv_mu = GCNConv(hidden_dim, out_dim)       # mean
        self.conv_logstd = GCNConv(hidden_dim, out_dim)   # log standard deviation

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index).relu()
        return self.conv_mu(x, edge_index), self.conv_logstd(x, edge_index)

model = VGAE(VGCNEncoder(16, 32, 16))
z = model.encode(data.x, data.edge_index)

# VGAE loss = reconstruction + KL divergence
loss = model.recon_loss(z, data.edge_index) + (1 / data.num_nodes) * model.kl_loss()

VGAE produces distributions instead of point embeddings. KL loss regularizes the latent space.
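Under the hood, the sampling and the KL term have simple closed forms. A pure-PyTorch sketch, where mu and logstd stand in for the encoder's two outputs (this mirrors the expression PyG's kl_loss computes):

```python
import torch

# Stand-in encoder outputs for 5 nodes with 4-dim embeddings
mu = torch.zeros(5, 4)
logstd = torch.zeros(5, 4)  # log-std of 0, i.e. unit standard deviation

# Reparameterization trick: z = mu + eps * std keeps the sample differentiable
eps = torch.randn_like(mu)
z = mu + eps * logstd.exp()

# Closed-form KL divergence from N(mu, std^2) to the standard normal prior
kl = -0.5 * torch.mean(
    torch.sum(1 + 2 * logstd - mu.pow(2) - logstd.exp().pow(2), dim=1))
# With mu = 0 and std = 1 the posterior equals the prior, so KL is 0
```

The KL term pulls every node's posterior toward the standard normal prior, which is what keeps the latent space smooth and sampleable.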

Enterprise example: supplier relationship discovery

A manufacturing company has a known supplier graph: companies connected by existing supply relationships. But the graph is incomplete: many potential supplier relationships are unknown.

  • Nodes: 50,000 companies with features (industry, size, location, capabilities)
  • Known edges: 200,000 existing supply relationships
  • Goal: discover missing supplier relationships for supply chain diversification

Train a VGAE on the existing supplier graph. The encoder learns company embeddings that capture both features and graph position. Decode all possible pairs: company pairs with high predicted edge probability but no existing edge are candidate new supplier relationships. Ranked by score, the top candidates are the most structurally compatible companies that do not yet have a direct relationship.
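The scoring step can be sketched with plain tensor operations on the trained embeddings. Here z and the known edge list are tiny stand-ins; on 50,000 companies, scoring all pairs would be batched or restricted to plausible candidates rather than materialized as one dense matrix:

```python
import torch

torch.manual_seed(0)
num_nodes = 8
z = torch.randn(num_nodes, 4)                 # stand-in for trained embeddings
known = torch.tensor([[0, 1, 2], [1, 2, 3]])  # stand-in known supply edges

# Inner-product decoder over all pairs
probs = torch.sigmoid(z @ z.t())

# Exclude self-pairs and already-known edges from the candidate set
candidate = torch.ones(num_nodes, num_nodes, dtype=torch.bool)
candidate.fill_diagonal_(False)
candidate[known[0], known[1]] = False

# Top-scoring non-edges are the suggested new relationships
scored = probs.masked_fill(~candidate, float('-inf'))
top = torch.topk(scored.flatten(), k=5)
rows = top.indices // num_nodes
cols = top.indices % num_nodes
pairs = list(zip(rows.tolist(), cols.tolist()))
```

Each pair in `pairs` is a company pair the model considers structurally compatible but that has no existing edge.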

Link prediction evaluation

To evaluate a graph autoencoder for link prediction:

  1. Split edges into train (85%), validation (5%), and test (10%)
  2. Train the model on the training edges only
  3. Score held-out positive edges and an equal number of negative (non-existent) edges
  4. Compute AUC and Average Precision
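Step 4 can be made concrete with a small sketch: AUC is the probability that a randomly chosen positive edge scores higher than a randomly chosen negative one. The decoder scores below are made-up illustrations:

```python
import torch

# Made-up decoder scores for held-out positive and sampled negative edges
pos_scores = torch.tensor([0.9, 0.8, 0.4])
neg_scores = torch.tensor([0.7, 0.3, 0.2])

# Pairwise comparison: fraction of (pos, neg) pairs ranked correctly
auc = (pos_scores.unsqueeze(1) > neg_scores.unsqueeze(0)).float().mean()
# 8 of 9 pairs are ordered correctly, so AUC is 8/9
```

In practice, PyG's GAE.test(z, pos_edge_index, neg_edge_index) returns AUC and Average Precision directly from held-out edge sets.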

On the Cora citation network, VGAE achieves roughly 91% AUC for link prediction. Results on enterprise graphs vary with feature quality and graph density; richer node features generally help.

Frequently asked questions

What is a graph autoencoder?

A graph autoencoder (GAE) encodes graph structure into low-dimensional latent vectors (one per node), then reconstructs the adjacency matrix from these vectors. The encoder is a GNN that produces node embeddings. The decoder predicts edges by computing similarity (typically dot product) between node embeddings. Training minimizes reconstruction error.

What is the difference between GAE and VGAE?

GAE produces deterministic embeddings. VGAE (Variational Graph Autoencoder) produces a distribution (mean and variance) for each node's embedding and samples from it. VGAE adds a KL divergence regularization term that encourages smooth, well-organized latent spaces. VGAE generally produces better embeddings for downstream tasks.

What are graph autoencoders used for?

Three main uses: (1) Link prediction: the decoder predicts whether edges exist between nodes, finding missing connections. (2) Node embeddings: the encoder produces unsupervised node representations for clustering, visualization, or downstream classifiers. (3) Graph generation: the variational version can sample new graphs from the learned distribution.

How does the decoder work?

The simplest decoder is the inner product: for nodes i and j, the predicted edge probability is sigmoid(z_i dot z_j), where z_i and z_j are the encoder outputs. Nodes with similar embeddings are predicted to be connected. More complex decoders use MLPs or bilinear forms for richer scoring.

How do I train a graph autoencoder in PyG?

PyG provides GAE and VGAE classes in torch_geometric.nn. Create a GNN encoder, wrap it in GAE(encoder) or VGAE(encoder), then call model.encode() for embeddings and model.recon_loss() for the reconstruction loss. For VGAE, add model.kl_loss() as regularization.

Learn more about graph ML

PyTorch Geometric is the open-source foundation for graph neural networks. Explore more layers, concepts, and production patterns.