Why relational data needs heterogeneous graphs
A relational database with customers, orders, and products has three entity types connected by different relationship types. Flattening this into a single homogeneous graph (all nodes the same type) loses the structural information that makes relational data powerful.
Heterogeneous graphs preserve this structure. Each table maps to a node type, each foreign key maps to an edge type, and each column maps to a feature dimension. The GNN then learns type-specific transformations that respect the semantics of each relationship.
The conversion pipeline
Converting a relational database to a PyG HeteroData object involves four steps, each with production pitfalls:
Step 1: Schema mapping
Map each table to a node type and each foreign key to an edge type. For a typical e-commerce database:
from torch_geometric.data import HeteroData
data = HeteroData()
# Node types (one per table)
# customers: 100K rows -> 100K nodes
# products: 50K rows -> 50K nodes
# orders: 2M rows -> 2M nodes
# Edge types (one per foreign key)
# orders.customer_id -> ("order", "placed_by", "customer")
# orders.product_id -> ("order", "contains", "product")
# reviews.customer_id -> ("customer", "reviewed", "product")
Each table becomes a node type. Each foreign key becomes a directed edge type. Reverse edges are added by applying PyG's T.ToUndirected() transform.
Step 2: Feature encoding
This is where most production models break. Each column type needs different treatment:
- Numeric columns: Normalize to zero mean, unit variance. Handle nulls with a separate indicator feature, not zero-filling (which conflates “missing” with “zero”).
- Categorical columns: Use learned embeddings, not one-hot encoding. One-hot encoding on high-cardinality columns (like product IDs with 50K categories) creates 50K-dimensional sparse vectors that destroy training speed.
- Text columns: Encode with a pretrained model (sentence-transformers) into fixed-size vectors. Do not tokenize and pass raw tokens.
- Timestamps: Convert to relative time differences, cyclical encodings (hour-of-day, day-of-week), or temporal positional encodings. Raw Unix timestamps are meaningless to GNNs.
import torch
import numpy as np
from sklearn.preprocessing import StandardScaler
# Numeric features: fill nulls with the column mean, then normalize
# (zero-filling before scaling would map "missing" to an arbitrary
# z-score; the indicator below preserves the missingness signal)
scaler = StandardScaler()
numeric = df[numeric_cols]
numeric_feats = scaler.fit_transform(numeric.fillna(numeric.mean()).values)
# Null indicators (critical for production accuracy)
null_mask = df[numeric_cols].isnull().values.astype(np.float32)
# Combine: [normalized_values | null_indicators]
node_features = np.concatenate([numeric_feats, null_mask], axis=1)
data["customer"].x = torch.tensor(node_features, dtype=torch.float32)
Always add null indicator features. Models that zero-fill without indicators lose 2-5% accuracy on real-world data with missing values.
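Categorical and timestamp columns need their own encoders. A minimal sketch (the category values, hours, and embedding dimension are illustrative; the embedding table would be trained jointly with the GNN):

```python
import numpy as np
import torch
import torch.nn as nn

# Categorical column: map raw values to contiguous indices, then look
# them up in a learned embedding table instead of one-hot vectors
categories = ["electronics", "books", "books", "toys"]  # toy data
cat_vocab = {c: i for i, c in enumerate(dict.fromkeys(categories))}
cat_idx = torch.tensor([cat_vocab[c] for c in categories])
cat_emb = nn.Embedding(num_embeddings=len(cat_vocab), embedding_dim=16)
cat_features = cat_emb(cat_idx)  # shape [4, 16], learned end-to-end

# Timestamp column: cyclical hour-of-day encoding instead of raw epochs,
# so hour 23 and hour 0 end up close together in feature space
hours = np.array([0, 6, 12, 23], dtype=np.float32)
hour_sin = np.sin(2 * np.pi * hours / 24.0)
hour_cos = np.cos(2 * np.pi * hours / 24.0)
time_features = np.stack([hour_sin, hour_cos], axis=1)  # shape [4, 2]
```

A 50K-category column becomes a 16-dimensional dense lookup instead of a 50K-dimensional sparse vector, which is the difference between training in hours and training in days.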
Step 3: Edge construction
Build edge_index tensors from foreign keys. The critical detail: indices must be local to each node type, not global. If you have 100K customers and 50K products, customer indices run 0 to 99,999 and product indices run 0 to 49,999 independently.
# Map global IDs to local indices
customer_id_map = {gid: i for i, gid in enumerate(customers_df["id"])}
product_id_map = {gid: i for i, gid in enumerate(products_df["id"])}
# Build edge_index from foreign keys (for brevity, the order table is
# collapsed here into direct customer -> product edges)
src = [customer_id_map[cid] for cid in orders_df["customer_id"]]
dst = [product_id_map[pid] for pid in orders_df["product_id"]]
data["customer", "purchased", "product"].edge_index = torch.tensor(
[src, dst], dtype=torch.long
)
Step 4: Leakage audit
The most dangerous production pitfall. Feature leakage means your graph contains information that would not be available at prediction time. Common sources:
- Target-derived columns: A “total_spend” column that includes the order you are trying to predict.
- Future edges: Including orders that happen after the prediction timestamp.
- Aggregated statistics: Precomputed averages that incorporate the target period.
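A concrete guard against future-edge leakage is to filter every edge table by a cutoff timestamp before building edge_index. A sketch with an illustrative orders table (column names and dates are assumptions):

```python
import pandas as pd
import torch

orders_df = pd.DataFrame({
    "customer_id": [10, 11, 10, 12],
    "product_id":  [5, 6, 7, 5],
    "ts": pd.to_datetime(
        ["2024-01-05", "2024-02-10", "2024-03-01", "2024-03-15"]
    ),
})

cutoff = pd.Timestamp("2024-03-01")
# Keep only edges observable strictly before the prediction time
visible = orders_df[orders_df["ts"] < cutoff]

# Local index maps (built here from the orders table for self-containment)
customer_id_map = {g: i for i, g in enumerate(sorted(orders_df["customer_id"].unique()))}
product_id_map = {g: i for i, g in enumerate(sorted(orders_df["product_id"].unique()))}
src = [customer_id_map[c] for c in visible["customer_id"]]
dst = [product_id_map[p] for p in visible["product_id"]]
edge_index = torch.tensor([src, dst], dtype=torch.long)
```

The same cutoff must be applied to every derived feature (running totals, averages), not just to the edges, or the aggregated-statistics leak above survives the edge filter.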
Choosing the right layer for heterogeneous data
Once you have a HeteroData object, you need a layer that respects node and edge types:
- HeteroConv: Wraps any homogeneous layer (SAGEConv, GATConv) with per-edge-type instances. Most flexible, easiest to start with.
- RGCNConv: Per-relation weight matrices. Good for knowledge graphs with many relation types but few node types.
- HGTConv: Full heterogeneous graph transformer with type-specific attention. Best accuracy but highest compute cost.
What breaks in production
Three issues that only appear at scale:
- Schema changes: A new column or table in your database requires rebuilding the graph and retraining the model. This makes weekly model refreshes fragile.
- ID mapping drift: If customer IDs are not stable (reindexed, deleted), your edge_index becomes stale. You need a stable mapping layer.
- Memory at scale: A database with 100M rows across 10 tables produces a graph that does not fit in GPU memory. You need neighbor sampling (see the neighbor sampling guide).
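A stable mapping layer can be as simple as persisting the ID-to-index table and only ever appending to it. A sketch, with the JSON persistence format and function name as assumptions:

```python
import json
import tempfile
from pathlib import Path

def load_or_update_id_map(path: Path, current_ids: list) -> dict:
    """Append-only ID map: existing IDs keep their local index across
    rebuilds; new IDs get fresh indices at the end. Deleted IDs keep
    their slot so previously saved edge_index tensors stay valid."""
    id_map = json.loads(path.read_text()) if path.exists() else {}
    for gid in current_ids:
        if str(gid) not in id_map:
            id_map[str(gid)] = len(id_map)
    path.write_text(json.dumps(id_map))
    return id_map

# Usage: first build assigns indices 0..2; a later build where one
# customer was deleted and one added appends rather than reindexing
p = Path(tempfile.mkdtemp()) / "customer_ids.json"
m1 = load_or_update_id_map(p, [101, 102, 103])
m2 = load_or_update_id_map(p, [102, 103, 104])  # 101 deleted, 104 new
assert m2["102"] == m1["102"]                   # stable across rebuilds
```

The cost is that deleted IDs leave permanent gaps in the index space; a periodic full rebuild (with retraining) reclaims them.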