Why relational data needs heterogeneous graphs
A relational database with customers, orders, and products has three entity types connected by different relationship types. Flattening this into a single homogeneous graph (all nodes the same type) loses the structural information that makes relational data powerful.
Heterogeneous graphs preserve this structure. Each table maps to a node type, each foreign key maps to an edge type, and each column maps to a feature dimension. The GNN then learns type-specific transformations that respect the semantics of each relationship.
The conversion pipeline
Converting a relational database to a PyG HeteroData object involves four steps, each with production pitfalls:
Step 1: Schema mapping
Map each table to a node type and each foreign key to an edge type. For a typical e-commerce database:
from torch_geometric.data import HeteroData
data = HeteroData()
# Node types (one per table)
# customers: 100K rows -> 100K nodes
# products: 50K rows -> 50K nodes
# orders: 2M rows -> 2M nodes
# Edge types (one per foreign key)
# orders.customer_id -> ("order", "placed_by", "customer")
# orders.product_id -> ("order", "contains", "product")
# reviews.customer_id -> ("customer", "reviewed", "product")
Each table becomes a node type. Each foreign key becomes a directed edge type. Reverse edges are added by applying PyG's T.ToUndirected() transform.
Step 2: Feature encoding
This is where most production models break. Each column type needs different treatment:
- Numeric columns: Normalize to zero mean, unit variance. Handle nulls with a separate indicator feature, not zero-filling (which conflates “missing” with “zero”).
- Categorical columns: Use learned embeddings, not one-hot encoding. One-hot encoding on high-cardinality columns (like product IDs with 50K categories) creates 50K-dimensional sparse vectors that destroy training speed.
- Text columns: Encode with a pretrained model (sentence-transformers) into fixed-size vectors. Do not tokenize and pass raw tokens.
- Timestamps: Convert to relative time differences, cyclical encodings (hour-of-day, day-of-week), or temporal positional encodings. Raw Unix timestamps are meaningless to GNNs.
import torch
import numpy as np
from sklearn.preprocessing import StandardScaler
# Numeric features: fill nulls with the column mean, then normalize
# (zero-filling before scaling would map "missing" to an arbitrary
# z-score; the indicator below preserves the missingness signal)
scaler = StandardScaler()
numeric = df[numeric_cols]
numeric_feats = scaler.fit_transform(numeric.fillna(numeric.mean()).values)
# Null indicators (critical for production accuracy)
null_mask = df[numeric_cols].isnull().values.astype(np.float32)
# Combine: [normalized_values | null_indicators]
node_features = np.concatenate([numeric_feats, null_mask], axis=1)
data["customer"].x = torch.tensor(node_features, dtype=torch.float32)
Always add null indicator features. Models that zero-fill without indicators lose 2-5% accuracy on real-world data with missing values.
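Categorical and timestamp columns need their own encoders. A minimal sketch (the category values, hours, and embedding dimension are illustrative; the embedding table would be trained jointly with the GNN):

```python
import numpy as np
import torch
import torch.nn as nn

# Categorical column: map raw values to contiguous indices, then look
# them up in a learned embedding table instead of one-hot vectors
categories = ["electronics", "books", "books", "toys"]  # toy data
cat_vocab = {c: i for i, c in enumerate(dict.fromkeys(categories))}
cat_idx = torch.tensor([cat_vocab[c] for c in categories])
cat_emb = nn.Embedding(num_embeddings=len(cat_vocab), embedding_dim=16)
cat_features = cat_emb(cat_idx)  # shape [4, 16], learned end-to-end

# Timestamp column: cyclical hour-of-day encoding instead of raw epochs,
# so hour 23 and hour 0 end up close together in feature space
hours = np.array([0, 6, 12, 23], dtype=np.float32)
hour_sin = np.sin(2 * np.pi * hours / 24.0)
hour_cos = np.cos(2 * np.pi * hours / 24.0)
time_features = np.stack([hour_sin, hour_cos], axis=1)  # shape [4, 2]
```

A 50K-category column becomes a 16-dimensional dense lookup instead of a 50K-dimensional sparse vector, which is the difference between training in hours and training in days.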
Step 3: Edge construction
Build edge_index tensors from foreign keys. The critical detail: indices must be local to each node type, not global. If you have 100K customers and 50K products, customer indices run 0 to 99,999 and product indices run 0 to 49,999 independently.
# Map global IDs to local indices
customer_id_map = {gid: i for i, gid in enumerate(customers_df["id"])}
product_id_map = {gid: i for i, gid in enumerate(products_df["id"])}
# Build edge_index from foreign keys (for brevity, the order table is
# collapsed here into direct customer -> product edges)
src = [customer_id_map[cid] for cid in orders_df["customer_id"]]
dst = [product_id_map[pid] for pid in orders_df["product_id"]]
data["customer", "purchased", "product"].edge_index = torch.tensor(
[src, dst], dtype=torch.long
)
Step 4: Leakage audit
The most dangerous production pitfall. Feature leakage means your graph contains information that would not be available at prediction time. Common sources:
- Target-derived columns: A “total_spend” column that includes the order you are trying to predict.
- Future edges: Including orders that happen after the prediction timestamp.
- Aggregated statistics: Precomputed averages that incorporate the target period.
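A concrete guard against future-edge leakage is to filter every edge table by a cutoff timestamp before building edge_index. A sketch with an illustrative orders table (column names and dates are assumptions):

```python
import pandas as pd
import torch

orders_df = pd.DataFrame({
    "customer_id": [10, 11, 10, 12],
    "product_id":  [5, 6, 7, 5],
    "ts": pd.to_datetime(
        ["2024-01-05", "2024-02-10", "2024-03-01", "2024-03-15"]
    ),
})

cutoff = pd.Timestamp("2024-03-01")
# Keep only edges observable strictly before the prediction time
visible = orders_df[orders_df["ts"] < cutoff]

# Local index maps (built here from the orders table for self-containment)
customer_id_map = {g: i for i, g in enumerate(sorted(orders_df["customer_id"].unique()))}
product_id_map = {g: i for i, g in enumerate(sorted(orders_df["product_id"].unique()))}
src = [customer_id_map[c] for c in visible["customer_id"]]
dst = [product_id_map[p] for p in visible["product_id"]]
edge_index = torch.tensor([src, dst], dtype=torch.long)
```

The same cutoff must be applied to every derived feature (running totals, averages), not just to the edges, or the aggregated-statistics leak above survives the edge filter.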
Choosing the right layer for heterogeneous data
Once you have a HeteroData object, you need a layer that respects node and edge types:
- HeteroConv: Wraps any homogeneous layer (SAGEConv, GATConv) with per-edge-type instances. Most flexible, easiest to start with.
- RGCNConv: Per-relation weight matrices. Good for knowledge graphs with many relation types but few node types.
- HGTConv: Full heterogeneous graph transformer with type-specific attention. Best accuracy but highest compute cost.
What breaks in production
Three issues that only appear at scale:
- Schema changes: A new column or table in your database requires rebuilding the graph and retraining the model. This makes weekly model refreshes fragile.
- ID mapping drift: If customer IDs are not stable (reindexed, deleted), your edge_index becomes stale. You need a stable mapping layer.
- Memory at scale: A database with 100M rows across 10 tables produces a graph that does not fit in GPU memory. You need neighbor sampling (see the neighbor sampling guide).
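A stable mapping layer can be as simple as persisting the ID-to-index table and only ever appending to it. A sketch, with the JSON persistence format and function name as assumptions:

```python
import json
import tempfile
from pathlib import Path

def load_or_update_id_map(path: Path, current_ids: list) -> dict:
    """Append-only ID map: existing IDs keep their local index across
    rebuilds; new IDs get fresh indices at the end. Deleted IDs keep
    their slot so previously saved edge_index tensors stay valid."""
    id_map = json.loads(path.read_text()) if path.exists() else {}
    for gid in current_ids:
        if str(gid) not in id_map:
            id_map[str(gid)] = len(id_map)
    path.write_text(json.dumps(id_map))
    return id_map

# Usage: first build assigns indices 0..2; a later build where one
# customer was deleted and one added appends rather than reindexing
p = Path(tempfile.mkdtemp()) / "customer_ids.json"
m1 = load_or_update_id_map(p, [101, 102, 103])
m2 = load_or_update_id_map(p, [102, 103, 104])  # 101 deleted, 104 new
assert m2["102"] == m1["102"]                   # stable across rebuilds
```

The cost is that deleted IDs leave permanent gaps in the index space; a periodic full rebuild (with retraining) reclaims them.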