Message passing is the core operation of graph neural networks (GNNs). Each node in a graph updates its feature vector by collecting information from its direct neighbors, combining it with its own features, and producing a new representation. One round of message passing propagates information one hop. Stacking multiple layers lets each node see further into the graph.
This is the mechanism that allows GNNs to learn from relational structure. When your enterprise database has customers linked to orders linked to products linked to categories, message passing is how a customer node absorbs information from its entire relational neighborhood without manual feature engineering.
The three steps
Every message passing layer performs three operations:
Step 1: Message
Each neighbor j of node i computes a message. In the simplest case (GCNConv), the message is just the neighbor's feature vector. In more advanced layers (GATConv), the message is weighted by a learned attention score.
Step 2: Aggregate
Node i collects all incoming messages from its neighbors and combines them into a single vector. The aggregation must be permutation-invariant (the order of neighbors should not matter). Common choices:
- Sum: captures neighborhood size. GINConv uses this for maximum expressiveness.
- Mean: normalizes by degree, so neighborhood size is discarded. GCNConv uses a close relative: a symmetrically degree-normalized sum.
- Max: captures the most extreme signal. GraphSAGE supports this.
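A quick dependency-free sketch makes the trade-off concrete: mean cannot tell apart neighborhoods of different sizes when the features are identical, while sum can, and max surfaces the strongest single signal. The feature values below are illustrative assumptions.

```python
def agg_sum(msgs):
    return sum(msgs)

def agg_mean(msgs):
    return sum(msgs) / len(msgs)

def agg_max(msgs):
    return max(msgs)

two_orders = [1.0, 1.0]          # e.g. a customer with 2 identical orders
three_orders = [1.0, 1.0, 1.0]   # a customer with 3 identical orders

print(agg_mean(two_orders), agg_mean(three_orders))  # 1.0 1.0 -> indistinguishable
print(agg_sum(two_orders), agg_sum(three_orders))    # 2.0 3.0 -> size is captured
print(agg_max([0.2, 0.9]))                           # 0.9 -> strongest signal wins
```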
Step 3: Update
Node i combines the aggregated neighbor information with its own features and passes the result through a neural network (typically a linear layer + activation function) to produce its updated representation.
```
# Pseudocode for one message passing layer
for each node i in graph:
    # Step 1: Collect messages from neighbors
    messages = [message_fn(h[j]) for j in neighbors(i)]
    # Step 2: Aggregate messages
    agg = aggregate(messages)    # sum, mean, or max
    # Step 3: Update node representation
    h[i] = update_fn(h[i], agg)  # combine self + neighbors
```
The core loop. In practice, PyG vectorizes this over all nodes simultaneously using sparse matrix operations.
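The pseudocode above can be run as-is in plain Python on a toy graph. This is a minimal sketch: the graph, the 2-dimensional features, and the choice of update rule (elementwise mean of self and aggregate) are illustrative assumptions, not anything a real GNN would learn.

```python
def message_fn(h_j):
    # Step 1: the simplest message is the neighbor's feature vector itself
    return h_j

def aggregate(messages):
    # Step 2: permutation-invariant sum over incoming messages (2-dim features assumed)
    return [sum(vals) for vals in zip(*messages)] if messages else [0.0, 0.0]

def update_fn(h_i, agg):
    # Step 3: combine self features with the aggregate (here: elementwise average)
    return [(a + b) / 2 for a, b in zip(h_i, agg)]

def mp_layer(h, neighbors):
    return {
        i: update_fn(h[i], aggregate([message_fn(h[j]) for j in neighbors[i]]))
        for i in h
    }

# Toy undirected path graph: 0 - 1 - 2, with 2-dim node features
h = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
neighbors = {0: [1], 1: [0, 2], 2: [1]}
h_new = mp_layer(h, neighbors)
print(h_new[1])  # [1.0, 1.0]: node 1 has mixed in the sum of nodes 0 and 2
```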
Concrete example: a customer in a relational database
Consider an e-commerce database with three tables: customers, orders, and products. Represented as a graph:
- Nodes: each customer, order, and product is a node
- Edges: customer → order (placed), order → product (contains)
- Features: customer (age, location), order (amount, date), product (price, category)
Here is what happens during 2 layers of message passing for customer “Alice”:
Layer 1 (1-hop)
Alice's order nodes send their features (amount, date) to Alice. Alice aggregates these messages. After layer 1, Alice's representation encodes: “I am a customer who placed 3 orders averaging $67, most recently 5 days ago.”
Layer 2 (2-hop)
Now Alice's orders have already absorbed product information (from layer 1 of the product nodes). When they send messages to Alice in layer 2, those messages include product-level signals. After layer 2, Alice's representation encodes: “I am a customer who bought electronics and books, with a preference for premium brands, and my most recent order included a product that 40% of other buyers returned.”
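The hop-by-hop delay is easy to verify numerically. The sketch below uses a single scalar feature per node and a toy product → order → customer chain (all values are illustrative assumptions): the product's signal reaches the order after one layer, but reaches the customer only after two.

```python
def mp_step(h, in_neighbors):
    """One round of message passing; each node averages itself with incoming messages."""
    new_h = {}
    for i, feats in h.items():
        msgs = [h[j] for j in in_neighbors.get(i, [])]
        if not msgs:
            new_h[i] = feats  # no incoming edges: representation unchanged
        else:
            new_h[i] = (feats + sum(msgs)) / 2  # mix self with aggregated messages
    return new_h

# product -> order -> customer, one scalar feature per node
h = {"product": 8.0, "order": 0.0, "customer": 0.0}
in_neighbors = {"order": ["product"], "customer": ["order"]}

h1 = mp_step(h, in_neighbors)
h2 = mp_step(h1, in_neighbors)
print(h1["customer"], h2["customer"])  # 0.0 after one hop, 2.0 after two
```

After layer 1 the customer has only seen the order's original (empty) features; after layer 2 the message from the order already carries the product's signal.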
PyG implementation
In PyTorch Geometric, all GNN layers inherit from the MessagePassing base class. Here is a minimal custom message passing layer:
```python
import torch
from torch_geometric.nn import MessagePassing
from torch_geometric.utils import add_self_loops, degree

class SimpleMP(MessagePassing):
    def __init__(self, in_channels, out_channels):
        super().__init__(aggr='add')  # "add" aggregation
        self.lin = torch.nn.Linear(in_channels, out_channels)

    def forward(self, x, edge_index):
        # Add self-loops so nodes include their own features
        edge_index, _ = add_self_loops(edge_index, num_nodes=x.size(0))
        # Compute degree normalization
        row, col = edge_index
        deg = degree(col, x.size(0), dtype=x.dtype)
        deg_inv_sqrt = deg.pow(-0.5)
        norm = deg_inv_sqrt[row] * deg_inv_sqrt[col]
        # Start message passing
        return self.lin(self.propagate(edge_index, x=x, norm=norm))

    def message(self, x_j, norm):
        # x_j: features of neighbor nodes
        # norm: degree normalization weights
        return norm.view(-1, 1) * x_j
```
This is essentially GCNConv implemented from scratch. PyG's built-in GCNConv adds optimizations, but the logic is identical.
The three steps map directly to PyG methods:
- `message()` defines Step 1: what each neighbor sends
- `aggr='add'` in the constructor defines Step 2: how messages combine
- `self.lin()` after `propagate()` defines Step 3: the update
How different layers vary the recipe
Every GNN layer is a message passing variant. The differences are in what gets sent, how it gets combined, and what transformation is applied:
- GCNConv: message = neighbor features, aggregate = degree-normalized sum, update = linear transform. The simplest layer.
- GATConv: message = neighbor features weighted by learned attention scores, aggregate = weighted sum, update = linear transform. Lets the model learn which neighbors matter more.
- SAGEConv: message = sampled neighbor features, aggregate = mean/max/LSTM, update = concatenate self + aggregate, then linear. Designed for large graphs with neighbor sampling.
- GINConv: message = neighbor features, aggregate = sum (not mean), update = MLP. Achieves maximum expressiveness equal to the Weisfeiler-Leman graph isomorphism test.
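The GAT idea in particular can be sketched without any framework: each message is weighted by a softmax over per-neighbor scores before summing. The scores below are fixed numbers standing in for what GATConv would learn; the feature values are illustrative assumptions.

```python
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_aggregate(neighbor_feats, scores):
    # Weighted sum: high-scoring neighbors dominate the aggregate
    weights = softmax(scores)
    return sum(w * f for w, f in zip(weights, neighbor_feats))

feats = [10.0, 2.0, 2.0]   # one informative neighbor, two less relevant ones
scores = [2.0, 0.0, 0.0]   # higher score -> more attention

print(round(attention_aggregate(feats, scores), 3))  # pulled toward 10.0
```

With uniform scores this reduces to a plain mean; the learned scores are what let the model decide which neighbors matter more.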
Why this matters for enterprise predictions
Enterprise relational databases are graphs. Every foreign key is an edge. Every row is a node. Message passing on this graph structure discovers patterns that flat-table ML systematically misses:
- Multi-hop patterns: A customer's fraud risk depends on the fraud history of merchants they share with other customers. That is a 3-hop path. Message passing traverses it automatically.
- Structural patterns: A group of accounts that only transact with each other (a ring) looks different from accounts with diverse counterparties. Message passing captures graph topology.
- Temporal patterns: With time-aware message passing, nodes only receive messages from events that happened before the prediction timestamp. No data leakage.
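The temporal constraint amounts to a filter on the edge set before each round of message passing. A minimal sketch, with hypothetical edge data and a hypothetical `visible_edges` helper:

```python
edges = [  # (source, destination, timestamp)
    ("order_1", "alice", 3),
    ("order_2", "alice", 7),
    ("order_3", "alice", 12),
]

def visible_edges(edges, prediction_time):
    # Only edges that exist strictly before the prediction time may carry messages
    return [(s, d, t) for s, d, t in edges if t < prediction_time]

print(visible_edges(edges, prediction_time=10))
# order_1 and order_2 may send messages to alice; order_3 is in the future
```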
On the RelBench benchmark (7 databases, 30 tasks, 103 million rows), models using message passing on the relational graph achieve 75.83 AUROC vs 62.44 for flat-table LightGBM. KumoRFM, which uses a production-grade graph transformer (an evolution of message passing with attention), achieves 76.71 zero-shot and 81.14 fine-tuned.
The limitations (and what comes next)
Standard message passing has three well-known limitations:
- Over-smoothing: After 5-6 layers, all node representations converge because information gets averaged too many times. On Cora, accuracy peaks at 2-3 layers (~81.5%) and drops to ~30% at 8 layers.
- Over-squashing: Information from distant nodes must pass through bottleneck edges, getting compressed into fixed-size vectors. Signal degrades exponentially with distance.
- Expressiveness ceiling: Standard message passing is bounded by the 1-WL graph isomorphism test. Certain graph structures that humans can distinguish (e.g., 6-cycles vs two 3-cycles) are indistinguishable to message passing.
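Over-smoothing in particular can be demonstrated in a few lines: repeatedly averaging each node with its neighbors drives every representation toward the same value. The graph and starting features below are illustrative assumptions.

```python
def smooth_step(h, neighbors):
    # One round of mean aggregation including the node's own feature
    return {
        i: (h[i] + sum(h[j] for j in neighbors[i])) / (1 + len(neighbors[i]))
        for i in h
    }

# Path graph 0 - 1 - 2 with clearly distinct starting features
h = {0: 0.0, 1: 6.0, 2: 12.0}
neighbors = {0: [1], 1: [0, 2], 2: [1]}

def spread(h):
    return max(h.values()) - min(h.values())

print(spread(h))  # 12.0 initially
for _ in range(20):
    h = smooth_step(h, neighbors)
print(round(spread(h), 4))  # 0.0: the nodes are now effectively indistinguishable
```

Twenty rounds here stand in for a deep GNN stack; the same collapse is why accuracy on benchmarks like Cora degrades past a few layers.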
Graph transformers (like TransformerConv and GPSConv) address all three by allowing every node to attend directly to every other node in the subgraph, bypassing the hop-by-hop bottleneck. KumoRFM's Relational Graph Transformer is built on this architecture.