
Message Passing: How Graph Neural Networks Actually Work

Message passing is the single operation that makes graph neural networks different from every other neural network. Every GNN layer, from GCN to GAT to GraphSAGE, is a variation of the same idea: let nodes talk to their neighbors.

TL;DR

  • Message passing is a three-step operation: each neighbor sends a message, the node aggregates all messages (sum, mean, or max), then updates its own representation. One layer = one hop of information.
  • Stack 2 GCNConv layers and each node sees its 2-hop neighborhood. This is how a customer node learns from its orders, which in turn learned from their products. Multi-table patterns emerge automatically.
  • Different GNN layers are different message passing variants. GCN averages messages. GAT weights them by attention. GIN sums them for maximum expressiveness. The framework is the same.
  • Over-smoothing limits depth to 2-3 layers in practice. After 5-6 layers, all nodes converge to the same representation. Graph transformers solve this with direct long-range attention.
  • In PyG, all GNN layers inherit from MessagePassing. You define message(), aggregate(), and update(). The framework handles batching, sparse operations, and GPU acceleration.

Message passing is the core operation of graph neural networks (GNNs). Each node in a graph updates its feature vector by collecting information from its direct neighbors, combining it with its own features, and producing a new representation. One round of message passing propagates information one hop. Stacking multiple layers lets each node see further into the graph.

This is the mechanism that allows GNNs to learn from relational structure. When your enterprise database has customers linked to orders linked to products linked to categories, message passing is how a customer node absorbs information from its entire relational neighborhood without manual feature engineering.

The three steps

Every message passing layer performs three operations:

Step 1: Message

Each neighbor j of node i computes a message. In the simplest case (GCNConv), the message is just the neighbor's feature vector. In more advanced layers (GATConv), the message is weighted by a learned attention score.

Step 2: Aggregate

Node i collects all incoming messages from its neighbors and combines them into a single vector. The aggregation must be permutation-invariant (the order of neighbors should not matter). Common choices:

  • Sum: captures neighborhood size. GINConv uses this for maximum expressiveness.
  • Mean: normalizes by degree. GCNConv uses degree-weighted mean.
  • Max: captures the most extreme signal. GraphSAGE supports this.
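The permutation-invariance requirement is easy to check directly in plain Python (a toy sketch, not PyG code):

aggregation_invariance.py

```python
# Toy check: reordering a node's incoming messages must not change
# the aggregated result for sum, mean, or max.
messages = [3.0, 1.0, 4.0]   # messages from three neighbors
shuffled = [4.0, 3.0, 1.0]   # same messages, different arrival order

assert sum(messages) == sum(shuffled)                                  # sum
assert sum(messages) / len(messages) == sum(shuffled) / len(shuffled)  # mean
assert max(messages) == max(shuffled)                                  # max

# A non-invariant choice like "take the first message" would fail:
assert messages[0] != shuffled[0]
```

Any function with this property (sum, mean, max, and learned combinations of them) is a valid aggregator; anything order-sensitive is not.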

Step 3: Update

Node i combines the aggregated neighbor information with its own features and passes the result through a neural network (typically a linear layer + activation function) to produce its updated representation.

message_passing_pseudocode.py
# Pseudocode for one message passing layer
for each node i in graph:
    # Step 1: Collect messages from neighbors
    messages = [message_fn(h[j]) for j in neighbors(i)]

    # Step 2: Aggregate messages
    agg = aggregate(messages)  # sum, mean, or max

    # Step 3: Update node representation
    h[i] = update_fn(h[i], agg)  # combine self + neighbors

The core loop. In practice, PyG vectorizes this over all nodes simultaneously using sparse matrix operations.
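The vectorized view of mean aggregation is a matrix product, H_new = D⁻¹AH, where A is the adjacency matrix and D the degree matrix. A dense toy version (PyG uses sparse scatter operations instead of dense matrices):

dense_matrix_view.py

```python
# 3-node path graph: 0 -- 1 -- 2, adjacency matrix A, feature matrix H.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
H = [[1.0], [2.0], [3.0]]

deg = [sum(row) for row in A]  # node degrees: [1, 2, 1]

# Mean aggregation for every node at once: H_new = D^-1 A H
H_new = [[sum(A[i][j] * H[j][0] for j in range(3)) / deg[i]]
         for i in range(3)]
print(H_new)  # [[2.0], [2.0], [2.0]] -- each node now holds its neighbor mean
```
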

Concrete example: a customer in a relational database

Consider an e-commerce database with three tables: customers, orders, and products. Represented as a graph:

  • Nodes: each customer, order, and product is a node
  • Edges: customer → order (placed), order → product (contains)
  • Features: customer (age, location), order (amount, date), product (price, category)

Here is what happens during 2 layers of message passing for customer “Alice”:

Layer 1 (1-hop)

Alice's order nodes send their features (amount, date) to Alice. Alice aggregates these messages. After layer 1, Alice's representation encodes: “I am a customer who placed 3 orders averaging $67, most recently 5 days ago.”

Layer 2 (2-hop)

Now Alice's orders have already absorbed product information (from layer 1 of the product nodes). When they send messages to Alice in layer 2, those messages include product-level signals. After layer 2, Alice's representation encodes: “I am a customer who bought electronics and books, with a preference for premium brands, and my most recent order included a product that 40% of other buyers returned.”
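The two-hop effect can be reproduced with a toy mean-aggregation step on a three-node chain (a plain-Python sketch standing in for customer → order → product, with a made-up update rule):

two_hop_toy.py

```python
# Hypothetical 3-node chain:
# node 0 (customer) -- node 1 (order) -- node 2 (product).
neighbors = {0: [1], 1: [0, 2], 2: [1]}

def mp_step(h):
    # One message passing layer: mean of neighbor features,
    # averaged 50/50 with the node's own feature (toy update rule).
    return [0.5 * h[i] + 0.5 * sum(h[j] for j in neighbors[i]) / len(neighbors[i])
            for i in range(len(h))]

h0 = [0.0, 0.0, 8.0]   # only the product node carries signal
h1 = mp_step(h0)       # layer 1: the order absorbs product signal; customer still 0
h2 = mp_step(h1)       # layer 2: the product signal reaches the customer
print(h1[0], h2[0])    # 0.0 1.0
```

After one layer the customer node is unchanged; after two, information from the product node two hops away has arrived.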

PyG implementation

In PyTorch Geometric, all GNN layers inherit from the MessagePassing base class. Here is a minimal custom message passing layer:

custom_mp_layer.py
import torch
from torch_geometric.nn import MessagePassing
from torch_geometric.utils import add_self_loops, degree

class SimpleMP(MessagePassing):
    def __init__(self, in_channels, out_channels):
        super().__init__(aggr='add')  # "add" aggregation
        self.lin = torch.nn.Linear(in_channels, out_channels)

    def forward(self, x, edge_index):
        # Add self-loops so nodes include their own features
        edge_index, _ = add_self_loops(edge_index, num_nodes=x.size(0))

        # Compute degree normalization
        row, col = edge_index
        deg = degree(col, x.size(0), dtype=x.dtype)
        deg_inv_sqrt = deg.pow(-0.5)
        norm = deg_inv_sqrt[row] * deg_inv_sqrt[col]

        # Start message passing
        return self.lin(self.propagate(edge_index, x=x, norm=norm))

    def message(self, x_j, norm):
        # x_j: features of neighbor nodes
        # norm: degree normalization weights
        return norm.view(-1, 1) * x_j

This is essentially GCNConv implemented from scratch. PyG's built-in GCNConv adds optimizations but the logic is identical.

The three steps map directly to PyG methods:

  • message() defines Step 1: what each neighbor sends
  • aggr='add' in the constructor defines Step 2: how messages combine
  • self.lin() after propagate() defines Step 3: the update

How different layers vary the recipe

Every GNN layer is a message passing variant. The differences are in what gets sent, how it gets combined, and what transformation is applied:

  • GCNConv: message = neighbor features, aggregate = degree-normalized sum, update = linear transform. The simplest layer.
  • GATConv: message = neighbor features weighted by learned attention scores, aggregate = weighted sum, update = linear transform. Lets the model learn which neighbors matter more.
  • SAGEConv: message = sampled neighbor features, aggregate = mean/max/LSTM, update = concatenate self + aggregate, then linear. Designed for large graphs with neighbor sampling.
  • GINConv: message = neighbor features, aggregate = sum (not mean), update = MLP. Achieves maximum expressiveness equal to the Weisfeiler-Leman graph isomorphism test.

Why this matters for enterprise predictions

Enterprise relational databases are graphs. Every foreign key is an edge. Every row is a node. Message passing on this graph structure discovers patterns that flat-table ML systematically misses:

  • Multi-hop patterns: A customer's fraud risk depends on the fraud history of merchants they share with other customers. That is a 3-hop path. Message passing traverses it automatically.
  • Structural patterns: A group of accounts that only transact with each other (a ring) looks different from accounts with diverse counterparties. Message passing captures graph topology.
  • Temporal patterns: With time-aware message passing, nodes only receive messages from events that happened before the prediction timestamp. No data leakage.

On the RelBench benchmark (7 databases, 30 tasks, 103 million rows), models using message passing on the relational graph achieve 75.83 AUROC vs 62.44 for flat-table LightGBM. KumoRFM, which uses a production-grade graph transformer (an evolution of message passing with attention), achieves 76.71 zero-shot and 81.14 fine-tuned.

The limitations (and what comes next)

Standard message passing has three well-known limitations:

  1. Over-smoothing: After 5-6 layers, all node representations converge because information gets averaged too many times. On Cora, accuracy peaks at 2-3 layers (~81.5%) and drops to ~30% at 8 layers.
  2. Over-squashing: Information from distant nodes must pass through bottleneck edges, getting compressed into fixed-size vectors. Signal degrades exponentially with distance.
  3. Expressiveness ceiling: Standard message passing is bounded by the 1-WL graph isomorphism test. Certain graph structures that humans can distinguish (e.g., 6-cycles vs two 3-cycles) are indistinguishable to message passing.
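Over-smoothing is easy to reproduce with repeated mean aggregation on a tiny graph (a plain-Python sketch with no learned weights, which isolates the averaging effect):

over_smoothing_toy.py

```python
# 4-node path graph with self-loops; repeated neighborhood averaging.
neighbors = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2, 3], 3: [2, 3]}
h = [1.0, 0.0, 0.0, -1.0]   # initially distinct node features

def smooth(h):
    # One layer of pure mean aggregation (no transform, no nonlinearity).
    return [sum(h[j] for j in neighbors[i]) / len(neighbors[i])
            for i in range(len(h))]

for _ in range(200):         # stack many "layers"
    h = smooth(h)

spread = max(h) - min(h)
print(spread)                # ~0: all node representations have collapsed
```

Real GNN layers add learned transforms and nonlinearities, which slows but does not stop this collapse; hence the practical 2-3 layer sweet spot.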

Graph transformers (like TransformerConv and GPSConv) solve all three by allowing every node to attend directly to every other node in the subgraph, bypassing the hop-by-hop bottleneck. KumoRFM's Relational Graph Transformer is built on this architecture.

Frequently asked questions

What is message passing in graph neural networks?

Message passing is the fundamental operation in graph neural networks where each node updates its feature vector by collecting (aggregating) information from its direct neighbors, then combining it with its own features through a learnable transformation. One round of message passing lets each node see one hop away. Stacking multiple layers lets nodes see further: 2 layers = 2 hops, 3 layers = 3 hops.

How does message passing work step by step?

Three steps per layer: (1) Message: each neighbor computes a message from its current features, (2) Aggregate: the node collects all incoming messages using a permutation-invariant function (sum, mean, or max), (3) Update: the node combines the aggregated message with its own features through a neural network to produce its new representation.

What is the difference between message passing and graph convolution?

Graph convolution (GCNConv) is one specific instance of message passing where the message function is identity, the aggregation is degree-normalized sum, and the update is a linear transformation. Message passing is the general framework that encompasses GCN, GAT, GraphSAGE, GIN, and all other spatial GNN layers. In PyG, all convolutional layers inherit from the MessagePassing base class.

Why does message passing work on relational enterprise data?

Enterprise databases are relational: customers link to orders, orders link to products, products link to categories. When you represent this as a graph (rows = nodes, foreign keys = edges), message passing lets each entity learn from its full relational context. A customer node receives messages from its order nodes, which received messages from product nodes. This captures multi-table patterns that flat-table ML cannot see.

What are the limitations of message passing?

Three main limitations: (1) Over-smoothing: after 5-6 layers, all node representations converge to the same value because information gets averaged too many times. (2) Over-squashing: information from distant nodes gets compressed into fixed-size vectors, losing signal. (3) Expressiveness ceiling: standard message passing cannot distinguish certain graph structures (bounded by the Weisfeiler-Leman graph isomorphism test). Graph transformers address all three.

Learn more about graph ML

PyTorch Geometric is the open-source foundation for graph neural networks. Explore more layers, concepts, and production patterns.