
Neighborhood Aggregation: How GNNs Collect Information from Graph Neighbors

Neighborhood aggregation is the single most important operation in graph neural networks. It determines what information each node can see and how that information is compressed into a fixed-size vector.


TL;DR

  1. Neighborhood aggregation collects feature vectors from all direct neighbors and combines them into one vector using a permutation-invariant function (sum, mean, or max).
  2. Sum aggregation preserves neighborhood size and is the most expressive. Mean normalizes by degree. Max captures dominant signals. The choice directly affects what patterns the GNN can learn.
  3. On enterprise relational data, aggregation over foreign-key edges replaces manual SQL GROUP BY operations. A customer node aggregating its order neighbors automatically computes spending patterns.
  4. Stacking aggregation layers extends reach: 1 layer = 1 hop, 2 layers = 2 hops. Most enterprise tasks peak at 2-3 layers before over-smoothing degrades performance.
  5. In PyG, aggregation is set via the aggr parameter in the MessagePassing constructor. Built-in layers like GCNConv, GATConv, and GINConv each use a different aggregation strategy.

Neighborhood aggregation is the operation by which each node in a graph neural network collects feature vectors from its direct neighbors and combines them into a single fixed-size representation. It is the mechanism that distinguishes GNNs from standard neural networks. Without aggregation, nodes would never learn from graph structure. Every GNN layer, from GCN to GAT to GIN, implements aggregation differently, and that choice determines what the model can and cannot learn.

Why it matters for enterprise data

Enterprise relational databases are collections of tables linked by foreign keys. Customers link to orders. Orders link to products. Products link to categories. When you represent this as a graph, each foreign key becomes an edge, and neighborhood aggregation becomes the mechanism that propagates information across tables.

Consider a churn prediction task. A flat-table approach requires a data scientist to manually write SQL aggregations: average order value per customer, count of returns in the last 30 days, most frequent product category. Neighborhood aggregation computes these patterns automatically. The customer node aggregates its order neighbors, and the learned aggregation discovers which patterns matter for predicting churn.
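To make the parallel concrete, here is a minimal sketch (with hypothetical order records) of the kind of GROUP BY feature a data scientist would hand-write; mean aggregation over customer-order edges computes the same quantity, except the GNN learns which such patterns matter rather than requiring them to be specified up front:

```python
# Hypothetical order records keyed by customer_id (illustrative data only).
orders = [
    {"customer_id": 1, "amount": 50.0},
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 1, "amount": 30.0},
    {"customer_id": 2, "amount": 80.0},
]

def avg_order_value(orders):
    """The SQL
        SELECT customer_id, AVG(amount) FROM orders GROUP BY customer_id
    written out in plain Python."""
    totals, counts = {}, {}
    for o in orders:
        cid = o["customer_id"]
        totals[cid] = totals.get(cid, 0.0) + o["amount"]
        counts[cid] = counts.get(cid, 0) + 1
    return {cid: totals[cid] / counts[cid] for cid in totals}

print(avg_order_value(orders))  # → {1: 66.666..., 2: 80.0}
```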

On the RelBench benchmark, GNNs using learned neighborhood aggregation on relational graphs achieve 75.83 AUROC compared to 62.44 for flat-table LightGBM across 30 enterprise tasks.

How it works: three aggregation functions

The aggregation function must be permutation-invariant: the result must be the same regardless of the order in which neighbors are processed. Three functions dominate:

Sum aggregation

Adds all neighbor feature vectors element-wise. A node with 10 neighbors produces an aggregated vector of larger magnitude than a node with 2, so sum preserves information about neighborhood size. That is critical for tasks like fraud detection, where high-degree nodes (accounts with many transactions) behave differently from low-degree nodes. GINConv uses sum aggregation because it gives the layer discriminative power matching the 1-dimensional Weisfeiler-Leman test.

Mean aggregation

Averages all neighbor vectors. This normalizes by node degree, so a customer with 100 orders and a customer with 5 orders produce comparable representations. GCNConv uses a degree-normalized variant of mean aggregation. Mean works well when the distribution of neighbor features matters more than the count.

Max aggregation

Takes the element-wise maximum across all neighbor vectors. This acts like a filter that captures the most extreme signal in each feature dimension. GraphSAGE supports max aggregation as an option. It is useful when a single outlier neighbor (e.g., one very large transaction) is more informative than the average.

Concrete example: customer spend aggregation

Consider a database with a customers table and an orders table linked by customer_id. Represented as a graph:

  • Customer nodes: features = [age, tenure_months, region]
  • Order nodes: features = [amount, item_count, days_since]
  • Edges: customer → order (placed_by)

Customer “Alice” has 3 orders with amounts [$50, $120, $30]. After one layer of aggregation on the amount dimension:

  • Sum: $200 (total spend)
  • Mean: $66.67 (average order value)
  • Max: $120 (largest single order)
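These three numbers, and the permutation-invariance requirement, can be checked directly in plain Python:

```python
import random

amounts = [50.0, 120.0, 30.0]  # Alice's three order amounts

agg = {
    "sum": sum(amounts),                  # 200.0  – total spend
    "mean": sum(amounts) / len(amounts),  # ≈66.67 – average order value
    "max": max(amounts),                  # 120.0  – largest single order
}

# Permutation invariance: shuffling the neighbor order changes nothing.
shuffled = amounts[:]
random.shuffle(shuffled)
assert sum(shuffled) == agg["sum"]
assert max(shuffled) == agg["max"]
```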

PyG implementation

In PyTorch Geometric, aggregation is controlled by the aggr parameter in the MessagePassing constructor:

aggregation_comparison.py
import torch
from torch_geometric.nn import MessagePassing

class SumAggLayer(MessagePassing):
    """Sum aggregation - preserves neighborhood size."""
    def __init__(self, in_dim, out_dim):
        super().__init__(aggr='add')  # sum aggregation
        self.lin = torch.nn.Linear(in_dim, out_dim)

    def forward(self, x, edge_index):
        return self.lin(self.propagate(edge_index, x=x))

    def message(self, x_j):
        return x_j  # send neighbor features as-is

class MeanAggLayer(MessagePassing):
    """Mean aggregation - normalizes by degree."""
    def __init__(self, in_dim, out_dim):
        super().__init__(aggr='mean')  # mean aggregation
        self.lin = torch.nn.Linear(in_dim, out_dim)

    def forward(self, x, edge_index):
        return self.lin(self.propagate(edge_index, x=x))

    def message(self, x_j):
        return x_j

# PyG also supports: aggr='max', aggr='softmax', aggr='powermean'
# Or use MultiAggregation to combine multiple:
from torch_geometric.nn import MultiAggregation
multi_aggr = MultiAggregation(['sum', 'mean', 'max'])

The only difference between these layers is the aggr parameter. PyG handles the rest.
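What propagate does with aggr='add' can be sketched without PyG at all. The following is an illustrative pure-Python scatter-add over a toy COO edge list (the node features and edges are made up for the example); it is conceptually what the framework vectorizes on the GPU:

```python
# Toy graph in COO format: edge_index[0] holds sources, edge_index[1] targets.
x = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]  # features of nodes 0, 1, 2
edge_index = [[0, 1, 1], [2, 2, 0]]       # edges 0→2, 1→2, 1→0

def sum_aggregate(x, edge_index):
    """Scatter-add each source node's features into its target's slot —
    the core of message passing with sum aggregation."""
    out = [[0.0] * len(x[0]) for _ in x]
    for src, dst in zip(edge_index[0], edge_index[1]):
        for d in range(len(x[0])):
            out[dst][d] += x[src][d]
    return out

print(sum_aggregate(x, edge_index))
# Node 2 receives nodes 0 and 1: [1.0, 1.0]; node 0 receives node 1: [0.0, 1.0].
```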

Limitations and what comes next

Neighborhood aggregation has inherent constraints:

  1. Information loss: Compressing an arbitrary number of neighbor vectors into one fixed-size vector necessarily discards information. A customer with 1,000 orders loses detail that a customer with 5 orders preserves.
  2. Over-smoothing: Repeated aggregation across layers causes all node representations to converge. After 5-6 layers, nodes become indistinguishable. This limits practical depth to 2-3 layers.
  3. Expressiveness bounds: Mean and max aggregation cannot distinguish certain multisets of neighbor features. Only sum aggregation (as in GINConv) achieves maximum expressiveness under the Weisfeiler-Leman test.
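The expressiveness gap is easy to demonstrate with one-dimensional neighbor features:

```python
def mean_agg(vals):
    return sum(vals) / len(vals)

# Two different neighborhoods that mean aggregation cannot tell apart:
a = [1.0, 2.0]
b = [1.0, 1.0, 2.0, 2.0]
assert mean_agg(a) == mean_agg(b)  # both 1.5

# Two different neighborhoods that max aggregation cannot tell apart:
c = [1.0, 3.0]
d = [1.0, 1.0, 3.0]
assert max(c) == max(d)  # both 3.0

# Sum separates both pairs, because it keeps multiplicity information.
assert sum(a) != sum(b) and sum(c) != sum(d)
```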

Graph rewiring and skip connections mitigate over-smoothing. Graph transformers bypass local aggregation entirely by allowing every node to attend to every other node, removing the neighborhood bottleneck.

Frequently asked questions

What is neighborhood aggregation in graph neural networks?

Neighborhood aggregation is the operation where each node collects feature vectors from all its direct neighbors and combines them into a single vector using a permutation-invariant function like sum, mean, or max. This aggregated vector is then used to update the node's own representation. It is the core mechanism that allows GNNs to learn from graph structure.

Why must aggregation be permutation-invariant?

Graphs have no natural ordering of neighbors. Node A's neighbors {B, C, D} are the same set regardless of how you list them. If aggregation depended on order, the same graph could produce different outputs depending on arbitrary node numbering. Permutation-invariant functions (sum, mean, max) guarantee consistent results regardless of neighbor ordering.

What is the difference between sum, mean, and max aggregation?

Sum preserves information about neighborhood size (a node with 10 neighbors produces a larger vector than one with 2). Mean normalizes by degree, treating high-degree and low-degree nodes equally. Max captures the most extreme feature value, acting like a filter for dominant signals. GINConv uses sum for maximum expressiveness; GCNConv uses degree-normalized mean; GraphSAGE supports all three.

How does neighborhood aggregation apply to enterprise relational data?

In a relational database, foreign keys define edges. When you aggregate a customer node's order neighbors, you are automatically computing features like total spend, average order value, and purchase frequency without writing SQL aggregation queries. Multi-hop aggregation (customer → orders → products) captures cross-table patterns that flat-table ML requires manual feature engineering to replicate.

How many hops of aggregation should I use?

Most tasks perform best with 2-3 hops (layers). Each layer lets each node see one hop further. Two layers let a customer see its orders and their products. Beyond 5-6 layers, over-smoothing causes all node representations to converge, destroying useful signal. Graph transformers bypass this limitation with direct long-range attention.
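The "one layer, one hop" rule can be sketched as a receptive-field computation on a toy customer-order-product chain (the node ids here are hypothetical):

```python
def receptive_field(adj, node, hops):
    """Set of nodes visible to `node` after `hops` layers of aggregation."""
    seen = {node}
    frontier = {node}
    for _ in range(hops):
        frontier = {nbr for n in frontier for nbr in adj.get(n, [])} - seen
        seen |= frontier
    return seen

# Toy chain: customer → order → product (hypothetical ids).
adj = {"alice": ["order1"], "order1": ["productA"]}

assert receptive_field(adj, "alice", 1) == {"alice", "order1"}
assert receptive_field(adj, "alice", 2) == {"alice", "order1", "productA"}
```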

Learn more about graph ML

PyTorch Geometric is the open-source foundation for graph neural networks. Explore more layers, concepts, and production patterns.