
Skip Connections: Residual Connections That Fight Over-Smoothing

Skip connections add a layer's input directly to its output, preserving node-specific features as information flows through multiple GNN layers. They are the simplest and most effective technique for building deeper GNNs.

TL;DR

  • Skip connections add h_input to h_output at each GNN layer, preserving early-layer features alongside aggregated neighbor information. Cost: one addition operation.
  • They fight over-smoothing by maintaining access to less-smoothed representations from earlier layers. The model learns to balance local (skip) and aggregated (GNN) signal.
  • They extend practical GNN depth from 2-3 layers to 6-8 layers, enabling the model to capture 6-8 hop patterns in enterprise relational graphs.
  • Three types: additive (h + GNN(h)), concatenation ([h || GNN(h)]), and gated (alpha * GNN(h) + (1-alpha) * h). Additive is simplest and most common.
  • Always use them. The computational cost is negligible, performance consistently improves or stays the same, and there is no downside.

Skip connections in graph neural networks add a layer's input directly to its output (h_new = GNN_layer(h) + h), preserving early-layer node-specific features that would otherwise be smoothed away by repeated neighbor aggregation. Borrowed from ResNets in computer vision, skip connections are the simplest and most effective technique for building deeper GNNs. Without them, performance peaks at 2-3 layers. With them, practical depth extends to 6-8 layers, enabling GNNs to capture longer-range patterns in enterprise relational data.

Why it matters for enterprise data

Enterprise relational databases contain patterns that span multiple hops. A customer's fraud risk depends on merchants 3-4 hops away. A product's demand depends on supplier reliability 4-5 hops away. Capturing these patterns requires deeper GNNs, and skip connections are what make deeper GNNs practical.

Without skip connections, a 2-layer GNN can only see 2 hops. With skip connections enabling 6 layers, the model reaches 6 hops into the relational graph, capturing patterns across customer → order → product → category → product → order → customer chains.

How skip connections work

skip_connections.py
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class ResidualGCN(torch.nn.Module):
    """GCN with additive skip connections."""
    def __init__(self, in_dim, hidden_dim, out_dim, num_layers=6):
        super().__init__()
        self.input_proj = torch.nn.Linear(in_dim, hidden_dim)
        self.convs = torch.nn.ModuleList(
            [GCNConv(hidden_dim, hidden_dim) for _ in range(num_layers)]
        )
        self.output = torch.nn.Linear(hidden_dim, out_dim)

    def forward(self, x, edge_index):
        x = self.input_proj(x)  # project to hidden_dim

        for conv in self.convs:
            identity = x              # save input for skip connection
            x = conv(x, edge_index)   # message passing
            x = F.relu(x)
            x = x + identity          # SKIP CONNECTION: add input back
            x = F.dropout(x, p=0.3, training=self.training)

        return self.output(x)

# Without skip connections: accuracy drops after layer 3
# With skip connections: accuracy stable through layer 6-8

One line (x = x + identity) makes the difference between 2-3 usable layers and 6-8. The skip connection preserves the node's own information through multiple rounds of aggregation.

Three types of skip connections

Additive (residual)

h_new = GNN(h) + h. Requires input and output to have the same dimensions. Simplest and most common. Used in DeepGCN, GCNII.

Concatenation

h_new = Linear([GNN(h) || h]). Concatenates the GNN output with the input and projects back to the hidden dimension. Does not require matching dimensions. A closely related idea appears in JK-Net (Jumping Knowledge Networks), which concatenates the representations from all layers before the final prediction.

Gated

h_new = alpha * GNN(h) + (1-alpha) * h, where alpha is a learned scalar or vector. The model learns how much neighbor information to incorporate at each layer. Most flexible but adds parameters.

Concrete example: deeper churn detection on a social graph

A telecom company wants to capture social influence in churn prediction. The influence chain is customer A → friend B → friend C → friend D (3 hops of social contagion). Without skip connections, going deeper than 2 layers hurts: over-smoothing sets in before the extra depth pays off. With skip connections, a 4-layer GCN captures the full influence chain:

  • Without skip (2-layer GCN): captures 2-hop influence. AUROC: 72.1
  • Without skip (4-layer GCN): over-smoothing hurts. AUROC: 68.3
  • With skip (4-layer GCN): captures 4-hop influence. AUROC: 75.4

The 3-point AUROC improvement from skip connections represents thousands of correctly predicted churns in a 50M customer base.

Limitations and what comes next

  1. Depth ceiling remains: Skip connections extend depth from 2-3 to 6-8 layers, but not to 50+. Beyond 8 layers, over-smoothing still degrades the aggregated component, even though the skip preserves early features.
  2. Does not solve over-squashing: Skip connections preserve local features but do not create new information paths. Over-squashing from bottleneck edges requires graph rewiring.
  3. Lazy learning risk: If the model relies entirely on skip connections (ignoring neighbor aggregation), it degenerates to an MLP. Proper initialization and training prevent this.

Graph transformers combine skip connections with global attention, enabling arbitrarily deep models without over-smoothing. KumoRFM's architecture uses both local skip connections and global attention for maximum depth and range.

Frequently asked questions

What are skip connections in GNNs?

Skip connections (also called residual connections) add a layer's input directly to its output: h_new = GNN_layer(h) + h. This preserves the original node features alongside the aggregated neighbor information. In deep GNNs, skip connections prevent over-smoothing by maintaining access to early-layer representations that contain more node-specific (less smoothed) information.

How do skip connections fight over-smoothing?

Over-smoothing occurs because repeated aggregation averages away local differences. Skip connections add the pre-aggregation features back to the output at every layer. Even if the aggregated component becomes smooth after many layers, the skip connection preserves the original node-specific signal. The model can learn to weight the skip path more heavily when aggregation stops being helpful.
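This can be seen numerically with a deliberately worst-case toy: every node aggregates the global mean, so pure aggregation smooths everything almost immediately, while an additive skip (scaled 50/50 here purely for illustration) keeps node-specific signal alive. Plain torch, no PyG needed, and the dense adjacency is a stand-in for real message passing:

```python
import torch

torch.manual_seed(0)
n, d = 100, 16
x = torch.randn(n, d)
# worst-case row-normalized adjacency: every node averages over all nodes
adj = torch.full((n, n), 1.0 / n)

def spread(h):
    """Average distance of node features from the mean node (0 = fully smoothed)."""
    return (h - h.mean(dim=0)).norm(dim=1).mean().item()

plain, skip = x.clone(), x.clone()
for _ in range(6):
    plain = adj @ plain                      # pure aggregation: collapses fast
    skip = 0.5 * (adj @ skip) + 0.5 * skip   # aggregation + additive skip

print(f"no skip:   {spread(plain):.6f}")  # essentially zero
print(f"with skip: {spread(skip):.3f}")   # node-specific signal survives
```

Real GNN layers also apply learned transforms, which slows the collapse, but the mechanism is the same: the skip term re-injects the unsmoothed features at every layer.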

How much deeper can GNNs go with skip connections?

Without skip connections, GCN typically peaks at 2-3 layers. With skip connections, practical depth extends to 6-8 layers on most tasks. Beyond 8 layers, even skip connections cannot fully prevent over-smoothing, though performance degrades more gradually. For comparison, ResNets in computer vision go to 100+ layers with skip connections.

What types of skip connections work for GNNs?

Three main types: (1) Additive residual: h_new = GNN(h) + h (requires same dimensions). (2) Concatenation: h_new = [GNN(h) || h] followed by a linear projection. (3) Gated: h_new = alpha * GNN(h) + (1-alpha) * h where alpha is learned. Additive is simplest and most common. Gated connections give the model fine-grained control over how much neighbor information to incorporate.

When should I use skip connections in enterprise GNN models?

Always. Skip connections have almost no computational cost (one addition operation) and consistently improve or maintain performance. They are especially important when: (1) you need more than 2 GNN layers (multi-hop enterprise patterns), (2) your graph is dense (faster smoothing), or (3) you are using GCNConv (most susceptible to smoothing). There is essentially no reason not to use them.

Learn more about graph ML

PyTorch Geometric is the open-source foundation for graph neural networks. Explore more layers, concepts, and production patterns.