
GCNConv: The Default GNN Layer (And When to Move Beyond It)

GCNConv is the most widely used graph neural network layer. It is also the one most teams should start with and eventually outgrow. Here is how it works, where it shines, and where it hits a wall.


TL;DR

  • GCNConv computes a weighted sum of neighbor features (plus the node itself) using symmetric degree normalization 1/sqrt(deg_i * deg_j), then applies a linear transform. One line of PyG code. The simplest GNN layer and the right starting point for most tasks.
  • Use 2-3 layers max. Each layer propagates information one hop. Beyond 3-4 layers, over-smoothing makes all node representations converge to the same value.
  • GCNConv assumes homogeneous, undirected graphs with equally important neighbors. When neighbors vary in importance (fraud detection, recommendations), switch to GATConv or TransformerConv.
  • On Cora (the standard citation benchmark), GCNConv achieves ~81% accuracy. GAT gets ~83%. The gap widens on heterogeneous and large-scale graphs.
  • KumoRFM uses production-grade graph transformer layers (descendants of GCNConv) that handle heterogeneous relational data, temporal dynamics, and billion-node graphs automatically.

Original Paper

Semi-Supervised Classification with Graph Convolutional Networks

Thomas N. Kipf, Max Welling (2016). ICLR 2017


What GCNConv does

GCNConv performs one step of neighborhood aggregation. For each node in the graph, it:

  1. Collects the feature vectors of all neighboring nodes, including its own via a self-loop
  2. Computes a weighted sum using symmetric degree normalization: each neighbor j's contribution is scaled by 1/sqrt(deg(i) * deg(j)), which prevents high-degree nodes from dominating
  3. Applies a learnable linear transformation (weight matrix)

That is the entire operation. Stack two GCNConv layers and each node has information from its 2-hop neighborhood. Stack three and it sees 3 hops. The model learns which combinations of neighbor features are predictive for the downstream task.

The math (simplified)

For node i with neighbors N(i):

GCNConv formula
h_i' = W · ( Σ (1 / √(deg(i) · deg(j))) · h_j )
       for all j in N(i) ∪ {i}

Where:
  h_i   = current feature vector of node i
  h_i'  = updated feature vector after this layer
  W     = learnable weight matrix (the only parameters)
  deg() = node degree (number of connections)
  N(i)  = neighbors of node i

The symmetric normalization 1/√(deg(i)·deg(j)) prevents nodes with many connections from dominating. The self-loop (the {i} in the union) ensures the node retains its own features.
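To make the formula concrete, here is the update for the middle node of a tiny three-node path graph a - b - c, worked out in plain Python. The graph, the scalar features, and the choice of W as the identity are all invented for illustration:

```python
import math

# Toy path graph a - b - c, scalar features, W fixed to the identity.
# With self-loops added, degrees are: deg(a)=2, deg(b)=3, deg(c)=2.
h = {"a": 1.0, "b": 2.0, "c": 3.0}
deg = {"a": 2, "b": 3, "c": 2}
neighbors = {"a": ["a", "b"], "b": ["a", "b", "c"], "c": ["b", "c"]}

def gcn_update(i):
    """One GCNConv step for node i (W = identity, no bias)."""
    return sum(h[j] / math.sqrt(deg[i] * deg[j]) for j in neighbors[i])

print(round(gcn_update("b"), 4))
```

Note how b's own feature (2.0) is scaled by 1/3 (its full degree), while the contributions from its lower-degree neighbors are each scaled by 1/√6 ≈ 0.41: the higher a node's degree, the more each individual contribution is damped.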

PyG implementation

In PyTorch Geometric, GCNConv is a single import and a single line in your model:

gcn_model.py
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class GCN(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, out_channels)

    def forward(self, x, edge_index):
        # Layer 1: aggregate 1-hop neighbors
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, p=0.5, training=self.training)

        # Layer 2: aggregate 2-hop neighbors
        x = self.conv2(x, edge_index)
        return x

# Usage on Cora dataset
from torch_geometric.datasets import Planetoid

dataset = Planetoid(root='/tmp/Cora', name='Cora')
data = dataset[0]

model = GCN(dataset.num_features, 16, dataset.num_classes)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Training loop
model.train()
for epoch in range(200):
    optimizer.zero_grad()
    out = model(data.x, data.edge_index)
    loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()

Complete GCN training on the Cora citation dataset. 15 lines of model code. The entire training loop fits on one screen.
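One thing the script above omits is evaluation. A small helper along these lines (hypothetical, not part of PyG itself) computes accuracy over the nodes selected by a boolean mask such as data.test_mask:

```python
import torch

@torch.no_grad()
def evaluate(model, data, mask):
    """Accuracy over the nodes selected by a boolean mask (e.g. data.test_mask)."""
    model.eval()
    # argmax over the class logits gives the predicted label per node
    pred = model(data.x, data.edge_index).argmax(dim=-1)
    correct = (pred[mask] == data.y[mask]).sum()
    return int(correct) / int(mask.sum())

# After the training loop above:
# print(f'Test accuracy: {evaluate(model, data, data.test_mask):.4f}')
```

A correctly trained 2-layer GCN on Cora typically lands in the ~81% range reported in the benchmark section below.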

When to use GCNConv

GCNConv is the right choice when:

  • You want a baseline. Start with GCN, measure performance, then try more complex layers to see if they help. Many practitioners skip this step and go straight to GAT, wasting time on unnecessary complexity.
  • Your graph is homogeneous and undirected. All nodes are the same type, all edges are the same type, and edges go both ways. Citation networks, social networks, and co-purchase graphs fit this pattern.
  • Speed matters more than peak accuracy. GCNConv is the fastest GNN layer because it uses simple matrix multiplication with no attention computation. On large graphs (millions of nodes), this speed advantage compounds.
  • You need 2-3 hops of context. With 2 layers, GCNConv captures 2-hop patterns efficiently. If your task requires local neighborhood information (node classification, community detection), this is often sufficient.

When to move beyond GCNConv

GCNConv has three structural limitations that become apparent on real enterprise data:

1. Over-smoothing (depth limit)

Each GCNConv layer averages neighbor features. Stack too many layers and all nodes end up with nearly identical representations. On Cora, accuracy peaks at 2-3 layers (~81%) and drops to ~30% at 8 layers. The entire graph gets “smoothed” into one uniform representation.
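The collapse is easy to reproduce. The sketch below (a toy 4-node cycle, W fixed to the identity, pure Python) applies the normalized averaging step repeatedly and tracks the spread (max - min) of the node features:

```python
# Repeated symmetric-normalized averaging on a 4-node cycle with self-loops,
# W = identity -- a minimal sketch of over-smoothing: all node features
# collapse toward a common value as layers stack.
neighbors = {0: [0, 1, 3], 1: [1, 0, 2], 2: [2, 1, 3], 3: [3, 2, 0]}
h = [0.0, 1.0, 2.0, 3.0]  # initial scalar features

def spread(x):
    return max(x) - min(x)

spreads = [spread(h)]
for _ in range(10):
    # every node has degree 3 here, so 1/sqrt(3*3) = 1/3 for every pair
    h = [sum(h[j] for j in neighbors[i]) / 3 for i in range(4)]
    spreads.append(spread(h))

print(spreads[0], spreads[1], spreads[-1])  # spread shrinks toward 0
```

On this graph each layer shrinks the spread by roughly a factor of 3; after ten layers the four nodes are effectively indistinguishable. Real graphs are less regular, but the same contraction is what drives the accuracy collapse.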

Fix: GCN2Conv adds skip connections that preserve each node's original features through deep layers. GPS and TransformerConv use attention to selectively weight information.

2. Equal neighbor treatment

GCNConv weights neighbors only by degree (how many connections they have). It cannot learn that some neighbors are more important than others for a specific task. In fraud detection, a transaction to a known-suspicious merchant should carry far more weight than a routine grocery purchase. GCNConv treats them identically.

Fix: GATConv and GATv2Conv learn attention weights per edge, automatically down-weighting irrelevant neighbors and up-weighting informative ones.

3. Homogeneous assumption

GCNConv assumes all nodes and edges are the same type. Enterprise relational databases have multiple table types (customers, orders, products) connected by different relationship types (purchased, reviewed, returned). GCNConv applies the same transformation to all of them.

Fix: RGCNConv, HGTConv, and HeteroConv handle heterogeneous graphs with type-specific transformations per node and edge type.

Benchmark performance

On standard benchmarks, GCNConv is competitive but not state-of-the-art. Here is where it stands:

  • Cora (citation, 2,708 nodes): ~81.5% accuracy. GAT: ~83.0%. GCN2Conv: ~82.5%. TransformerConv: ~83.2%.
  • CiteSeer (citation, 3,327 nodes): ~70.3% accuracy. GAT: ~72.5%.
  • PubMed (citation, 19,717 nodes): ~79.0% accuracy. GAT: ~79.0% (tied at this scale).
  • Reddit (social, 232K nodes): GCNConv is competitive here because the graph is large and homogeneous, which is GCN's sweet spot.

On heterogeneous or enterprise-scale graphs (where KumoRFM operates), the gap widens significantly. KumoRFM's Relational Graph Transformer achieves 76.71 AUROC on RelBench vs 62.44 for flat-table LightGBM, a gap that simple GCNConv cannot close because it lacks heterogeneous support and attention mechanisms.

How KumoRFM builds on this

KumoRFM's architecture is a direct descendant of GCNConv. The core insight (aggregate neighbor information, apply a transformation, stack layers) is the same. But where GCNConv uses fixed averaging with one weight matrix, KumoRFM's Relational Graph Transformer uses:

  • Learned attention instead of fixed averaging (from TransformerConv)
  • Type-specific transformations for different tables and relationships (from HGTConv / RGCNConv)
  • Temporal encodings so the model knows when events happened, not just that they happened
  • Schema-agnostic encoding so it works on any database without architecture changes

The result: you get the accuracy of a state-of-the-art graph transformer without writing any PyG code. One line of PQL replaces the model definition, training loop, and inference pipeline.

Frequently asked questions

What is GCNConv in PyTorch Geometric?

GCNConv is the implementation of the Graph Convolutional Network layer from Kipf & Welling (2016) in PyTorch Geometric. It performs a single step of neighborhood aggregation: each node's representation is updated by averaging the features of its neighbors (plus itself), followed by a linear transformation. It is the most widely used GNN layer and the default starting point for graph learning tasks.

When should I use GCNConv vs GATConv?

Use GCNConv when your graph is homogeneous, undirected, and you want a fast, simple baseline. Use GATConv when neighbor importance varies (e.g., in fraud detection, where suspicious connections carry more signal than routine ones). GATConv learns attention weights per edge, which adds expressiveness at the cost of ~2-3x more computation.

How many GCNConv layers should I stack?

Typically 2-3 layers. Each layer lets information propagate one hop further in the graph. With 2 layers, each node sees its 2-hop neighborhood. Beyond 3-4 layers, GCN suffers from over-smoothing: all node representations converge to the same value, destroying discriminative signal. Use skip connections (GCN2Conv) or attention (GATConv) if you need deeper models.

What is the difference between GCNConv and GraphConv?

GCNConv uses symmetric normalization (dividing by the square root of both source and target degrees) and a single shared weight matrix. GraphConv (from Morris et al., 2018) uses separate weight matrices for the node itself and its neighbors, giving it slightly more expressiveness. In practice, the accuracy difference is small, but GraphConv is more flexible for heterogeneous settings.
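In the notation of the math section above, the difference is one extra weight matrix:

GCNConv:   h_i' = W · Σ (1/√(deg(i)·deg(j))) · h_j    for j in N(i) ∪ {i}
GraphConv: h_i' = W1 · h_i + W2 · Σ h_j               for j in N(i)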

Can GCNConv handle directed graphs?

GCNConv was designed for undirected graphs. Its symmetric normalization 1/sqrt(deg_i * deg_j) assumes edges are bidirectional. On directed graphs, this normalization becomes ill-defined. For directed graphs, consider SAGEConv (which uses separate self/neighbor transforms), GATConv (which computes attention per directed edge), or DirGNNConv (which explicitly separates incoming from outgoing messages).

Learn more about graph ML

PyTorch Geometric is the open-source foundation for graph neural networks. Explore more layers, concepts, and production patterns.