Original Paper
Semi-Supervised Classification with Graph Convolutional Networks
Thomas N. Kipf, Max Welling (2016). ICLR 2017
What GCNConv does
GCNConv performs one step of neighborhood aggregation. For each node in the graph, it:
- Collects the feature vectors of all neighboring nodes (including itself via self-loop)
- Computes a weighted sum using symmetric degree normalization: each neighbor j's contribution is scaled by 1/sqrt(deg(i) * deg(j))
- This normalization prevents high-degree nodes from dominating
- Applies a learnable linear transformation (weight matrix)
That is the entire operation. Stack two GCNConv layers and each node has information from its 2-hop neighborhood. Stack three and it sees 3 hops. The model learns which combinations of neighbor features are predictive for the downstream task.
The math (simplified)
For node i with neighbors N(i):
h_i' = W · Σ_{j ∈ N(i) ∪ {i}} (1 / √(deg(i) · deg(j))) · h_j
Where:
h_i = current feature vector of node i
h_i' = updated feature vector after this layer
W = learnable weight matrix (the only parameters)
deg() = node degree (number of connections)
N(i) = neighbors of node i
The symmetric normalization (1/√(deg·deg)) prevents nodes with many connections from dominating. The self-loop ({i}) ensures the node retains its own features.
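The update rule can be traced by hand. Below is a small worked example in plain Python on a hypothetical 3-node graph (node 0 connected to nodes 1 and 2); the learnable W is omitted (equivalently, W = identity) to isolate the aggregation step:

```python
import math

# Tiny undirected graph: edges 0-1 and 0-2.
# With self-loops counted, degrees are deg(0)=3, deg(1)=2, deg(2)=2.
adj = {0: [1, 2], 1: [0], 2: [0]}
feats = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [2.0, 2.0]}

def deg(i):
    return len(adj[i]) + 1  # +1 for the self-loop

def aggregate(i):
    """Normalized sum over N(i) ∪ {i}; the weight matrix W is left out."""
    out = [0.0, 0.0]
    for j in adj[i] + [i]:
        coef = 1.0 / math.sqrt(deg(i) * deg(j))
        out = [o + coef * f for o, f in zip(out, feats[j])]
    return out

h0 = aggregate(0)
# h0 = [1/3 + 2/sqrt(6), 3/sqrt(6)] ≈ [1.150, 1.225]
```

Note that node 0's own feature enters with coefficient 1/deg(0) = 1/3, while each neighbor is scaled by 1/√(3·2): higher degree on either endpoint shrinks the contribution.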
PyG implementation
In PyTorch Geometric, GCNConv is a single import and a single line in your model:
```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class GCN(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, out_channels)

    def forward(self, x, edge_index):
        # Layer 1: aggregate 1-hop neighbors
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, p=0.5, training=self.training)
        # Layer 2: aggregate 2-hop neighbors
        x = self.conv2(x, edge_index)
        return x

# Usage on Cora dataset
from torch_geometric.datasets import Planetoid

dataset = Planetoid(root='/tmp/Cora', name='Cora')
data = dataset[0]
model = GCN(dataset.num_features, 16, dataset.num_classes)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Training loop
model.train()
for epoch in range(200):
    optimizer.zero_grad()
    out = model(data.x, data.edge_index)
    loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()
```
Complete GCN training on the Cora citation dataset. 15 lines of model code. The entire training loop fits on one screen.
When to use GCNConv
GCNConv is the right choice when:
- You want a baseline. Start with GCN, measure performance, then try more complex layers to see if they help. Many practitioners skip this step and go straight to GAT, wasting time on unnecessary complexity.
- Your graph is homogeneous and undirected. All nodes are the same type, all edges are the same type, and edges go both ways. Citation networks, social networks, and co-purchase graphs fit this pattern.
- Speed matters more than peak accuracy. GCNConv is among the fastest GNN layers because it reduces to simple sparse matrix multiplication with no attention computation. On large graphs (millions of nodes), this speed advantage compounds.
- You need 2-3 hops of context. With 2 layers, GCNConv captures 2-hop patterns efficiently. If your task requires local neighborhood information (node classification, community detection), this is often sufficient.
When to move beyond GCNConv
GCNConv has three structural limitations that become apparent on real enterprise data:
1. Over-smoothing (depth limit)
Each GCNConv layer averages neighbor features. Stack too many layers and all nodes end up with nearly identical representations. On Cora, accuracy peaks at 2-3 layers (~81%) and drops to ~30% at 8 layers. The entire graph gets “smoothed” into one uniform representation.
Fix: GCN2Conv adds skip connections that preserve each node's original features through deep layers. GPS/TransformerConv use attention to selectively weight information.
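Over-smoothing can be seen directly in the parameter-free aggregation itself. The sketch below (plain Python, a made-up 4-node path graph with one scalar feature per node) applies the normalized averaging repeatedly and measures how far apart the node features remain:

```python
import math

# Path graph 0-1-2-3 with self-loops; one scalar feature per node.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
deg = {i: len(js) + 1 for i, js in adj.items()}  # +1 for self-loop
h = {0: 1.0, 1: 0.0, 2: 0.0, 3: 0.0}             # only node 0 starts "hot"

def step(h):
    """One round of GCN aggregation with W omitted."""
    return {i: sum(h[j] / math.sqrt(deg[i] * deg[j])
                   for j in adj[i] + [i])
            for i in adj}

def spread(h):
    vals = list(h.values())
    return max(vals) - min(vals)

s0 = spread(h)      # 1.0: nodes start maximally distinguishable
for _ in range(8):
    h = step(h)
s8 = spread(h)      # far smaller: the features have nearly converged
```

After eight rounds the gap between the most and least "hot" node shrinks by roughly an order of magnitude; stacking learned layers suffers the same collapse, which is why skip connections help.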
2. Equal neighbor treatment
GCNConv weights neighbors only by degree (how many connections they have). It cannot learn that some neighbors are more important than others for a specific task. In fraud detection, a transaction to a known-suspicious merchant should carry far more weight than a routine grocery purchase. GCNConv treats them identically.
Fix: GATConv and GATv2Conv learn attention weights per edge, automatically down-weighting irrelevant neighbors and up-weighting informative ones.
3. Homogeneous assumption
GCNConv assumes all nodes and edges are the same type. Enterprise relational databases have multiple table types (customers, orders, products) connected by different relationship types (purchased, reviewed, returned). GCNConv applies the same transformation to all of them.
Fix: RGCNConv, HGTConv, and HeteroConv handle heterogeneous graphs with type-specific transformations per node and edge type.
Benchmark performance
On standard benchmarks, GCNConv is competitive but not state-of-the-art. Here is where it stands:
- Cora (citation, 2,708 nodes): ~81.5% accuracy. GAT: ~83.0%. GCN2Conv: ~82.5%. TransformerConv: ~83.2%.
- CiteSeer (citation, 3,327 nodes): ~70.3% accuracy. GAT: ~72.5%.
- PubMed (citation, 19,717 nodes): ~79.0% accuracy. GAT: ~79.0% (tied at this scale).
- Reddit (social, 232K nodes): GCNConv is competitive here because the graph is large and homogeneous, which is GCN's sweet spot.
On heterogeneous or enterprise-scale graphs (where KumoRFM operates), the gap widens significantly. KumoRFM's Relational Graph Transformer achieves 76.71 AUROC on RelBench vs 62.44 for flat-table LightGBM, a gap that simple GCNConv cannot close because it lacks heterogeneous support and attention mechanisms.
How KumoRFM builds on this
KumoRFM's architecture is a direct descendant of GCNConv. The core insight (aggregate neighbor information, apply a transformation, stack layers) is the same. But where GCNConv uses fixed averaging with one weight matrix, KumoRFM's Relational Graph Transformer uses:
- Learned attention instead of fixed averaging (from TransformerConv)
- Type-specific transformations for different tables and relationships (from HGTConv / RGCNConv)
- Temporal encodings so the model knows when events happened, not just that they happened
- Schema-agnostic encoding so it works on any database without architecture changes
The result: you get the accuracy of a state-of-the-art graph transformer without writing any PyG code. One line of PQL replaces the model definition, training loop, and inference pipeline.