CNNs and GNNs share the same core idea: local aggregation with shared weights. But they apply it to fundamentally different data structures. A CNN convolves a fixed-size kernel over a regular pixel grid, where every pixel has exactly the same neighborhood structure. A GNN aggregates information from a variable-size neighborhood on an irregular graph, where node A might have 3 neighbors and node B might have 300.
Regular vs irregular structure
Images: the regular grid
An image is a 2D grid. Every interior pixel has exactly 8 neighbors (3x3 neighborhood). The spatial relationships are fixed: the pixel above is always above, the pixel to the right is always to the right. This regularity enables:
- Fixed-size kernels: a 3x3 kernel has exactly 9 weights, applied identically everywhere
- Translation equivariance: the same pattern is detected anywhere in the image
- Efficient computation: convolution on grids can be implemented as matrix multiplication or FFT, highly optimized on GPUs
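The fixed-kernel idea can be sketched in a few lines of NumPy. This is a naive loop, not an optimized implementation, and the name `conv2d` is chosen here for illustration; the point is that the same 9 weights are applied identically at every grid position:

```python
import numpy as np

def conv2d(image, kernel):
    """Naive 'valid' 2D convolution (really cross-correlation, as in
    deep learning): the same kernel weights are reused everywhere."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Each output pixel sees a fixed-size, fixed-layout window.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0        # a 3x3 mean filter: exactly 9 weights
out = conv2d(image, kernel)
print(out.shape)  # (3, 3)
```

Real frameworks never run this double loop; they lower it to batched matrix multiplications or FFTs, which is exactly the optimization the regular grid makes possible.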
Graphs: irregular topology
A graph has no fixed neighborhood structure. In a social network, one user might have 5 friends and another might have 5,000. There is no spatial layout. This irregularity requires:
- Variable-size aggregation: the “kernel” must handle any number of neighbors. Permutation-invariant functions (sum, mean, max) replace fixed kernel weights.
- Permutation invariance: the order of neighbors does not matter (unlike the fixed up/down/left/right of pixels)
- Sparse computation: only connected node pairs interact, using sparse matrix operations
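Under these requirements, neighborhood aggregation becomes a permutation-invariant reduction over an adjacency list. A minimal sketch, using a toy graph and an illustrative `aggregate` helper (not any particular library's API):

```python
import numpy as np

# Toy graph as an adjacency list: node -> neighbor list (any length).
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0, 4], 4: [3]}
feats = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # one feature per node

def aggregate(node, reduce=np.mean):
    """Mean over however many neighbors the node has. The result does
    not depend on the order in which neighbors are listed."""
    return reduce(feats[adj[node]], axis=0)

out = aggregate(0)   # node 0 has 3 neighbors; node 4 has 1 -- same code path
```

Swapping `np.mean` for `np.sum` or `np.max` gives the other standard aggregators; all three accept any neighbor count and ignore neighbor order.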
The convolution analogy
Both architectures perform local aggregation:
- CNN: new_pixel = sum(kernel_weight * neighbor_pixel for each fixed neighbor position)
- GNN: new_node = aggregate(transform(neighbor_feature) for each neighbor)
The difference is how “neighbor” is defined. For CNNs, neighbors are defined by grid position (up, down, left, right, diagonals). For GNNs, neighbors are defined by edge connections in the graph. The CNN kernel has position-specific weights (one weight for the pixel above, a different weight for the pixel to the right). The GNN transformation is shared across all neighbors (because there is no fixed spatial relationship).
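To make the analogy concrete, here is a minimal sketch of one GNN layer in NumPy, assuming a dense adjacency matrix `A` and node-feature matrix `X`. The weight matrices and the name `gnn_layer` are illustrative, not taken from any particular library:

```python
import numpy as np

rng = np.random.default_rng(0)
W_self = rng.normal(size=(4, 4))    # transform for the node's own features
W_neigh = rng.normal(size=(4, 4))   # ONE shared transform for all neighbors

def gnn_layer(A, X):
    """new_node = relu(X @ W_self + mean(neighbor features) @ W_neigh).
    Unlike a CNN kernel, there is no per-position weight: every neighbor
    passes through the same W_neigh."""
    deg = A.sum(axis=1, keepdims=True).clip(min=1)
    neigh_mean = (A @ X) / deg       # permutation-invariant aggregation
    return np.maximum(X @ W_self + neigh_mean @ W_neigh, 0.0)

A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = rng.normal(size=(4, 4))
H = gnn_layer(A, X)                  # one updated vector per node
```

Compare this to the CNN line above: the CNN would use nine different weight matrices, one per kernel position, which only makes sense when every neighbor has a fixed spatial role.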
Why you cannot use CNNs on graphs
Three fundamental problems:
- Variable degree: a 3x3 CNN kernel assumes 8 neighbors. Node A has 3 neighbors; node B has 300. No fixed kernel size works.
- No spatial ordering: a CNN kernel assigns weight_1 to the top-left pixel, weight_2 to the top-center, etc. Graph neighbors have no such ordering. Which neighbor gets weight_1?
- Permutation sensitivity: relabeling graph nodes changes the adjacency matrix but not the graph. CNNs on adjacency matrices would produce different outputs for the same graph.
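The third point can be checked directly. Two labelings of the same 3-node path graph give different adjacency matrices, so any model that reads the matrix like an image sees two different inputs; sum aggregation, by contrast, produces the same outputs up to relabeling. A small NumPy check (names chosen for illustration):

```python
import numpy as np

# The same path graph (0-1-2) under two node labelings.
A1 = np.array([[0, 1, 0],
               [1, 0, 1],
               [0, 1, 0]], dtype=float)
P = np.eye(3)[[2, 0, 1]]     # a relabeling permutation
A2 = P @ A1 @ P.T            # same graph, different adjacency matrix

X = np.array([[1.0], [2.0], [3.0]])
out1 = A1 @ X                # sum over neighbors, original labeling
out2 = A2 @ (P @ X)          # same aggregation on the relabeled graph
# A1 != A2 as matrices, yet out2 is exactly out1 with rows permuted.
```

A CNN applied to `A1` and `A2` as 3x3 "images" would in general produce different outputs for what is the same graph.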
Why you should not use GNNs on images
While GNNs can process images (treat pixels as nodes on a grid graph), they should not. The regular grid structure of images enables optimizations that GNNs cannot exploit:
- CNNs use dense tensor operations optimized for GPU parallelism
- CNNs exploit translation equivariance (the same kernel slides across the image)
- CNNs use well-developed architectures (ResNet, EfficientNet) with decades of optimization
Using a GNN on an image would be slower and typically less accurate, because it discards the regular structure that CNNs are built to exploit.
When each architecture applies
- Images, video, audio spectrograms: CNN (regular grid structure)
- Molecules, proteins: GNN (atoms/residues with variable bonds/contacts)
- Social networks, citation networks: GNN (users/papers with variable connections)
- Relational databases: GNN (rows with foreign key connections)
- Point clouds, meshes: GNN (irregular 3D structure)
- Road networks, circuits: GNN (nodes with variable degree)