CNNs and GNNs share the same core idea: local aggregation with shared weights. But they apply it to fundamentally different data structures. A CNN convolves a fixed-size kernel over a regular pixel grid, where every pixel has exactly the same neighborhood structure. A GNN aggregates information from a variable-size neighborhood on an irregular graph, where node A might have 3 neighbors and node B might have 300.
Regular vs irregular structure
Images: the regular grid
An image is a 2D grid. Every interior pixel has exactly 8 neighbors (3x3 neighborhood). The spatial relationships are fixed: the pixel above is always above, the pixel to the right is always to the right. This regularity enables:
- Fixed-size kernels: a 3x3 kernel has exactly 9 weights, applied identically everywhere
- Translation equivariance: the same pattern is detected anywhere in the image
- Efficient computation: convolution on grids can be implemented as matrix multiplication or FFT, highly optimized on GPUs
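The fixed-kernel idea can be sketched in a few lines of NumPy. This is a naive loop, not an optimized implementation, and the name `conv2d` is chosen here for illustration; the point is that the same 9 weights are applied identically at every grid position:

```python
import numpy as np

def conv2d(image, kernel):
    """Naive 'valid' 2D convolution (really cross-correlation, as in
    deep learning): the same kernel weights are reused everywhere."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Each output pixel sees a fixed-size, fixed-layout window.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0        # a 3x3 mean filter: exactly 9 weights
out = conv2d(image, kernel)
print(out.shape)  # (3, 3)
```

Real frameworks never run this double loop; they lower it to batched matrix multiplications or FFTs, which is exactly the optimization the regular grid makes possible.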
Graphs: irregular topology
A graph has no fixed neighborhood structure. In a social network, one user might have 5 friends and another might have 5,000. There is no spatial layout. This irregularity requires:
- Variable-size aggregation: the “kernel” must handle any number of neighbors. Permutation-invariant functions (sum, mean, max) replace fixed kernel weights.
- Permutation invariance: the order of neighbors does not matter (unlike the fixed up/down/left/right of pixels)
- Sparse computation: only connected node pairs interact, using sparse matrix operations
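Under these requirements, neighborhood aggregation becomes a permutation-invariant reduction over an adjacency list. A minimal sketch, using a toy graph and an illustrative `aggregate` helper (not any particular library's API):

```python
import numpy as np

# Toy graph as an adjacency list: node -> neighbor list (any length).
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0, 4], 4: [3]}
feats = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # one feature per node

def aggregate(node, reduce=np.mean):
    """Mean over however many neighbors the node has. The result does
    not depend on the order in which neighbors are listed."""
    return reduce(feats[adj[node]], axis=0)

out = aggregate(0)   # node 0 has 3 neighbors; node 4 has 1 -- same code path
```

Swapping `np.mean` for `np.sum` or `np.max` gives the other standard aggregators; all three accept any neighbor count and ignore neighbor order.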
The convolution analogy
Both architectures perform local aggregation:
- CNN: new_pixel = sum(kernel_weight * neighbor_pixel for each fixed neighbor position)
- GNN: new_node = aggregate(transform(neighbor_feature) for each neighbor)
The difference is how “neighbor” is defined. For CNNs, neighbors are defined by grid position (up, down, left, right, diagonals). For GNNs, neighbors are defined by edge connections in the graph. The CNN kernel has position-specific weights (one weight for the pixel above, a different weight for the pixel to the right). The GNN transformation is shared across all neighbors (because there is no fixed spatial relationship).
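To make the analogy concrete, here is a minimal sketch of one GNN layer in NumPy, assuming a dense adjacency matrix `A` and node-feature matrix `X`. The weight matrices and the name `gnn_layer` are illustrative, not taken from any particular library:

```python
import numpy as np

rng = np.random.default_rng(0)
W_self = rng.normal(size=(4, 4))    # transform for the node's own features
W_neigh = rng.normal(size=(4, 4))   # ONE shared transform for all neighbors

def gnn_layer(A, X):
    """new_node = relu(X @ W_self + mean(neighbor features) @ W_neigh).
    Unlike a CNN kernel, there is no per-position weight: every neighbor
    passes through the same W_neigh."""
    deg = A.sum(axis=1, keepdims=True).clip(min=1)
    neigh_mean = (A @ X) / deg       # permutation-invariant aggregation
    return np.maximum(X @ W_self + neigh_mean @ W_neigh, 0.0)

A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = rng.normal(size=(4, 4))
H = gnn_layer(A, X)                  # one updated vector per node
```

Compare this to the CNN line above: the CNN would use nine different weight matrices, one per kernel position, which only makes sense when every neighbor has a fixed spatial role.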
Why you cannot use CNNs on graphs
Three fundamental problems:
- Variable degree: a 3x3 CNN kernel assumes 8 neighbors. Node A has 3 neighbors; node B has 300. No fixed kernel size works.
- No spatial ordering: a CNN kernel assigns weight_1 to the top-left pixel, weight_2 to the top-center, etc. Graph neighbors have no such ordering. Which neighbor gets weight_1?
- Permutation sensitivity: relabeling graph nodes changes the adjacency matrix but not the graph. CNNs on adjacency matrices would produce different outputs for the same graph.
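The third point can be checked directly. Two labelings of the same 3-node path graph give different adjacency matrices, so any model that reads the matrix like an image sees two different inputs; sum aggregation, by contrast, produces the same outputs up to relabeling. A small NumPy check (names chosen for illustration):

```python
import numpy as np

# The same path graph (0-1-2) under two node labelings.
A1 = np.array([[0, 1, 0],
               [1, 0, 1],
               [0, 1, 0]], dtype=float)
P = np.eye(3)[[2, 0, 1]]     # a relabeling permutation
A2 = P @ A1 @ P.T            # same graph, different adjacency matrix

X = np.array([[1.0], [2.0], [3.0]])
out1 = A1 @ X                # sum over neighbors, original labeling
out2 = A2 @ (P @ X)          # same aggregation on the relabeled graph
# A1 != A2 as matrices, yet out2 is exactly out1 with rows permuted.
```

A CNN applied to `A1` and `A2` as 3x3 "images" would in general produce different outputs for what is the same graph.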
Why you should not use GNNs on images
While GNNs can process images (treat pixels as nodes on a grid graph), they should not. The regular grid structure of images enables optimizations that GNNs cannot exploit:
- CNNs use dense tensor operations optimized for GPU parallelism
- CNNs exploit translation equivariance (the same kernel slides across the image)
- CNNs use well-developed architectures (ResNet, EfficientNet) with decades of optimization
Using a GNN on an image would be slower and typically less accurate, because it discards the regular structure that CNNs are built to exploit.
When each architecture applies
- Images, video, audio spectrograms: CNN (regular grid structure)
- Molecules, proteins: GNN (atoms/residues with variable bonds/contacts)
- Social networks, citation networks: GNN (users/papers with variable connections)
- Relational databases: GNN (rows with foreign key connections)
- Point clouds, meshes: GNN (irregular 3D structure)
- Road networks, circuits: GNN (nodes with variable degree)