Real-world graphs are sparse: they contain a tiny fraction of all possible edges. A complete graph on 1 million nodes has roughly 500 billion edges. A real social network with 1 million users typically has 10-50 million edges, at most 0.01% of the theoretical maximum. This sparsity is not a data quality issue. It reflects the fundamental constraint that real entities connect to a small number of other entities.
Why graphs are sparse
Physical, social, and structural constraints limit connections:
- Social networks: Dunbar's number (~150 meaningful relationships). Even the most connected users follow thousands, not millions.
- Molecules: carbon has at most 4 bonds. Most atoms bond to 1-4 neighbors. A molecule with 50 atoms has ~55 bonds, not 1,225 (the complete graph).
- Road networks: intersections connect to 3-4 roads. A city with 10,000 intersections has ~20,000 road segments.
- Databases: each row has a fixed number of foreign keys (typically 1-5). A table with 1 million rows has 1-5 million FK edges, not a trillion.
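The figures above follow from simple counting. A minimal sketch in plain Python (the function names are mine, chosen for illustration) that compares actual edge counts to the complete-graph maximum:

```python
def complete_graph_edges(n: int) -> int:
    """Edges in a complete undirected graph on n nodes: n*(n-1)/2."""
    return n * (n - 1) // 2

def sparsity(n_nodes: int, n_edges: int) -> float:
    """Fraction of all possible edges actually present."""
    return n_edges / complete_graph_edges(n_nodes)

# Social network: 1M users, 50M friendships
print(complete_graph_edges(1_000_000))            # 499999500000 (~500 billion)
print(f"{sparsity(1_000_000, 50_000_000):.4%}")   # 0.0100%

# A 50-atom molecule: ~55 bonds vs. the complete graph on 50 atoms
print(complete_graph_edges(50))                   # 1225
```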
Computational advantage of sparsity
Sparsity is why GNNs scale to large graphs:
- Message passing cost: O(|E|) per layer, not O(N^2). For a graph with 1M nodes and 10M edges, this is 10M operations vs 1 trillion.
- Memory: PyG stores graphs as sparse edge_index tensors (two arrays of length |E|). A 1M-node graph with 10M edges needs ~80 MB with 32-bit indices (about twice that with PyG's default 64-bit indices), not 4 TB (for a dense float32 adjacency matrix).
- GPU efficiency: sparse operations (scatter, gather) are well-optimized on modern GPUs.
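The memory comparison can be sanity-checked with back-of-the-envelope arithmetic. A sketch assuming 32-bit integer node indices for the sparse representation and float32 entries for the dense matrix:

```python
def edge_index_bytes(n_edges: int, index_bytes: int = 4) -> int:
    # edge_index is two arrays of length |E| (source and target node ids)
    return 2 * n_edges * index_bytes

def dense_adjacency_bytes(n_nodes: int, value_bytes: int = 4) -> int:
    # One value per (node, node) pair
    return n_nodes * n_nodes * value_bytes

GB = 1024 ** 3
n, e = 1_000_000, 10_000_000
print(edge_index_bytes(e) / GB)       # ~0.07 GB, i.e. ~80 MB
print(dense_adjacency_bytes(n) / GB)  # ~3725 GB, i.e. ~4 TB
```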
Information limitation of sparsity
Sparsity means many node pairs are far apart. In a sparse graph with average degree 20, a 2-layer GNN gives each node access to roughly 400 neighbors (20^2). On a graph with 1 million nodes, that is 0.04% of the graph. The remaining 99.96% is invisible.
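The receptive-field arithmetic generalizes: a k-layer GNN on a graph with average degree d reaches on the order of d^k nodes. A sketch of this upper-bound estimate (real neighborhoods overlap, so the true reachable count is smaller):

```python
def receptive_field_fraction(n_nodes: int, avg_degree: int, n_layers: int) -> float:
    """Upper bound on the fraction of the graph visible to one node after n_layers hops."""
    reachable = avg_degree ** n_layers
    return min(reachable, n_nodes) / n_nodes

# Average degree 20, 1M nodes, as in the text
print(f"{receptive_field_fraction(1_000_000, 20, 2):.2%}")  # 0.04%
print(f"{receptive_field_fraction(1_000_000, 20, 4):.2%}")  # 16.00% at best
```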
This creates three problems:
Limited receptive field
With 2-3 layers, each node sees only its local neighborhood. If the prediction depends on global graph properties (the overall community structure, the total graph size, the existence of a distant cluster), the GNN cannot capture it.
Over-squashing
Information from a distant node must travel through multiple hops, getting compressed at each hop into fixed-size vectors. After 4-5 hops through bottleneck edges, the original signal is effectively lost. This is the over-squashing problem: the sparse graph's bottleneck topology limits information transmission.
Disconnected components
Very sparse graphs may have disconnected components: groups of nodes with no path between them. Information cannot flow across disconnected components through message passing. Nodes in different components are completely invisible to each other.
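Disconnected components are easy to detect with a breadth-first search. A self-contained sketch (plain Python over an undirected edge list) that labels each node with a component id; nodes with different labels can never exchange messages:

```python
from collections import deque

def connected_components(n_nodes, edges):
    """Assign a component id to every node via BFS over an undirected edge list."""
    adj = [[] for _ in range(n_nodes)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    comp = [-1] * n_nodes
    cid = 0
    for start in range(n_nodes):
        if comp[start] != -1:
            continue  # already visited
        comp[start] = cid
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if comp[v] == -1:
                    comp[v] = cid
                    queue.append(v)
        cid += 1
    return comp

# Two triangles with no edge between them: message passing can never mix them
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3)]
print(connected_components(6, edges))  # [0, 0, 0, 1, 1, 1]
```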
Addressing sparsity limitations
- Graph transformers: replace sparse message passing with dense global attention. Every node attends to every other node in one layer. This eliminates over-squashing and the limited receptive field, but costs O(N^2) per layer, so it is feasible only for small graphs (thousands of nodes).
- Virtual nodes: add a single “virtual” node connected to all real nodes. This provides a global communication channel: information from any node reaches any other node in 2 hops (through the virtual node).
- Graph rewiring: add edges based on computed similarity or random connections. Reduces graph diameter and mitigates over-squashing at the cost of increased edge count.
- Positional encodings: encode each node's global position (via Laplacian eigenvectors or random walk statistics) as input features. This gives the GNN global structural information without global attention.
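The virtual-node trick above can be sketched as a pure graph transformation, independent of any framework (PyG ships a similar transform as torch_geometric.transforms.VirtualNode). Appending one node connected to every real node caps the shortest-path distance between any two nodes at 2, even across previously disconnected components:

```python
from collections import deque

def add_virtual_node(n_nodes, edges):
    """Append node id n_nodes, connected to every real node (undirected edge list)."""
    virtual = n_nodes
    new_edges = list(edges) + [(virtual, v) for v in range(n_nodes)]
    return n_nodes + 1, new_edges

def bfs_distance(n_nodes, edges, src, dst):
    """Shortest-path hop count from src to dst, or -1 if unreachable."""
    adj = [[] for _ in range(n_nodes)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    dist = [-1] * n_nodes
    dist[src] = 0
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if dist[v] == -1:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist[dst]

# Two disconnected edges: nodes 0 and 3 start out mutually unreachable
print(bfs_distance(4, [(0, 1), (2, 3)], 0, 3))  # -1
n, e = add_virtual_node(4, [(0, 1), (2, 3)])
print(bfs_distance(n, e, 0, 3))                 # 2: via the virtual node
```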
The sparsity-expressiveness tradeoff
GNN architecture design is fundamentally about navigating the sparsity-expressiveness tradeoff:
- Pure sparse (GCN, GAT): scale to billions of nodes but limited to local patterns
- Pure dense (transformers): capture global patterns but limited to thousands of nodes
- Hybrid (GPS, KumoRFM): combine sparse local message passing with dense global attention on subgraphs, balancing scalability and expressiveness