Graph partitioning is the foundation of distributed graph neural network training. A social network with 100 million users, a financial transaction graph with a billion edges, an enterprise database whose 50 million rows link into a relational graph: these graphs exceed the memory of any single machine. Partitioning splits them into pieces that fit on individual machines while preserving as much local graph structure as possible.
The partitioning problem
Given a graph G with N nodes and E edges, partition it into k subgraphs such that:
- Balance: each partition has approximately N/k nodes (no partition is much larger than others)
- Minimum edge cut: as few edges as possible cross partition boundaries
These objectives often conflict. Perfectly balanced partitions may cut many edges. Minimizing edge cuts may create imbalanced partitions (one large community stays together). Algorithms find practical tradeoffs.
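The two objectives can be made concrete with a pair of toy metrics. A minimal sketch, assuming an edge list and a node-to-partition mapping (the names `edge_cut`, `balance`, and the example graph are illustrative, not from any particular framework):

```python
from collections import Counter

def edge_cut(edges, part):
    """Number of edges whose endpoints fall in different partitions."""
    return sum(1 for u, v in edges if part[u] != part[v])

def balance(part, k):
    """Ratio of the largest partition size to the ideal size N/k.
    1.0 means perfectly balanced; larger means more imbalance."""
    sizes = Counter(part.values())
    n = len(part)
    return max(sizes.values()) / (n / k)

# A 6-node path graph split into two halves: only edge (2, 3) is cut.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
part = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
print(edge_cut(edges, part))  # 1
print(balance(part, 2))       # 1.0
```

Shifting node 2 to partition 1 would cut fewer edges on some graphs but push `balance` above 1, illustrating how the two objectives trade off.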
Why edge cuts matter for GNNs
During message passing, each node aggregates from its neighbors. If a neighbor is in a different partition (on a different machine), the message must be communicated over the network. This inter-machine communication is orders of magnitude slower than local memory access.
Minimizing edge cuts minimizes inter-machine communication. If 95% of edges are within partitions, only 5% of messages require network communication.
METIS: the standard algorithm
METIS uses a three-phase multilevel approach:
- Coarsening: iteratively collapse pairs of connected nodes into super-nodes, roughly halving the number of nodes each round. After enough rounds (on the order of 10 for a million-node graph), the coarsened graph has thousands of nodes rather than millions.
- Partitioning: partition the small coarsened graph using a simple algorithm (e.g., recursive bisection). This is fast because the graph is small.
- Uncoarsening: expand super-nodes back to original nodes, refining the partition at each level using local search (swap nodes between partitions to reduce edge cuts).
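The coarsening phase can be sketched as a greedy matching: visit each node, pair it with an unmatched neighbor, and merge each pair into a super-node. This is a simplified single round (real METIS prefers the heaviest edge and tracks node weights; `coarsen_once` and the adjacency format are illustrative):

```python
def coarsen_once(adj):
    """One coarsening round: greedily match each node with an unmatched
    neighbor and collapse matched pairs into super-nodes.
    adj maps node -> set of neighbors; returns (coarse_adj, mapping)."""
    matched = {}
    for u in adj:
        if u in matched:
            continue
        # Pick any unmatched neighbor; an unmatchable node maps to itself.
        partner = next((v for v in adj[u] if v not in matched), u)
        matched[u] = matched[partner] = min(u, partner)  # super-node id
    coarse = {}
    for u, nbrs in adj.items():
        su = matched[u]
        coarse.setdefault(su, set())
        for v in nbrs:
            sv = matched[v]
            if su != sv:
                coarse[su].add(sv)
                coarse.setdefault(sv, set()).add(su)
    return coarse, matched

# A 4-cycle collapses to 2 super-nodes joined by a single coarse edge.
adj = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
coarse, mapping = coarsen_once(adj)
```

Repeating `coarsen_once` until the graph is small, partitioning the result, and then refining while undoing the `mapping` gives the overall multilevel shape of the algorithm.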
METIS runs in O(E) time (linear in the number of edges) and produces high-quality partitions. It is the default choice for graph partitioning in most GNN frameworks.
Halo expansion
Nodes at partition boundaries lose some neighbors (those in other partitions). For a 2-layer GNN, boundary nodes need their 2-hop neighborhood to be complete. Halo expansion addresses this:
- For each partition, include all nodes within L hops of the partition boundary, where L is the number of GNN layers (a different symbol than k, the number of partitions)
- These “halo nodes” are stored redundantly (they exist in their home partition AND as halos in adjacent partitions)
- During message passing, halo nodes provide the full neighborhood context for boundary nodes
The tradeoff: halo expansion increases storage (redundant node copies) and requires halo synchronization (updating halo node embeddings between layers). But it eliminates the accuracy loss from partition boundaries.
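Selecting a partition's halo can be sketched as a breadth-first expansion: start from the partition's own nodes and collect everything reachable within `num_layers` hops that lives elsewhere. A minimal sketch, with `halo_nodes`, `adj`, and `part` as illustrative names:

```python
def halo_nodes(adj, part, pid, num_layers):
    """Nodes outside partition `pid` that lie within num_layers hops
    of it: the redundant copies a num_layers-layer GNN needs so that
    boundary nodes see their full neighborhoods."""
    frontier = {u for u, p in part.items() if p == pid}
    seen = set(frontier)
    halo = set()
    for _ in range(num_layers):
        frontier = {v for u in frontier for v in adj[u] if v not in seen}
        seen |= frontier
        halo |= frontier
    return halo

# Path 0-1-2-3-4-5, partition 0 owns {0, 1, 2}; a 2-layer GNN
# needs nodes 3 and 4 stored redundantly as halos.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
part = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
print(halo_nodes(adj, part, 0, 2))  # {3, 4}
```

The size of the returned set is exactly the storage overhead of halo expansion for that partition, and its members are the embeddings that must be synchronized between layers.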
Partitioning strategies beyond METIS
- Random partitioning: assign nodes randomly. Simple but high edge cut. Useful as a baseline.
- Hash-based partitioning: partition by hashing node IDs. Simple, deterministic, but ignores structure.
- Community-based: detect communities first (e.g., with Louvain), then assign each community to a partition. Low edge cut but potentially imbalanced.
- Streaming partitioning: for graphs that arrive as edge streams, assign nodes as they arrive. Does not require the full graph in memory.
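The streaming approach can be sketched in the spirit of linear deterministic greedy (LDG): place each newly seen node in the partition holding the most of its already-placed neighbors, discounted by how full that partition is. This simplified edge-stream variant scores against a single neighbor per edge (the names `stream_partition` and `capacity` are illustrative):

```python
def stream_partition(edge_stream, k, capacity):
    """Greedy streaming assignment: each unseen node goes to the
    partition where its already-placed neighbor lives, penalized by
    fullness; ties break toward the emptiest partition."""
    part, sizes = {}, [0] * k
    for u, v in edge_stream:
        for node, other in ((u, v), (v, u)):
            if node in part:
                continue
            def score(p):
                nbr = 1 if part.get(other) == p else 0
                return nbr * (1 - sizes[p] / capacity)
            best = max(range(k), key=lambda p: (score(p), -sizes[p]))
            part[node] = best
            sizes[best] += 1
    return part

# Two triangles-to-be joined by one late edge: the stream keeps
# {0, 1, 2} together and {3, 4, 5} together, cutting only (2, 3).
edges = [(0, 1), (1, 2), (3, 4), (4, 5), (2, 3)]
part = stream_partition(edges, k=2, capacity=4)
```

Unlike METIS, this never holds the full edge list in memory, which is the point: quality is worse, but it works on graphs that arrive as streams.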