Kumo.ai Research · October 21, 2022

Graph Neural Networks (GNNs): Introduction and Examples

A practical guide to understanding graphs, message passing, GNN architectures, and why graph-structured learning is transforming enterprise ML.

Matthias Fey, Ivaylo Bahtchevanov
01

What Are Graphs?

Most interesting phenomena in the world can be broken down into entities, their relationships, and their interactions. Graphs are the mathematical abstraction that captures this structure. A graph consists of nodes (entities) and edges (connections between entities).

This is not a niche data format. Graphs appear everywhere:

  • Social networks: People are nodes. Friendships, messages, and follows are edges. LinkedIn models people, companies, schools, and skills as a single interconnected graph with billions of nodes.
  • E-commerce: Users, products, and categories form a graph. Purchases, reviews, and browsing sessions are edges connecting them. Amazon and Alibaba use this structure to power recommendations.
  • Biology: Proteins, drugs, genes, and pathways interact in complex networks. A single molecule is itself a graph where atoms are nodes and chemical bonds are edges.
  • Finance: Accounts, transactions, and merchants form a dynamic graph. A single wire transfer is an edge connecting two account nodes, carrying attributes like amount, timestamp, and currency.
  • Infrastructure: Roads, power grids, supply chains, and communication networks are all graphs connecting physical or logical entities.
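
The examples above share one underlying representation: node attributes plus a list of attributed edges. A minimal sketch, using the financial example (all field names are illustrative, not tied to any library):

```python
# Node features: one dict per node (e.g., accounts in a payment network)
nodes = {
    0: {"balance": 1200.0},
    1: {"balance": 85.5},
    2: {"balance": 430.0},
}

# Edges with attributes: a wire transfer is an edge between two account nodes
edges = [
    (0, 1, {"amount": 250.0, "currency": "EUR"}),
    (1, 2, {"amount": 99.0, "currency": "EUR"}),
]

def neighbors(node, edges):
    """Return nodes reachable from `node` by an outgoing edge."""
    return [dst for src, dst, _ in edges if src == node]

print(neighbors(0, edges))  # [1]
```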

Homogeneous vs. heterogeneous graphs

A homogeneous graph has one type of node and one type of edge. Example: an author collaboration network where every node is a researcher and every edge means “co-authored a paper.”

A heterogeneous graph has multiple node types and edge types. This is far more common in practice. A financial graph might have account, merchant, and device nodes connected by transfer, purchase, and login edges. Each node and edge type can carry different attributes.
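
One common way to organize a heterogeneous graph is to key storage by node type and by (source type, relation, destination type) triples. The sketch below uses the financial schema from the text; the data structure itself is illustrative:

```python
# Node features grouped by node type -- each type has its own attributes
node_store = {
    "account":  [{"balance": 1200.0}, {"balance": 85.5}],
    "merchant": [{"category": "grocery"}],
    "device":   [{"os": "android"}],
}

# Edges grouped by (source type, relation, destination type)
edge_store = {
    ("account", "transfer", "account"):  [(0, 1)],
    ("account", "purchase", "merchant"): [(1, 0)],
    ("device",  "login",    "account"):  [(0, 0)],
}

# Each edge type can carry its own attribute schema
edge_attrs = {
    ("account", "transfer", "account"): [{"amount": 250.0}],
}
```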

02

Why Graphs for Machine Learning?

Traditional machine learning models (logistic regression, random forests, gradient-boosted trees) expect a flat table as input: one row per sample, one column per feature. This works well when your data is naturally tabular. But most real-world data is not.

Consider fraud detection. You have a transaction you want to classify as legitimate or fraudulent. The transaction itself has attributes: amount, timestamp, merchant category. But the strongest signals often come from context. Who sent the money? Who received it? What other transactions did those accounts make? What devices were used? How does this pattern compare to the broader network?

To use a flat-table model, you must manually engineer features that capture this context: “number of transactions in the last 24 hours,” “average transaction amount for this merchant,” “number of unique devices used by this account.” Each feature is a hand-crafted summary of a specific neighborhood in the graph. You lose the structure in the process.
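
To make that burden concrete, here is a sketch of flattening a transaction log into per-account features; every field and helper name is illustrative:

```python
transactions = [
    {"src": "A", "dst": "B", "amount": 250.0, "device": "d1", "ts": 100},
    {"src": "A", "dst": "C", "amount": 99.0,  "device": "d2", "ts": 105},
    {"src": "B", "dst": "C", "amount": 10.0,  "device": "d1", "ts": 110},
]

def account_features(account, txns):
    """Hand-crafted one-hop summaries -- anything deeper in the graph is lost."""
    sent = [t for t in txns if t["src"] == account]
    return {
        "n_sent": len(sent),
        "avg_amount": sum(t["amount"] for t in sent) / max(len(sent), 1),
        "n_devices": len({t["device"] for t in sent}),
    }

print(account_features("A", transactions))
# {'n_sent': 2, 'avg_amount': 174.5, 'n_devices': 2}
```

Each new task repeats this exercise with a different set of summaries, and none of them see beyond the neighborhood the analyst thought to encode.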

Flat-Table ML

Manual feature engineering

  • +Fast training on single tables
  • +Well-understood algorithms (XGBoost, etc.)
  • +Mature tooling and deployment pipelines
  • −Requires flattening relational data
  • −Loses multi-hop relationship signals
  • −Each new task needs new feature engineering
  • −Cannot capture graph topology

Graph Neural Networks

Learns from structure directly

  • +Operates on raw graph structure
  • +Captures multi-hop relationships automatically
  • +Adapts to variable-size neighborhoods
  • +Shared architecture across tasks
  • −Higher computational cost
  • −Requires graph construction step
  • −Harder to interpret than decision trees

Five types of graph learning tasks

Graph ML is not limited to classifying individual data points. The structure of the graph enables fundamentally different types of predictions:

  1. Node classification: Predict an attribute of a node. Example: is this account fraudulent? Will this customer churn?
  2. Link prediction: Predict whether an edge should exist between two nodes. Example: will this user purchase this product? Should we recommend this connection?
  3. Graph classification: Classify an entire graph. Example: is this molecule toxic? Will this chemical compound bind to a target protein?
  4. Community detection: Identify clusters of closely connected nodes. Example: detecting fraud rings or customer segments.
  5. Missing node identification: Discover entities that should exist but are not yet in the graph. Example: predicting undiscovered drug side effects.

03

How GNNs Work: Message Passing

The core mechanism behind GNNs is message passing. The idea is simple: each node updates its representation by collecting information from its neighbors. After multiple rounds of message passing, each node's representation encodes information from an increasingly large neighborhood of the graph.

1

Initialize

Each node starts with its own feature vector (e.g., account balance, age, location).

2

Aggregate

Each node collects (aggregates) feature vectors from its direct neighbors.

3

Update

Each node combines its own features with the aggregated neighbor information using a learned function.

4

Repeat

Steps 2-3 repeat for K layers. After K rounds, each node encodes its K-hop neighborhood.

A concrete example

Imagine a small social network with five users. Alice is connected to Bob and Carol. Bob is connected to Alice and Dave. Carol is connected to Alice and Eve. Each user has features: age, location, activity level.

Layer 1: Alice collects messages from Bob and Carol. Her new representation now encodes not just her own features, but a summary of Bob's and Carol's features.

Layer 2: Alice again collects from Bob and Carol. But now Bob's representation already contains information about Dave (from Layer 1), and Carol's contains information about Eve. So after two layers, Alice's representation indirectly encodes information about Dave and Eve, even though she is not directly connected to them.
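
This two-layer trace can be reproduced with plain mean aggregation. Features are collapsed to a single number per user for readability, with Dave and Eve starting at 1 and everyone else at 0 so the propagation is visible:

```python
adj = {  # undirected friendship edges from the example
    "Alice": ["Bob", "Carol"],
    "Bob":   ["Alice", "Dave"],
    "Carol": ["Alice", "Eve"],
    "Dave":  ["Bob"],
    "Eve":   ["Carol"],
}
h = {"Alice": 0.0, "Bob": 0.0, "Carol": 0.0, "Dave": 1.0, "Eve": 1.0}

def propagate(h, adj):
    # Each node averages its own value with its neighbors' values
    return {
        v: (h[v] + sum(h[u] for u in adj[v])) / (1 + len(adj[v]))
        for v in adj
    }

h1 = propagate(h, adj)   # layer 1: Dave's signal reaches Bob, Eve's reaches Carol
h2 = propagate(h1, adj)  # layer 2: both signals now reach Alice
print(h1["Alice"], h2["Alice"])  # 0.0 at layer 1, positive at layer 2
```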

The math (simplified)

For a node v at layer k, message passing follows this general pattern:

  1. Message: For each neighbor u of v, compute a message based on u's current features. This might be a simple copy, a linear transformation, or an attention-weighted projection.
  2. Aggregate: Combine all incoming messages into a single vector. Common choices: sum, mean, or max. The aggregation must be permutation-invariant because graphs have no inherent ordering of neighbors.
  3. Update: Combine the aggregated message with node v's own features from the previous layer, typically through a neural network layer (linear transformation + nonlinearity).
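
The three steps can be sketched as one layer in NumPy. This is a minimal illustration, not any particular library's implementation: messages are a linear transform of the neighbor, aggregation is a mean, and the update combines self and neighbor information through a ReLU. The weight matrices are random stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, d_in, d_out = 5, 4, 8
X = rng.normal(size=(n_nodes, d_in))    # initial node features
edges = [(0, 1), (1, 0), (0, 2), (2, 0), (1, 3), (3, 1), (2, 4), (4, 2)]

W_msg = rng.normal(size=(d_in, d_out))  # "learned" message transform
W_self = rng.normal(size=(d_in, d_out)) # "learned" self transform

def mp_layer(X, edges):
    agg = np.zeros((n_nodes, d_out))
    deg = np.zeros(n_nodes)
    for u, v in edges:                       # 1. message: transform neighbor u
        agg[v] += X[u] @ W_msg
        deg[v] += 1
    agg /= np.maximum(deg, 1)[:, None]       # 2. aggregate: permutation-invariant mean
    return np.maximum(X @ W_self + agg, 0)   # 3. update: combine + nonlinearity

H1 = mp_layer(X, edges)
print(H1.shape)  # (5, 8)
```
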

04

Types of GNNs

Different GNN architectures vary in how they compute messages and aggregate neighbor information. Each makes different trade-offs between expressiveness, computational cost, and scalability.

Graph Convolutional Networks (GCN)

GCN, introduced by Kipf and Welling in 2017, is the simplest and most widely taught GNN. Each layer computes a weighted average of neighbor features, where the weights are determined by node degrees. A node connected to many others contributes less per-connection than a node with few connections. This normalization prevents high-degree nodes from dominating the aggregation.

GCN is fast and effective for homogeneous graphs with relatively uniform structure. Its limitation: it treats all neighbors equally (after degree normalization). It cannot learn that some neighbors are more relevant than others for a given prediction.
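
The degree-normalized averaging described above corresponds to the propagation rule H' = σ(D̂⁻¹ᐟ² Â D̂⁻¹ᐟ² H W), where Â adds self-loops to the adjacency matrix and D̂ is its degree matrix. A NumPy sketch with a random stand-in for the learned weight matrix W:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0],      # adjacency of a small homogeneous graph
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
H = rng.normal(size=(4, 3))      # input node features
W = rng.normal(size=(3, 2))      # "learned" weight matrix

A_hat = A + np.eye(4)                    # add self-loops
d = A_hat.sum(axis=1)
D_inv_sqrt = np.diag(d ** -0.5)          # symmetric degree normalization
H_next = np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0)  # ReLU
print(H_next.shape)  # (4, 2)
```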

GraphSAGE (Sample and Aggregate)

GraphSAGE, developed by Hamilton, Ying, and Leskovec in 2017, solves a practical scalability problem. In real graphs, some nodes have millions of neighbors (think of a celebrity on a social network). Aggregating all neighbors is computationally prohibitive.

GraphSAGE samples a fixed number of neighbors at each layer instead of using all of them. It also introduces learnable aggregation functions (mean, LSTM, or pooling) rather than fixed weighted averages. This makes it practical for graphs with billions of nodes, which is why companies like Pinterest and Uber adopted it for production systems.
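
The sampling idea can be sketched in a few lines: cap the neighborhood at a fixed fan-out before aggregating, so the cost per node is bounded regardless of degree. Names here are illustrative; the aggregator shown is the mean variant.

```python
import random

def sample_neighbors(adj, node, fanout, rng):
    nbrs = adj[node]
    if len(nbrs) <= fanout:
        return list(nbrs)
    return rng.sample(nbrs, fanout)   # uniform sample without replacement

adj = {0: [1, 2, 3, 4, 5, 6, 7, 8], 1: [0], 2: [0]}  # node 0 is high-degree
h = {n: float(n) for n in range(9)}                  # toy scalar features

rng = random.Random(42)
sampled = sample_neighbors(adj, 0, fanout=3, rng=rng)
agg = sum(h[u] for u in sampled) / len(sampled)      # mean aggregator
assert len(sampled) == 3  # cost bounded by the fan-out, not the degree
```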

Graph Attention Networks (GAT)

GAT, introduced by Velickovic et al. in 2018, brings the attention mechanism from Transformers into graph learning. Instead of treating all neighbors equally or weighting them by degree, GAT learns attention scores between every pair of connected nodes. Each neighbor's contribution is weighted by how relevant it is to the target node, and this relevance is learned during training.

For example, in a citation network, GAT can learn that a paper's prediction should attend more strongly to frequently-cited references than to tangential citations. The attention weights are data-dependent and task-specific.
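
A simplified sketch of the attention step: score each neighbor against the target node, normalize the scores with a softmax over the neighborhood, and take the weighted sum. For brevity the scoring function here is a plain dot product; the original GAT scores pairs with a LeakyReLU over a learned linear form.

```python
import numpy as np

rng = np.random.default_rng(0)
h_v = rng.normal(size=4)             # target node representation
H_nbrs = rng.normal(size=(3, 4))     # three neighbor representations

scores = H_nbrs @ h_v                             # relevance of each neighbor
alpha = np.exp(scores) / np.exp(scores).sum()     # softmax over the neighborhood
out = alpha @ H_nbrs                              # attention-weighted aggregation

assert np.isclose(alpha.sum(), 1.0)  # weights form a distribution over neighbors
```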

GCN

Simple, degree-normalized

  • +Easy to implement and understand
  • +Low computational overhead
  • +Strong baseline for many tasks
  • −Treats all neighbors equally
  • −Struggles with heterogeneous graphs
  • −Fixed aggregation scheme

GraphSAGE

Scalable, sampling-based

  • +Scales to billion-node graphs
  • +Learnable aggregation functions
  • +Works for inductive learning (new nodes)
  • −Sampling introduces variance
  • −Still no per-neighbor weighting
  • −Hyperparameter: sample size per layer

GAT

Attention-weighted

  • +Learns which neighbors matter most
  • +Attention weights are interpretable
  • +Strong performance on heterogeneous data
  • −Higher memory cost than GCN
  • −Attention computation adds overhead
  • −Can overfit on small graphs

Other notable architectures

  • GIN (Graph Isomorphism Network): Designed to be maximally expressive. GIN can distinguish graph structures that simpler GNNs (GCN, GraphSAGE) cannot tell apart. It achieves this by using a sum aggregation with learned injective functions, making it as powerful as the Weisfeiler-Leman graph isomorphism test.
  • MPNN (Message Passing Neural Network): A general framework that unifies most GNN architectures. GCN, GraphSAGE, and GAT are all special cases of the MPNN framework with different message and aggregation functions.
  • R-GCN (Relational GCN): Extends GCN to handle multiple edge types. Each edge type gets its own weight matrix, allowing the model to learn different transformations for different relationship types. Essential for heterogeneous graphs like knowledge graphs.
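
The per-relation weights of R-GCN can be sketched as follows, reusing the financial relation names from earlier in the text; the weight matrices are random stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
W = {rel: rng.normal(size=(d, d)) for rel in ("transfer", "purchase")}
W_self = rng.normal(size=(d, d))

# Neighbors of the target node, grouped by relation type
nbrs = {
    "transfer": rng.normal(size=(2, d)),   # two transfer neighbors
    "purchase": rng.normal(size=(1, d)),   # one purchase neighbor
}
h_v = rng.normal(size=d)

# Each relation applies its own transformation before the messages are summed
msg = sum(Hr.mean(axis=0) @ W[rel] for rel, Hr in nbrs.items())
h_next = np.maximum(h_v @ W_self + msg, 0)  # ReLU update
print(h_next.shape)  # (4,)
```
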
05

Real-World Applications

GNNs have moved well beyond academic benchmarks. Here are concrete production deployments with measurable results.

Fraud detection and financial crime

Financial networks are naturally graph-structured: accounts transact with other accounts, share devices, and connect through intermediaries. Fraudsters exploit this by laundering money through chains of accounts or coordinating across seemingly unrelated entities.

A European central bank applied graph ML to their transaction network and improved fraud detection accuracy from 43% to 76%. The key insight: fraud patterns are structural. A suspicious transaction looks normal in isolation but reveals itself when you examine the 2-3 hop neighborhood of accounts involved.


Recommender systems

Recommendation is one of the most successful GNN applications in production. GNNs model user-item interactions as a bipartite graph, where edges capture purchases, views, or ratings. The GNN learns representations that capture behavioral similarity through the graph structure, going far beyond simple collaborative filtering.

Uber Eats, Spotify, Amazon, and Alibaba all deploy GNN-based recommendation systems. The common pattern: user-item interaction graphs capture richer preference signals than collaborative filtering on flat matrices. A user's embedding encodes not just their history, but the histories of behaviorally similar users, propagated through the graph.

Drug discovery and biomedicine

Molecules are graphs where atoms are nodes and chemical bonds are edges. GNNs can predict molecular properties (toxicity, solubility, binding affinity) directly from the molecular graph without requiring hand-crafted molecular fingerprints.

Beyond single molecules, biological knowledge graphs connect drugs, proteins, diseases, genes, and pathways. GNNs on these graphs can predict drug-protein interactions, identify potential side effects, and forecast clinical trial outcomes. The graph structure captures relationships that would be invisible in a flat table of molecular descriptors.

Trust and abuse detection

Airbnb uses GNNs to assess host trustworthiness. The challenge: new hosts have no review history. But they do have connections. Their phone number, email domain, IP address, and payment method link them to a graph of existing users. GNNs propagate trust (or distrust) signals through these connections, enabling risk assessment for accounts with zero historical data.

Natural language processing

Knowledge graphs encode factual relationships: (Paris, capital-of, France), (Einstein, born-in, Ulm). GNNs on knowledge graphs enhance language models with explicit relational knowledge, improving performance on question-answering and commonsense reasoning tasks where pure text models struggle.

Computer vision and 3D understanding

Point clouds from LiDAR sensors, 3D meshes in CAD software, and scene graphs describing object relationships are all naturally graph-structured. GNNs process these without converting to grids or sequences, preserving the geometric and topological information that CNNs on voxel grids would lose.

06

Tools and Implementation

Building GNNs from scratch requires handling irregular data structures (sparse adjacency matrices, variable-size neighborhoods) that standard deep learning frameworks do not natively support. Several libraries solve this.

PyTorch Geometric (PyG)

PyG is the most widely adopted GNN library, providing implementations of over 80 GNN architectures and access to more than 200 benchmark datasets. It handles the low-level complexity of batching graphs, sampling neighborhoods, and computing sparse message passing efficiently on GPUs.

Major organizations including Spotify, Airbus, AstraZeneca, and Stanford build on PyG. It supports heterogeneous graphs, temporal graphs, and scales to graphs with hundreds of millions of edges.

From research to production

Deploying GNNs in production involves challenges beyond model training. You need to construct the graph from raw data sources, handle dynamic updates as new nodes and edges arrive, serve predictions at low latency, and retrain as the graph evolves. These engineering concerns often dwarf the modeling effort.

Kumo.ai abstracts this entire pipeline. Given relational tables and a prediction target, the platform automatically constructs the graph, selects the architecture, trains the model, and serves predictions. Analysts define what they want to predict using a SQL-like interface rather than writing GNN code.

1

Raw Tables

Relational data in databases or warehouses, connected by foreign keys.

2

Graph Construction

Tables become nodes and edges. Foreign keys define the graph topology.

3

GNN Training

Message passing learns node representations that encode multi-hop context.

4

Prediction

Node classification, link prediction, or graph-level outputs served at scale.

07

Limitations and What Comes Next

GNNs are powerful, but they have well-understood limitations that motivate ongoing research.

Over-smoothing

As you stack more message-passing layers, node representations converge. After 5-6 layers, all nodes in a connected component can end up with nearly identical representations, losing the local information that distinguishes them. This limits GNNs to capturing patterns within a relatively small neighborhood (typically 2-3 hops).
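
Over-smoothing can be demonstrated directly: repeatedly applying a mean-aggregation operator to a connected graph drives all node values toward a common value, erasing the local information that distinguished them. The graph and feature values below are arbitrary.

```python
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_hat = A + np.eye(4)                          # self-loops
P = A_hat / A_hat.sum(axis=1, keepdims=True)   # row-normalized mean operator

h = np.array([1.0, 0.0, 0.0, 0.0])             # one node starts out distinctive
for _ in range(20):                            # 20 rounds of smoothing
    h = P @ h

print(h.round(3))  # all entries nearly identical
```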

Expressiveness ceiling

Standard message-passing GNNs are bounded by the Weisfeiler-Leman (WL) graph isomorphism test. This means there exist structurally different graphs that no message-passing GNN can distinguish. For example, certain regular graphs (where every node has the same degree and same local structure) produce identical node representations even though the global structures differ.

Scalability trade-offs

Full-batch GNN training requires storing the entire graph in memory, which fails for graphs with billions of edges. Mini-batch training with neighborhood sampling (as in GraphSAGE) introduces variance and can miss important long-range connections. Distributed training adds communication overhead. Each approach makes trade-offs between accuracy, memory, and speed.

The path to graph transformers

Graph transformers address several of these limitations by replacing local message passing with global attention. Instead of aggregating only from direct neighbors, a graph transformer can attend to any node in the graph, weighted by learned relevance scores. This eliminates over-smoothing (no iterative neighborhood averaging) and breaks the WL expressiveness ceiling.

The trade-off is computational cost. Standard self-attention scales quadratically with the number of nodes. Recent research combines graph structure with efficient attention mechanisms: using the graph topology to define sparse attention patterns, encoding positional information from the graph structure, and integrating message passing with attention in hybrid architectures.

KumoRFM, Kumo.ai's foundation model for relational data, builds on this line of research. It uses a graph transformer architecture that operates directly on multi-table relational databases, combining the structural awareness of GNNs with the global reasoning capability of transformers. The result: a single pre-trained model that generalizes across schemas and tasks without per-task feature engineering.

Standard GNNs

Local message passing

  • +Efficient for sparse graphs
  • +Well-understood theory
  • +Many proven architectures
  • −Over-smoothing with depth
  • −Bounded by WL test
  • −Limited receptive field

Graph Transformers

Global attention on graphs

  • +No over-smoothing
  • +Beyond WL expressiveness
  • +Captures long-range dependencies
  • −Quadratic attention cost
  • −Requires positional encodings
  • −Still an active research area

Try KumoRFM on your own data

Zero-shot predictions are free. Fine-tuning is available with a trial.