Pull up the schema diagram for any enterprise database. You will see tables connected by foreign keys: customers linked to orders, orders linked to products, products linked to categories, categories linked to suppliers. Draw it on a whiteboard and it looks like a graph. It is a graph. It has always been a graph.
For 30 years, machine learning ignored this structure. Every ML model from logistic regression to XGBoost requires a flat input table: one row per entity, one column per feature. To use these models on relational data, you must flatten the graph into a spreadsheet. That flattening step is called feature engineering, and it destroys the very relationships that make relational data valuable.
Graph neural networks are the first ML architecture designed to operate on graph-structured data directly. They read nodes, edges, and their attributes, and they learn representations that capture the relational patterns between entities. Applied to enterprise databases, they eliminate the flattening step and access signals that flat-table models structurally cannot reach.
Graphs are everywhere (you just flatten them)
The word "graph" in the ML context does not mean a bar chart. It means a data structure with nodes (entities) and edges (relationships). Social networks are graphs (users connected by friendships). Transaction networks are graphs (accounts connected by payments). And relational databases are graphs, though most people do not think of them that way.
Consider an e-commerce database. The customers table has 100,000 rows. Each customer places orders (orders table, 2 million rows). Each order contains items (order_items, 8 million rows). Each item references a product (products, 50,000 rows). Each product belongs to a category. Each product has reviews from other customers.
This is a heterogeneous graph with six node types (customers, orders, order items, products, categories, reviews) and millions of edges. A single customer is connected to dozens of orders, hundreds of products, thousands of reviews from other customers, and those customers' own purchase histories. The graph encodes who bought what, when, and what other people who bought the same things did next.
Traditional ML cannot see this structure. To predict churn for a customer, a data scientist computes aggregations: total_orders_30d, avg_order_value, days_since_last_order. These features compress the rich graph neighborhood into a handful of numbers. The topology, the temporal sequences, the multi-hop connections: all gone.
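To see what this flattening looks like in practice, here is a minimal sketch using pandas. The table and column names are illustrative, not from any real schema; the point is that the aggregation step discards the product identities and the connections they create between customers.

```python
# Hypothetical sketch of the flattening step: collapsing order history
# into a few scalar features per customer. Names are illustrative.
import pandas as pd

orders = pd.DataFrame({
    "customer_id": ["C-10", "C-11", "C-11", "C-12", "C-12"],
    "product_id":  ["P-30", "P-30", "P-31", "P-30", "P-31"],
    "amount":      [89.0,   89.0,   45.0,   89.0,   45.0],
})

# Aggregations compress each customer's graph neighborhood into one row.
flat = orders.groupby("customer_id").agg(
    order_count=("amount", "size"),
    avg_amount=("amount", "mean"),
).reset_index()

print(flat)
# product_id is gone after the groupby: who bought the *same* products,
# and what happened to those customers, is unrecoverable from `flat`.
```

Every downstream model sees only `flat`; the co-purchase structure linking C-10 to C-12 through P-30 no longer exists in the input.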
How GNNs work: message passing explained
The core idea behind graph neural networks is simple: let each node learn from its neighbors. The mechanism is called message passing, and it works in layers.
To make this concrete, consider a small e-commerce graph. We will trace what Customer Alice “knows” after each layer of message passing.
customers
| customer_id | name | segment | lifetime_value |
|---|---|---|---|
| C-10 | Alice | Premium | $4,200 |
| C-11 | Bob | Standard | $890 |
| C-12 | Carol | Churned | $1,400 |
orders
| order_id | customer_id | product_id | date | amount |
|---|---|---|---|---|
| O-50 | C-10 | P-30 | 2025-10-01 | $89 |
| O-51 | C-11 | P-30 | 2025-10-05 | $89 |
| O-52 | C-11 | P-31 | 2025-10-12 | $45 |
| O-53 | C-12 | P-30 | 2025-09-15 | $89 |
| O-54 | C-12 | P-31 | 2025-09-20 | $45 |
Alice and Bob both bought product P-30. Bob and Carol both bought P-30 and P-31. Alice and Carol are connected through P-30 despite never interacting directly.
products
| product_id | name | category | return_rate |
|---|---|---|---|
| P-30 | Wireless Earbuds Pro | Electronics | 18% |
| P-31 | Phone Case Deluxe | Accessories | 3% |
Highlighted: P-30 has an 18% return rate. This quality signal will propagate to Alice through the graph.
Layer 1: immediate neighbors
Each node starts with its own features (the columns from its database row). In the first message-passing layer, each node collects the features of its direct neighbors. After layer 1, Alice's representation now encodes: her own attributes (Premium, $4,200 LTV) plus information about her order O-50 and product P-30 (Wireless Earbuds, 18% return rate).
Layer 2: two-hop neighborhood
The process repeats. Each node again collects representations from its neighbors, but now those neighbor representations already encode 1-hop information. After layer 2, Alice's representation additionally encodes: Bob and Carol also bought P-30. Carol is a churned customer. Bob also bought P-31 (low return rate).
Layer 3: three-hop neighborhood
After three layers, Alice's representation captures: her own purchase of a high-return-rate product (P-30), the fact that Carol (who also bought P-30) churned, and that Bob (who also bought P-30) additionally bought a low-return product. The model now has a multi-hop churn signal for Alice: she bought the same product as a churned customer, and that product has quality issues.
flat_feature_table (what XGBoost sees for Alice)
| customer_id | order_count | avg_amount | segment | ltv |
|---|---|---|---|---|
| C-10 | 1 | $89.00 | Premium | $4,200 |
One row. The flat table shows a Premium customer with one recent order. No indication that Alice shares a problematic product with a churned customer. The multi-hop churn signal is invisible.
Layer k: k-hop neighborhood
After k layers, each node's representation captures patterns from all nodes within k hops. With 3-4 layers, a customer node can encode information about: their orders, the products they bought, the reviews of those products by other customers, and the churn behavior of those other customers. That is a 4-hop path no data scientist would manually engineer as a feature.
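The layer-by-layer walkthrough above can be reproduced numerically with a stripped-down sketch. This uses plain mean aggregation over the toy graph; a real GNN adds learned weight matrices and nonlinearities at each layer, which are omitted here for clarity. The two-channel features are illustrative scalars.

```python
# Minimal mean-aggregation message passing on the toy e-commerce graph.
# Feature channels per node: [is_churned, return_rate] (illustrative).
import numpy as np

nodes = ["Alice", "Bob", "Carol", "P-30", "P-31"]
idx = {n: i for i, n in enumerate(nodes)}

# Edges derived from the orders table: customer -- product, undirected.
edges = [("Alice", "P-30"), ("Bob", "P-30"), ("Bob", "P-31"),
         ("Carol", "P-30"), ("Carol", "P-31")]

A = np.eye(len(nodes))                  # self-loops keep own features
for u, v in edges:
    A[idx[u], idx[v]] = A[idx[v], idx[u]] = 1
A = A / A.sum(axis=1, keepdims=True)    # row-normalize: mean aggregation

h = np.array([[0, 0.0],    # Alice
              [0, 0.0],    # Bob
              [1, 0.0],    # Carol (churned)
              [0, 0.18],   # P-30 (18% return rate)
              [0, 0.03]])  # P-31

for layer in range(2):
    h = A @ h              # each node averages over its neighborhood

# After layer 1, Alice picks up P-30's return rate; after layer 2, the
# churn signal arrives along the 2-hop path Carol -> P-30 -> Alice.
print(h[idx["Alice"]])
```

Both channels of Alice's representation are now nonzero, even though her own row in the database says nothing about returns or churn. That is the multi-hop signal the flat table cannot express.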
The evidence: GNNs vs flat-table models
The RelBench benchmark, published at NeurIPS 2024, provides the most rigorous comparison between GNNs and traditional approaches on relational data. It includes 7 real-world databases (Amazon products, Stack Overflow, clinical trials, and others), 30 prediction tasks, and over 103 million rows.
A Stanford-trained data scientist manually engineered features for each task, then trained LightGBM. A graph neural network was trained directly on the relational structure with no manual feature engineering. The results were decisive.
| Approach | AUROC (classification) | Human effort |
|---|---|---|
| LightGBM + manual features | 62.44 | 12.3 hours, 878 lines per task |
| GNN on relational graph | 75.83 | Zero feature engineering |
| KumoRFM (foundation model) | 76.71 (zero-shot) | Zero training, zero features |
| KumoRFM (fine-tuned) | 81.14 | Minimal fine-tuning |
The GNN outperformed manual feature engineering on 11 of 12 classification tasks. The gap is not marginal. A 13-point AUROC improvement is the difference between a model that barely beats random and one that drives real business decisions.
Real-world applications
Fraud detection: seeing the ring
Fraud rarely happens in isolation. A single fraudulent transaction might look normal. But zoom out and the fraudster shares a device fingerprint with 5 other accounts, those accounts share shipping addresses with 12 more, and 8 of those accounts were created in the same 48-hour window. This is a fraud ring, and it is invisible to flat-table models because the signal exists in the graph topology.
Here is what a fraud ring looks like in the raw data. Each table looks normal on its own. The pattern only emerges when you follow the edges.
accounts
| account_id | name | created | email_domain | status |
|---|---|---|---|---|
| ACC-771 | Jordan Blake | 2025-11-02 | gmail.com | Active |
| ACC-772 | Taylor Reed | 2025-11-03 | protonmail.com | Active |
| ACC-773 | Casey Morales | 2025-11-03 | yahoo.com | Active |
| ACC-774 | Morgan Park | 2025-11-04 | outlook.com | Suspended |
Four accounts created within 48 hours. Different names, different email providers. Nothing suspicious in isolation.
devices
| device_id | fingerprint | account_id | first_seen |
|---|---|---|---|
| D-901 | fp_8a3bc21d | ACC-771 | 2025-11-02 |
| D-902 | fp_8a3bc21d | ACC-772 | 2025-11-03 |
| D-903 | fp_8a3bc21d | ACC-773 | 2025-11-03 |
| D-904 | fp_8a3bc21d | ACC-774 | 2025-11-04 |
Highlighted: all four accounts share the same device fingerprint (fp_8a3bc21d). This is a single person operating four accounts.
transactions
| txn_id | account_id | amount | merchant | date |
|---|---|---|---|---|
| T-5501 | ACC-771 | $2,400 | ElectroMart | 2025-11-15 |
| T-5502 | ACC-772 | $1,850 | ElectroMart | 2025-11-15 |
| T-5503 | ACC-773 | $3,100 | TechDirect | 2025-11-16 |
| T-5504 | ACC-774 | $2,750 | TechDirect | 2025-11-16 |
Coordinated high-value purchases across two days. Each transaction is within normal limits. The ring pattern is only visible through the shared device edge.
PQL Query
PREDICT transactions.is_fraud FOR EACH accounts.account_id
The model traverses the graph: account to device to other accounts sharing that device. It discovers the ring topology without anyone defining 'shared_device_count' as a feature.
Output
| account_id | fraud_probability | top_signal |
|---|---|---|
| ACC-771 | 0.92 | Shared device with 3 flagged accounts |
| ACC-772 | 0.89 | Same device cluster, coordinated timing |
| ACC-773 | 0.91 | Device ring + high-value burst pattern |
| ACC-774 | 0.95 | Already suspended, device links confirm ring |
GNNs propagate suspicion through the graph. A flagged account increases the risk score of connected accounts, which increases the risk of accounts connected to those. PayPal reported using GNN-based approaches for fraud detection at scale. Industry benchmarks show improvements from 43% to 76% detection rates when moving from flat-table models to graph-aware approaches on fraud tasks, because the relational structure encodes the fraud ring pattern that individual transaction features miss.
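The propagation idea can be sketched in a few lines. This is a deliberately crude single round of suspicion blending over shared-device clusters, with made-up risk scores; a GNN learns the weighting from data rather than using a fixed 50/50 blend.

```python
# One round of suspicion propagation over shared-device edges.
# Risk scores and the 0.5 blend weight are illustrative assumptions.
from collections import defaultdict

device_links = {"ACC-771": "fp_8a3bc21d", "ACC-772": "fp_8a3bc21d",
                "ACC-773": "fp_8a3bc21d", "ACC-774": "fp_8a3bc21d"}
risk = {"ACC-771": 0.1, "ACC-772": 0.1, "ACC-773": 0.1,
        "ACC-774": 0.9}   # ACC-774 is already suspended

# Group accounts by fingerprint: one hop, account -> device -> accounts.
clusters = defaultdict(list)
for acc, fp in device_links.items():
    clusters[fp].append(acc)

# Each account blends its own risk with the peak risk in its cluster.
for accounts in clusters.values():
    peak = max(risk[a] for a in accounts)
    for a in accounts:
        risk[a] = 0.5 * risk[a] + 0.5 * peak

print(risk)  # every account in the device ring now scores >= 0.5
```

No one defined a `shared_device_count` feature; the elevated scores for ACC-771 through ACC-773 fall out of the graph traversal itself.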
Recommendations: beyond collaborative filtering
Traditional recommendation systems use collaborative filtering: find users who liked similar items, recommend what they liked. This works for popular items with lots of interaction data. It fails for new items (cold start) and niche items (data sparsity).
GNNs solve this by propagating information through the full interaction graph. A new product has no purchase history, but it has attributes: category, brand, price range, description. The GNN connects it to similar products and propagates their interaction patterns. DoorDash applied this approach and reported a 1.8% engagement lift across 30 million users. At DoorDash's scale, that translates to millions of additional interactions per month.
Churn prediction: the behavioral graph
A customer's churn risk depends on more than their own behavior. It depends on the behavior of customers like them. If 40% of customers who bought the same product combination churned within 60 days, that product combination is a churn indicator. The signal is in the graph: customer → orders → products → orders (others) → customer status.
GNNs learn this pattern automatically through message passing. No human needs to hypothesize that product combinations predict churn. The model discovers it from the structure.
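For intuition, the 2-hop signal can be computed explicitly on the toy tables from earlier: for each customer, the churn rate among other customers who bought at least one of the same products. A GNN arrives at this kind of feature implicitly through message passing; the explicit version below is only a sketch.

```python
# Explicit 2-hop churn signal: churn rate among co-purchasers.
# Data mirrors the toy tables; a GNN learns this pattern implicitly.
purchases = {"Alice": {"P-30"},
             "Bob":   {"P-30", "P-31"},
             "Carol": {"P-30", "P-31"}}
churned = {"Alice": False, "Bob": False, "Carol": True}

def copurchaser_churn_rate(customer):
    # Peers: customers sharing at least one product with `customer`.
    peers = [c for c, prods in purchases.items()
             if c != customer and prods & purchases[customer]]
    if not peers:
        return 0.0
    return sum(churned[c] for c in peers) / len(peers)

print(copurchaser_churn_rate("Alice"))  # 0.5: Bob and Carol, Carol churned
```

Alice's own row carries no churn signal at all; the 0.5 comes entirely from the customer → product → customer path.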
Supply chain risk
A manufacturer's risk profile depends on its suppliers, their suppliers, and so on. When a Taiwan Semiconductor fab goes offline, the impact ripples through 3-4 tiers of the supply chain. GNNs model this propagation directly, predicting which downstream products will face delays based on the graph topology of the supply network.
Limitations of standard GNNs
GNNs are not a complete solution on their own. They have three well-documented limitations that drove the next evolution.
Over-smoothing
As you add more message-passing layers, node representations become increasingly similar. After 6-8 layers, all nodes in a connected component can converge to nearly identical representations. This limits the effective depth of standard GNNs to 2-4 layers, which means they struggle to capture patterns beyond 4 hops.
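Over-smoothing is easy to demonstrate numerically. In the sketch below, repeated mean aggregation on a small connected graph drives every node's representation toward the same value; the graph and features are arbitrary illustrations.

```python
# Over-smoothing demo: many rounds of mean aggregation wash out the
# distinction between nodes. Graph: path 0-1-2-3 with self-loops.
import numpy as np

A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)
A = A / A.sum(axis=1, keepdims=True)     # row-normalized mean aggregation

h = np.array([[1.0], [0.0], [0.0], [0.0]])  # one distinctive node
for _ in range(20):                          # 20 message-passing layers
    h = A @ h

spread = h.max() - h.min()
print(round(spread, 4))   # near zero: nodes are nearly indistinguishable
```

After 20 layers the representations have collapsed toward a constant, which is why deep stacks of plain message-passing layers stop helping.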
Limited expressiveness
The Weisfeiler-Leman (WL) test defines the theoretical ceiling for GNN expressiveness. Standard message-passing GNNs are bounded by the 1-WL test, meaning there are graph structures they provably cannot distinguish. In practice, this means certain relational patterns are invisible to standard GNNs.
Scalability
Enterprise graphs can have billions of edges. Full-batch training on these graphs exceeds GPU memory. Mini-batch training with neighbor sampling introduces approximation errors. Scaling GNNs to production-size enterprise databases requires careful engineering.
Standard GNN limitations
- Over-smoothing limits effective depth to 2-4 layers
- Bounded by 1-WL expressiveness
- Scalability challenges on billion-edge graphs
- Trained from scratch for each task
- Requires task-specific labeled data
Graph transformer / foundation model
- Attention mechanism avoids over-smoothing
- Global attention exceeds WL expressiveness bound
- Pre-training amortizes compute across tasks
- Zero-shot or few-shot predictions on new databases
- Learns universal relational patterns from diverse data
From GNNs to graph transformers to foundation models
Graph transformers replace the local message-passing mechanism with global self-attention. Instead of aggregating only from direct neighbors, each node attends to a broader set of nodes, weighted by learned relevance. This overcomes the over-smoothing problem and the expressiveness ceiling.
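The structural difference shows up clearly in a toy sketch. Below is one round of scaled dot-product self-attention over node features; the random matrices stand in for trained parameters, and real graph transformers add positional/structural encodings and multiple heads that are omitted here.

```python
# Global self-attention over nodes: every node attends to every other
# node, with no adjacency matrix in the update. Weights are random
# stand-ins for trained parameters.
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8                       # 5 nodes, 8-dim features
h = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = h @ Wq, h @ Wk, h @ Wv
scores = Q @ K.T / np.sqrt(d)     # learned node-to-node relevance
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn = attn / attn.sum(axis=1, keepdims=True)   # softmax over ALL nodes

h_next = attn @ V   # aggregation is global, not restricted to neighbors
print(attn.shape)   # (5, 5): a dense attention matrix
```

Contrast this with the message-passing update, where the aggregation matrix is fixed by the graph's adjacency: here the model learns which nodes matter, including nodes many hops away.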
The key paper was Relational Deep Learning (Fey et al., ICML 2024), which showed that representing a relational database as a temporal heterogeneous graph and training a graph transformer on it outperforms both manual feature engineering and standard GNNs.
KumoRFM extends this by pre-training a graph transformer on billions of rows across thousands of diverse relational databases. The model learns universal patterns: recency effects, frequency signals, temporal dynamics, graph propagation. At inference, it generalizes to new databases it has never seen.
The trajectory mirrors what happened in NLP. Word2Vec (2013) showed that learned representations beat hand-crafted features. LSTMs and attention mechanisms (2014-2017) improved sequence modeling. Transformers (2017) enabled pre-training at scale. GPT (2018-2023) showed that a single pre-trained model could generalize to new tasks.
For graphs, the sequence is similar: GNNs showed that learned representations beat hand-crafted features. Graph transformers improved expressiveness. Foundation models (KumoRFM) showed that a single pre-trained model can generalize to new relational databases. The difference is that while GPT operates on text sequences, KumoRFM operates on the graph structure of relational databases, which is the native format of enterprise data.
If your ML pipeline starts by flattening a graph into a table, you are paying a tax in both time (feature engineering) and accuracy (information loss). GNNs and their descendants remove that tax. The database is the model input. The graph is the feature.