Pull up the schema diagram for any enterprise database. You will see tables connected by foreign keys: customers linked to orders, orders linked to products, products linked to categories, categories linked to suppliers. Draw it on a whiteboard and it looks like a graph. It is a graph. It has always been a graph.
For 30 years, machine learning ignored this structure. Every ML model from logistic regression to XGBoost requires a flat input table: one row per entity, one column per feature. To use these models on relational data, you must flatten the graph into a spreadsheet. That flattening step is called feature engineering, and it destroys the very relationships that make relational data valuable.
Graph neural networks are the first ML architecture designed to operate on graph-structured data directly. They read nodes, edges, and their attributes, and they learn representations that capture the relational patterns between entities. Applied to enterprise databases, they eliminate the flattening step and access signals that flat-table models structurally cannot reach.
Graphs are everywhere (you just flatten them)
The word "graph" in the ML context does not mean a bar chart. It means a data structure with nodes (entities) and edges (relationships). Social networks are graphs (users connected by friendships). Transaction networks are graphs (accounts connected by payments). And relational databases are graphs, though most people do not think of them that way.
Consider an e-commerce database. The customers table has 100,000 rows. Each customer places orders (orders table, 2 million rows). Each order contains items (order_items, 8 million rows). Each item references a product (products, 50,000 rows). Each product belongs to a category. Each product has reviews from other customers.
This is a heterogeneous graph with six node types (customers, orders, order items, products, categories, reviews) and millions of edges. A single customer is connected to dozens of orders, hundreds of products, thousands of reviews from other customers, and those customers' own purchase histories. The graph encodes who bought what, when, and what other people who bought the same things did next.
Traditional ML cannot see this structure. To predict churn for a customer, a data scientist computes aggregations: total_orders_30d, avg_order_value, days_since_last_order. These features compress the rich graph neighborhood into a handful of numbers. The topology, the temporal sequences, the multi-hop connections: all gone.
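To see what this flattening looks like in practice, here is a minimal sketch using pandas. The table and column names are illustrative, not from any real schema; the point is that the aggregation step discards the product identities and the connections they create between customers.

```python
# Hypothetical sketch of the flattening step: collapsing order history
# into a few scalar features per customer. Names are illustrative.
import pandas as pd

orders = pd.DataFrame({
    "customer_id": ["C-10", "C-11", "C-11", "C-12", "C-12"],
    "product_id":  ["P-30", "P-30", "P-31", "P-30", "P-31"],
    "amount":      [89.0,   89.0,   45.0,   89.0,   45.0],
})

# Aggregations compress each customer's graph neighborhood into one row.
flat = orders.groupby("customer_id").agg(
    order_count=("amount", "size"),
    avg_amount=("amount", "mean"),
).reset_index()

print(flat)
# product_id is gone after the groupby: who bought the *same* products,
# and what happened to those customers, is unrecoverable from `flat`.
```

Every downstream model sees only `flat`; the co-purchase structure linking C-10 to C-12 through P-30 no longer exists in the input.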
How GNNs work: message passing explained
The core idea behind graph neural networks is simple: let each node learn from its neighbors. The mechanism is called message passing, and it works in layers.
To make this concrete, consider a small e-commerce graph. We will trace what Customer Alice “knows” after each layer of message passing.
customers
| customer_id | name | segment | lifetime_value |
|---|---|---|---|
| C-10 | Alice | Premium | $4,200 |
| C-11 | Bob | Standard | $890 |
| C-12 | Carol | Churned | $1,400 |
orders
| order_id | customer_id | product_id | date | amount |
|---|---|---|---|---|
| O-50 | C-10 | P-30 | 2025-10-01 | $89 |
| O-51 | C-11 | P-30 | 2025-10-05 | $89 |
| O-52 | C-11 | P-31 | 2025-10-12 | $45 |
| O-53 | C-12 | P-30 | 2025-09-15 | $89 |
| O-54 | C-12 | P-31 | 2025-09-20 | $45 |
Alice and Bob both bought product P-30. Bob and Carol both bought P-30 and P-31. Alice and Carol are connected through P-30 despite never interacting directly.
products
| product_id | name | category | return_rate |
|---|---|---|---|
| P-30 | Wireless Earbuds Pro | Electronics | 18% |
| P-31 | Phone Case Deluxe | Accessories | 3% |
Highlighted: P-30 has an 18% return rate. This quality signal will propagate to Alice through the graph.
Layer 1: immediate neighbors
Each node starts with its own features (the columns from its database row). In the first message-passing layer, each node collects the features of its direct neighbors. After layer 1, Alice's representation now encodes: her own attributes (Premium, $4,200 LTV) plus information about her order O-50 and product P-30 (Wireless Earbuds, 18% return rate).
Layer 2: two-hop neighborhood
The process repeats. Each node again collects representations from its neighbors, but now those neighbor representations already encode 1-hop information. After layer 2, Alice's representation additionally encodes: Bob and Carol also bought P-30. Carol is a churned customer. Bob also bought P-31 (low return rate).
Layer 3: three-hop neighborhood
After three layers, Alice's representation captures: her own purchase of a high-return-rate product (P-30), the fact that Carol (who also bought P-30) churned, and that Bob (who also bought P-30) additionally bought a low-return product. The model now has a multi-hop churn signal for Alice: she bought the same product as a churned customer, and that product has quality issues.
flat_feature_table (what XGBoost sees for Alice)
| customer_id | order_count | avg_amount | segment | ltv |
|---|---|---|---|---|
| C-10 | 1 | $89.00 | Premium | $4,200 |
One row. The flat table shows a Premium customer with one recent order. No indication that Alice shares a problematic product with a churned customer. The multi-hop churn signal is invisible.
Layer k: k-hop neighborhood
After k layers, each node's representation captures patterns from all nodes within k hops. With 3-4 layers, a customer node can encode information about: their orders, the products they bought, the reviews of those products by other customers, and the churn behavior of those other customers. That is a 4-hop path no data scientist would manually engineer as a feature.
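The layer-by-layer walkthrough above can be reproduced numerically with a stripped-down sketch. This uses plain mean aggregation over the toy graph; a real GNN adds learned weight matrices and nonlinearities at each layer, which are omitted here for clarity. The two-channel features are illustrative scalars.

```python
# Minimal mean-aggregation message passing on the toy e-commerce graph.
# Feature channels per node: [is_churned, return_rate] (illustrative).
import numpy as np

nodes = ["Alice", "Bob", "Carol", "P-30", "P-31"]
idx = {n: i for i, n in enumerate(nodes)}

# Edges derived from the orders table: customer -- product, undirected.
edges = [("Alice", "P-30"), ("Bob", "P-30"), ("Bob", "P-31"),
         ("Carol", "P-30"), ("Carol", "P-31")]

A = np.eye(len(nodes))                  # self-loops keep own features
for u, v in edges:
    A[idx[u], idx[v]] = A[idx[v], idx[u]] = 1
A = A / A.sum(axis=1, keepdims=True)    # row-normalize: mean aggregation

h = np.array([[0, 0.0],    # Alice
              [0, 0.0],    # Bob
              [1, 0.0],    # Carol (churned)
              [0, 0.18],   # P-30 (18% return rate)
              [0, 0.03]])  # P-31

for layer in range(2):
    h = A @ h              # each node averages over its neighborhood

# After layer 1, Alice picks up P-30's return rate; after layer 2, the
# churn signal arrives along the 2-hop path Carol -> P-30 -> Alice.
print(h[idx["Alice"]])
```

Both channels of Alice's representation are now nonzero, even though her own row in the database says nothing about returns or churn. That is the multi-hop signal the flat table cannot express.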
The evidence: GNNs vs flat-table models
The RelBench benchmark, published at NeurIPS 2024, provides the most rigorous comparison between GNNs and traditional approaches on relational data. It includes 7 real-world databases (Amazon products, Stack Overflow, clinical trials, and others), 30 prediction tasks, and over 103 million rows.
A Stanford-trained data scientist manually engineered features for each task, then trained LightGBM. A graph neural network was trained directly on the relational structure with no manual feature engineering. The results were decisive.
| Approach | AUROC (classification) | Human effort |
|---|---|---|
| LightGBM + manual features | 62.44 | 12.3 hours, 878 lines per task |
| GNN on relational graph | 75.83 | Zero feature engineering |
| KumoRFM (foundation model) | 76.71 (zero-shot) | Zero training, zero features |
| KumoRFM (fine-tuned) | 81.14 | Minimal fine-tuning |
The GNN outperformed manual feature engineering on 11 of 12 classification tasks. The gap is not marginal. A 13-point AUROC improvement is the difference between a model that barely beats random and one that drives real business decisions.
Real-world applications
Fraud detection: seeing the ring
Fraud rarely happens in isolation. A single fraudulent transaction might look normal. But zoom out and the fraudster shares a device fingerprint with 5 other accounts, those accounts share shipping addresses with 12 more, and 8 of those accounts were created in the same 48-hour window. This is a fraud ring, and it is invisible to flat-table models because the signal exists in the graph topology.
Here is what a fraud ring looks like in the raw data. Each table looks normal on its own. The pattern only emerges when you follow the edges.
accounts
| account_id | name | created | email_domain | status |
|---|---|---|---|---|
| ACC-771 | Jordan Blake | 2025-11-02 | gmail.com | Active |
| ACC-772 | Taylor Reed | 2025-11-03 | protonmail.com | Active |
| ACC-773 | Casey Morales | 2025-11-03 | yahoo.com | Active |
| ACC-774 | Morgan Park | 2025-11-04 | outlook.com | Suspended |
Four accounts created within 48 hours. Different names, different email providers. Nothing suspicious in isolation.
devices
| device_id | fingerprint | account_id | first_seen |
|---|---|---|---|
| D-901 | fp_8a3bc21d | ACC-771 | 2025-11-02 |
| D-902 | fp_8a3bc21d | ACC-772 | 2025-11-03 |
| D-903 | fp_8a3bc21d | ACC-773 | 2025-11-03 |
| D-904 | fp_8a3bc21d | ACC-774 | 2025-11-04 |
Highlighted: all four accounts share the same device fingerprint (fp_8a3bc21d). This is a single person operating four accounts.
transactions
| txn_id | account_id | amount | merchant | date |
|---|---|---|---|---|
| T-5501 | ACC-771 | $2,400 | ElectroMart | 2025-11-15 |
| T-5502 | ACC-772 | $1,850 | ElectroMart | 2025-11-15 |
| T-5503 | ACC-773 | $3,100 | TechDirect | 2025-11-16 |
| T-5504 | ACC-774 | $2,750 | TechDirect | 2025-11-16 |
Coordinated high-value purchases across two days. Each transaction is within normal limits. The ring pattern is only visible through the shared device edge.
PQL Query
PREDICT transactions.is_fraud FOR EACH accounts.account_id
The model traverses the graph: account to device to other accounts sharing that device. It discovers the ring topology without anyone defining 'shared_device_count' as a feature.
Output
| account_id | fraud_probability | top_signal |
|---|---|---|
| ACC-771 | 0.92 | Shared device with 3 flagged accounts |
| ACC-772 | 0.89 | Same device cluster, coordinated timing |
| ACC-773 | 0.91 | Device ring + high-value burst pattern |
| ACC-774 | 0.95 | Already suspended, device links confirm ring |
GNNs propagate suspicion through the graph. A flagged account increases the risk score of connected accounts, which increases the risk of accounts connected to those. PayPal reported using GNN-based approaches for fraud detection at scale. Industry benchmarks show improvements from 43% to 76% detection rates when moving from flat-table models to graph-aware approaches on fraud tasks, because the relational structure encodes the fraud ring pattern that individual transaction features miss.
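The propagation idea can be sketched in a few lines. This is a deliberately crude single round of suspicion blending over shared-device clusters, with made-up risk scores; a GNN learns the weighting from data rather than using a fixed 50/50 blend.

```python
# One round of suspicion propagation over shared-device edges.
# Risk scores and the 0.5 blend weight are illustrative assumptions.
from collections import defaultdict

device_links = {"ACC-771": "fp_8a3bc21d", "ACC-772": "fp_8a3bc21d",
                "ACC-773": "fp_8a3bc21d", "ACC-774": "fp_8a3bc21d"}
risk = {"ACC-771": 0.1, "ACC-772": 0.1, "ACC-773": 0.1,
        "ACC-774": 0.9}   # ACC-774 is already suspended

# Group accounts by fingerprint: one hop, account -> device -> accounts.
clusters = defaultdict(list)
for acc, fp in device_links.items():
    clusters[fp].append(acc)

# Each account blends its own risk with the peak risk in its cluster.
for accounts in clusters.values():
    peak = max(risk[a] for a in accounts)
    for a in accounts:
        risk[a] = 0.5 * risk[a] + 0.5 * peak

print(risk)  # every account in the device ring now scores >= 0.5
```

No one defined a `shared_device_count` feature; the elevated scores for ACC-771 through ACC-773 fall out of the graph traversal itself.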
Recommendations: beyond collaborative filtering
Traditional recommendation systems use collaborative filtering: find users who liked similar items, recommend what they liked. This works for popular items with lots of interaction data. It fails for new items (cold start) and niche items (data sparsity).
GNNs solve this by propagating information through the full interaction graph. A new product has no purchase history, but it has attributes: category, brand, price range, description. The GNN connects it to similar products and propagates their interaction patterns. DoorDash applied this approach and reported a 1.8% engagement lift across 30 million users. At DoorDash's scale, that translates to millions of additional interactions per month.
Churn prediction: the behavioral graph
A customer's churn risk depends on more than their own behavior. It depends on the behavior of customers like them. If 40% of customers who bought the same product combination churned within 60 days, that product combination is a churn indicator. The signal is in the graph: customer → orders → products → orders (others) → customer status.
GNNs learn this pattern automatically through message passing. No human needs to hypothesize that product combinations predict churn. The model discovers it from the structure.
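For intuition, the 2-hop signal can be computed explicitly on the toy tables from earlier: for each customer, the churn rate among other customers who bought at least one of the same products. A GNN arrives at this kind of feature implicitly through message passing; the explicit version below is only a sketch.

```python
# Explicit 2-hop churn signal: churn rate among co-purchasers.
# Data mirrors the toy tables; a GNN learns this pattern implicitly.
purchases = {"Alice": {"P-30"},
             "Bob":   {"P-30", "P-31"},
             "Carol": {"P-30", "P-31"}}
churned = {"Alice": False, "Bob": False, "Carol": True}

def copurchaser_churn_rate(customer):
    # Peers: customers sharing at least one product with `customer`.
    peers = [c for c, prods in purchases.items()
             if c != customer and prods & purchases[customer]]
    if not peers:
        return 0.0
    return sum(churned[c] for c in peers) / len(peers)

print(copurchaser_churn_rate("Alice"))  # 0.5: Bob and Carol, Carol churned
```

Alice's own row carries no churn signal at all; the 0.5 comes entirely from the customer → product → customer path.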
Supply chain risk
A manufacturer's risk profile depends on its suppliers, their suppliers, and so on. When a Taiwan Semiconductor fab goes offline, the impact ripples through 3-4 tiers of the supply chain. GNNs model this propagation directly, predicting which downstream products will face delays based on the graph topology of the supply network.
Limitations of standard GNNs
GNNs are not a complete solution on their own. They have three well-documented limitations that drove the next evolution.
Over-smoothing
As you add more message-passing layers, node representations become increasingly similar. After 6-8 layers, all nodes in a connected component can converge to nearly identical representations. This limits the effective depth of standard GNNs to 2-4 layers, which means they struggle to capture patterns beyond 4 hops.
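Over-smoothing is easy to demonstrate numerically. In the sketch below, repeated mean aggregation on a small connected graph drives every node's representation toward the same value; the graph and features are arbitrary illustrations.

```python
# Over-smoothing demo: many rounds of mean aggregation wash out the
# distinction between nodes. Graph: path 0-1-2-3 with self-loops.
import numpy as np

A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)
A = A / A.sum(axis=1, keepdims=True)     # row-normalized mean aggregation

h = np.array([[1.0], [0.0], [0.0], [0.0]])  # one distinctive node
for _ in range(20):                          # 20 message-passing layers
    h = A @ h

spread = h.max() - h.min()
print(round(spread, 4))   # near zero: nodes are nearly indistinguishable
```

After 20 layers the representations have collapsed toward a constant, which is why deep stacks of plain message-passing layers stop helping.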
Limited expressiveness
The Weisfeiler-Leman (WL) test defines the theoretical ceiling for GNN expressiveness. Standard message-passing GNNs are bounded by the 1-WL test, meaning there are graph structures they provably cannot distinguish. In practice, this means certain relational patterns are invisible to standard GNNs.
Scalability
Enterprise graphs can have billions of edges. Full-batch training on these graphs exceeds GPU memory. Mini-batch training with neighbor sampling introduces approximation errors. Scaling GNNs to production-size enterprise databases requires careful engineering.
Standard GNN limitations
- Over-smoothing limits effective depth to 2-4 layers
- Bounded by 1-WL expressiveness
- Scalability challenges on billion-edge graphs
- Trained from scratch for each task
- Requires task-specific labeled data
Graph transformer / foundation model
- Attention mechanism avoids over-smoothing
- Global attention exceeds WL expressiveness bound
- Pre-training amortizes compute across tasks
- Zero-shot or few-shot predictions on new databases
- Learns universal relational patterns from diverse data
From GNNs to graph transformers to foundation models
Graph transformers replace the local message-passing mechanism with global self-attention. Instead of aggregating only from direct neighbors, each node attends to a broader set of nodes, weighted by learned relevance. This overcomes the over-smoothing problem and the expressiveness ceiling.
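The structural difference shows up clearly in a toy sketch. Below is one round of scaled dot-product self-attention over node features; the random matrices stand in for trained parameters, and real graph transformers add positional/structural encodings and multiple heads that are omitted here.

```python
# Global self-attention over nodes: every node attends to every other
# node, with no adjacency matrix in the update. Weights are random
# stand-ins for trained parameters.
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8                       # 5 nodes, 8-dim features
h = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = h @ Wq, h @ Wk, h @ Wv
scores = Q @ K.T / np.sqrt(d)     # learned node-to-node relevance
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn = attn / attn.sum(axis=1, keepdims=True)   # softmax over ALL nodes

h_next = attn @ V   # aggregation is global, not restricted to neighbors
print(attn.shape)   # (5, 5): a dense attention matrix
```

Contrast this with the message-passing update, where the aggregation matrix is fixed by the graph's adjacency: here the model learns which nodes matter, including nodes many hops away.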
The key paper was Relational Deep Learning (Fey et al., ICML 2024), which showed that representing a relational database as a temporal heterogeneous graph and training a graph transformer on it outperforms both manual feature engineering and standard GNNs.
KumoRFM extends this by pre-training a graph transformer on billions of rows across thousands of diverse relational databases. The model learns universal patterns: recency effects, frequency signals, temporal dynamics, graph propagation. At inference, it generalizes to new databases it has never seen.
The trajectory mirrors what happened in NLP. Word2Vec (2013) showed that learned representations beat hand-crafted features. LSTMs and attention mechanisms (2014-2017) improved sequence modeling. Transformers (2017) enabled pre-training at scale. GPT (2018-2023) showed that a single pre-trained model could generalize to new tasks.
For graphs, the sequence is similar: GNNs showed that learned representations beat hand-crafted features. Graph transformers improved expressiveness. Foundation models (KumoRFM) showed that a single pre-trained model can generalize to new relational databases. The difference is that while GPT operates on text sequences, KumoRFM operates on the graph structure of relational databases, which is the native format of enterprise data.
If your ML pipeline starts by flattening a graph into a table, you are paying a tax in both time (feature engineering) and accuracy (information loss). GNNs and their descendants remove that tax. The database is the model input. The graph is the feature.