
Graph ML for Enterprise: A Practical Guide

Graph ML has moved from research papers to production systems at DoorDash, Snowflake, and Reddit. This guide covers when it adds value, how to evaluate it, and what deployment actually looks like.

TL;DR

  • Graph ML is in production at DoorDash (1.8% engagement lift, 30M users), Pinterest (450M MAU), Visa (billions of transactions), Snowflake (3.2x expansion revenue lift), and Databricks (5.4x conversion lift).
  • On RelBench, GNNs achieve 75.83 AUROC vs. 62.44 for LightGBM with manual features -- a 13.4-point improvement from capturing multi-hop patterns, temporal sequences, and network effects that flat tables miss.
  • Graph ML adds the most value with 3+ connected tables, network-effect tasks (fraud, recommendations), and cold-start entities. It adds little value on single-table data or extremely sparse graphs.
  • Three paths to production: custom GNN ($500K-1M, 3-6 months, full control), graph ML platform ($50K-150K, 2-4 weeks), or relational foundation model ($5K-20K/task, minutes to first prediction).
  • Start with a 30-day evaluation: identify 3 multi-table prediction tasks, run zero-shot foundation model predictions, compare against production baselines, and build the business case.

In 2021, DoorDash published a blog post describing how they rebuilt their recommendation system using graph neural networks. The result was a 1.8% engagement lift across 30 million users. That may sound small until you calculate the revenue impact at DoorDash's scale: roughly $50 million in incremental annual GMV.

Two years later, graph ML is running in production at Pinterest (content recommendations for 450 million monthly users), Visa (fraud detection on billions of transactions), Snap (friend suggestions), and dozens of Fortune 500 companies that do not publicize their implementations. Snowflake used it internally for expansion revenue prediction and saw a 3.2x lift over their previous gradient-boosted model.

Yet most enterprise data science teams have never shipped a graph ML model. The gap is not technical. It is informational. Teams do not know when graph ML adds value, what the evaluation criteria should be, or what deployment actually requires. This guide addresses all three questions.

What graph ML actually does

Every machine learning model needs to find patterns in data. Traditional models (logistic regression, XGBoost, random forests) find patterns in flat tables: rows of features, one per entity. Graph ML finds patterns in connected structures: entities as nodes, relationships as edges.

The distinction matters because most enterprise data is inherently relational. A customer is connected to orders, orders to products, products to categories, categories to seasonal trends. A patient is connected to diagnoses, prescriptions, lab results, providers, and insurance claims. These connections carry predictive signal that flat tables cannot represent.

Consider churn prediction. A traditional model might use features like days since last purchase, total spend, and number of support tickets. A graph ML model sees all of that, plus signals like these: the customer's neighbors (people who bought similar products) are churning at 3x the normal rate; the products they recently purchased have a 40% return rate; the support agent they interacted with has a resolution rate 20 points below average.

None of those signals exist in a flat feature table unless someone manually engineers them. And nobody does, because the combinatorial space of possible multi-hop features is too large to explore by hand.
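To make that bottleneck concrete, here is what a single hand-engineered two-hop feature looks like in code -- a minimal Python sketch with made-up tables, not a production pipeline:

```python
# One hand-built multi-hop feature: the churn rate among "peers" -- other
# customers who bought the same products. Data below is illustrative.
purchases = [("C1", "P1"), ("C2", "P1"), ("C3", "P1"), ("C4", "P2")]
churned = {"C2", "C3"}

def peer_churn_rate(customer):
    # Hop 1: products this customer bought
    products = {p for c, p in purchases if c == customer}
    # Hop 2: other customers who bought those products
    peers = {c for c, p in purchases if p in products and c != customer}
    if not peers:
        return 0.0
    # The feature: fraction of those peers who churned
    return len(peers & churned) / len(peers)

rate = peer_churn_rate("C1")  # C1's peers are C2 and C3; both churned
```

This is one feature out of a combinatorial space of possible traversals; a GNN effectively learns which of these traversals matter during training.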

Production graph ML deployments

| Company | Use Case | Graph Scale | Result | Year |
| --- | --- | --- | --- | --- |
| DoorDash | Recommendations | 30M users, heterogeneous | 1.8% engagement lift | 2021 |
| Pinterest | Content recommendations | 18B pins, 450M MAU | Core ranking system | 2018 |
| Visa | Fraud detection | Billions of transactions | Fraud ring detection | 2020 |
| Snowflake | Expansion revenue | Accounts-users-queries graph | 3.2x lift over GBT | 2023 |
| Databricks | Lead scoring | Companies-contacts-usage graph | 5.4x conversion lift | 2023 |

These are published production deployments, not research prototypes. Graph ML is running at Fortune 500 scale.

How GNNs learn from structure

A graph neural network works through message passing. Each node collects information from its neighbors, aggregates it, and updates its own representation. After multiple rounds, each node's embedding encodes information from its entire local neighborhood, not just its own attributes.

With 3 rounds of message passing, a node's representation captures information from every entity within 3 hops. For a customer in an e-commerce graph, that includes their orders, the products in those orders, other customers who bought those products, and those customers' order histories. That is the kind of signal that takes a data scientist weeks to engineer manually and typically still misses the most informative patterns.
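The mechanics can be sketched in a few lines of plain Python. This toy version uses mean aggregation and no learned weights (a real GNN applies a weight matrix and nonlinearity at each round), but the propagation pattern is the same:

```python
# Minimal mean-aggregation message passing on a toy graph.
# Node features are plain lists; the graph is an adjacency dict.

def message_pass(features, neighbors):
    """One round: each node averages its neighbors' vectors with its own."""
    updated = {}
    for node, feat in features.items():
        msgs = [features[n] for n in neighbors.get(node, [])] + [feat]
        updated[node] = [sum(vals) / len(msgs) for vals in zip(*msgs)]
    return updated

# Toy graph: one customer node connected to two order nodes.
features = {"C": [1.0, 0.0], "O1": [0.0, 1.0], "O2": [0.0, 3.0]}
neighbors = {"C": ["O1", "O2"], "O1": ["C"], "O2": ["C"]}

h1 = message_pass(features, neighbors)  # each node sees 1 hop of context
h2 = message_pass(h1, neighbors)        # each node sees 2 hops of context
```

After the second call, the customer node's vector already mixes in information from entities two hops away -- the same mechanism, stacked 3 layers deep on a real database, produces the neighborhood-aware embeddings described above.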

GNN message passing: a concrete example

| Layer | Customer C-401 Sees | New Information | Embedding Updates |
| --- | --- | --- | --- |
| Input | Own attributes only | segment=Enterprise, tenure=3yr, ARR=$180K | Initial node vector |
| Layer 1 | 5 orders, 2 support tickets | avg_order=$12K, 1 escalated ticket | Adds interaction patterns |
| Layer 2 | 8 products, 3 support agents, 4 invoices | 2 products have 30%+ churn rate among buyers | Adds product-risk signal |
| Layer 3 | 52 similar customers (via shared products) | 38 of 52 are active; 14 churned last quarter | Adds peer-behavior signal |

By layer 3, the model knows that 27% of similar customers (those who bought the same products) churned recently. This peer-churn signal is the strongest predictor but requires traversing 3 tables -- no flat feature table captures it.

When graph ML adds value over traditional approaches

Graph ML is not universally superior to traditional ML. There are specific conditions where it provides measurable uplift, and conditions where the added complexity is not justified.

High-value scenarios

Graph ML consistently outperforms traditional approaches in four situations:

  • Multi-table relational data. If your prediction depends on information spread across 3 or more tables, graph ML eliminates the feature engineering bottleneck. The RelBench benchmark showed that GNNs outperformed LightGBM with manual features on 11 of 12 classification tasks across 7 multi-table databases.
  • Network effects matter. Fraud detection, social recommendations, and marketplace dynamics all depend on how entities relate to each other. A fraudulent transaction is not just about the transaction attributes; it is about the network of accounts, devices, and merchants connected to it.
  • Cold-start problems. New users with no history give traditional models nothing to work with. Graph ML can predict their behavior based on the entities they are connected to: the product they first viewed, the channel they came from, the referrer who invited them.
  • Feature engineering is the bottleneck. If your data science team spends 60% or more of their time writing SQL joins and aggregations, graph ML eliminates that step entirely. The Stanford study measured the cost: 12.3 hours and 878 lines of code per prediction task for experienced data scientists.

Low-value scenarios

Graph ML adds less value in these situations:

  • Single-table data. If your prediction depends on one flat table with well-defined features, XGBoost or a neural network will perform comparably with less infrastructure.
  • Extremely sparse graphs. If entities have very few connections (less than 2 edges per node on average), the graph structure carries minimal signal.
  • Real-time latency under 10ms. GNN inference on large graphs can take 50 to 200ms. If you need sub-10ms latency, you will need a pre-computed embedding approach or a simpler model.

Traditional ML approach

  • Flatten 10-50 tables into a single feature table
  • Engineer 100-500 features manually per use case
  • Miss multi-hop and network-effect signals
  • Rebuild pipeline for every new prediction task
  • 3-6 month cycle per model

Graph ML approach

  • Represent database as a graph directly
  • Model learns features from relational structure
  • Captures multi-hop, temporal, and network patterns
  • Same architecture handles any prediction task
  • Days to weeks per model with foundation models

Graph ML value by scenario

| Scenario | Traditional ML AUROC | Graph ML AUROC | Uplift | Why Graph Wins |
| --- | --- | --- | --- | --- |
| Multi-table relational | 62.44 | 75.83 | +13.4 pts | Cross-table patterns |
| Fraud with ring patterns | 71.2 | 78.4 | +7.2 pts | Network topology |
| Cold-start users | ~50 (random) | 65-75 | +15-25 pts | Neighbor signal propagation |
| Single flat table | 82-85 | 82-85 | ~0 pts | No structural advantage |
| Sparse graph (<2 edges/node) | 75-80 | 76-81 | +1-2 pts | Minimal graph signal |

Graph ML's advantage scales with relational complexity. On single-table data, XGBoost remains competitive.

Evaluating graph ML: the right benchmarks

Most ML benchmarks use single-table datasets (UCI, Kaggle competitions) where graph ML has no structural advantage. To evaluate graph ML properly, you need benchmarks designed for relational data.

RelBench: the standard benchmark

RelBench is the first standardized benchmark for ML on relational databases. Published at NeurIPS 2024 by researchers at Stanford and Kumo.ai, it includes 7 databases, 30 prediction tasks, and over 103 million rows. Each database has 3 to 15 interconnected tables. Tasks include classification (churn, fraud, conversion) and regression (lifetime value, demand forecasting).

The benchmark enforces temporal splits: training data comes before the evaluation period, and test data comes after. This prevents data leakage, which inflates accuracy in many published results. On RelBench, GNNs achieve an average AUROC of 75.83 on classification tasks, compared to 62.44 for LightGBM with features engineered by a Stanford-trained data scientist. KumoRFM zero-shot reaches 76.71, and fine-tuned reaches 81.14.
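A temporal split of this kind is simple to implement; the point is discipline, not code. A minimal sketch (timestamps illustrative):

```python
# Temporal split in the style RelBench enforces: train strictly on events
# before a cutoff date, evaluate strictly on events after it, so no future
# information leaks into training features or labels.
from datetime import date

events = [
    {"id": 1, "ts": date(2023, 1, 10), "label": 0},
    {"id": 2, "ts": date(2023, 5, 2),  "label": 1},
    {"id": 3, "ts": date(2023, 9, 15), "label": 0},
    {"id": 4, "ts": date(2023, 11, 1), "label": 1},
]

cutoff = date(2023, 8, 1)
train = [e for e in events if e["ts"] < cutoff]
test  = [e for e in events if e["ts"] >= cutoff]
```

Compare this with a random row-level split, which silently trains on the future and inflates reported accuracy.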

What to measure for your use case

Do not evaluate graph ML on aggregate metrics alone. The real value shows up in specific slices:

  • Cold-start entities. Measure accuracy on entities with fewer than 5 historical events. This is where graph ML's structural advantage is largest, often 15 to 30 percentage points of AUROC improvement.
  • Multi-hop signal tasks. Pick a prediction where the correct answer depends on information 2 or more hops away. For example, predicting whether a customer will return a product based on the return rates of similar products bought by similar customers.
  • Temporal dynamics. Measure on tasks where the pattern changes over time. Graph ML with temporal encoding captures shifts that static feature tables miss.

PQL Query

PREDICT COUNT(transactions.*, 0, 7) > 3
  AND AVG(transactions.amount, 0, 7) > 5 * AVG(transactions.amount, 0, 90)
FOR EACH accounts.account_id

Fraud detection via PQL. The model uses graph structure to identify accounts with anomalous transaction patterns relative to their network of connected merchants, devices, and counterparties.

Output

| account_id | fraud_risk | confidence | graph_signal |
| --- | --- | --- | --- |
| ACC-88291 | 0.94 | high | Connected to 3 flagged merchants |
| ACC-12047 | 0.08 | high | Normal pattern for network cluster |
| ACC-55103 | 0.82 | medium | New device + unusual merchant graph |
| ACC-37820 | 0.03 | high | Established pattern, trusted network |

Three paths to production graph ML

Enterprise teams have three viable paths to deploy graph ML, each with different trade-offs on control, speed, and required expertise.

Path 1: Build a custom GNN pipeline

Use PyTorch Geometric or DGL to build a graph neural network from scratch. You control every architectural decision: message passing layers, aggregation functions, attention mechanisms, training procedure.

Requirements: 2 to 3 ML engineers with GNN experience, 3 to 6 months for the first production model, GPU infrastructure for training. Pinterest and DoorDash took this path.

Best for: Organizations with deep ML teams, unique graph structures that differ significantly from standard relational databases, and prediction tasks where architectural customization matters.

Path 2: Use a graph ML platform

Managed platforms handle graph construction, training, and serving. You provide data and define the prediction task. The platform handles the GNN architecture, hyperparameter tuning, and deployment.

Requirements: 1 data scientist, 2 to 4 weeks for the first model. Infrastructure is managed.

Best for: Teams that want graph ML without building GNN expertise in-house. Good for standard relational prediction tasks (churn, fraud, recommendations).

Path 3: Relational foundation model

A pre-trained foundation model like KumoRFM that already understands relational patterns. You connect your database, define the prediction task in one line of PQL (Predictive Query Language), and get predictions. No training, no graph construction, no feature engineering.

Requirements: Any team member who can write SQL. Minutes to first prediction. Zero-shot predictions available immediately; fine-tuning takes hours for higher accuracy.

Best for: Organizations that want the accuracy benefits of graph ML without the infrastructure investment. Ideal for teams running many different prediction tasks across the same relational database.

Three paths compared

| Dimension | Custom GNN | Graph ML Platform | Foundation Model |
| --- | --- | --- | --- |
| Team Required | 2-3 GNN specialists | 1 data scientist | SQL-literate analyst |
| Time to First Model | 3-6 months | 2-4 weeks | Minutes (zero-shot) |
| First Model Cost | $500K-1M | $50K-150K | $5K-20K |
| Marginal Cost/Task | $50K-200K | $20K-50K | Near-zero |
| Architectural Control | Full | Limited | None |
| Best For | 1-2 unique models | 5-10 standard tasks | 10+ tasks at scale |

The right path depends on how many prediction tasks you need. For most enterprises running 5+ tasks, the foundation model path dominates on economics.
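A back-of-envelope version of that economics argument, using rough midpoints of the cost ranges in the table (illustrative figures, not quotes):

```python
# Rough cost comparison for 10 prediction tasks. Midpoints are taken from
# the ranges in the table above; treat them as order-of-magnitude only.
tasks = 10

custom_gnn = 750_000 + (tasks - 1) * 125_000  # first model + marginal tasks
platform   = 100_000 + (tasks - 1) * 35_000   # platform build + per-task cost
foundation = tasks * 12_500                   # per-task pricing, no base build
```

Even with generous assumptions for the custom path, per-task economics flip decisively toward the foundation model as task count grows.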

Deployment architecture for graph ML

Production graph ML systems have four components that differ from traditional ML deployments.

1. Graph construction layer

Your relational database needs to be represented as a graph. In custom pipelines, this means ETL jobs that extract entities and relationships and build adjacency matrices. For a 100-million-row database, initial graph construction takes 2 to 8 hours; incremental updates take minutes.
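The core of that ETL step is mechanical: every foreign-key reference becomes an edge. A minimal sketch with two hypothetical tables:

```python
# Turning two relational tables into a graph edge list via the foreign key
# (customer_id on orders). Table contents are made up for illustration.
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 1},
    {"order_id": 12, "customer_id": 2},
]

# Each foreign-key reference becomes one customer -> order edge.
edges = [(o["customer_id"], o["order_id"]) for o in orders]
```

Production pipelines do the same thing per foreign key across dozens of tables, plus timestamping each edge so the graph can be queried as of any point in time.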

Foundation models handle this automatically. KumoRFM reads your database schema, identifies entity types and relationships from foreign keys, and constructs the temporal graph on the fly.

2. Embedding computation

GNNs produce embeddings (dense vector representations) for each node. These embeddings encode the node's attributes and its graph neighborhood. For batch predictions, embeddings are computed offline and stored. For real-time predictions, you need a serving layer that can compute or retrieve embeddings in under 100ms.
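The batch-then-serve pattern looks like this in miniature (the "embedding computation" here is a stand-in for real GNN output; names are illustrative):

```python
# Offline: compute embeddings for all nodes and store them in a key-value
# map. Online: serving is a pure lookup, keeping request latency low.
embedding_store = {}

def batch_compute(node_ids):
    # Stand-in for running the GNN offline; real output is a dense vector.
    for nid in node_ids:
        embedding_store[nid] = [0.1 * nid, 0.2 * nid]

def serve(node_id):
    # Real-time path: no GNN inference on the request path, just a lookup.
    return embedding_store.get(node_id)

batch_compute([1, 2, 3])
```

In production the store is a low-latency key-value system rather than a dict, but the division of labor is the same: heavy computation offline, cheap retrieval online.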

3. Prediction serving

Batch predictions are straightforward: run the GNN overnight, store scores in a database, serve them through your existing application. Real-time predictions require a model serving infrastructure (Triton, TensorFlow Serving, or a custom gRPC service) that can handle graph lookups and GNN inference at request time.

Most enterprise deployments start with batch. Fraud detection is the primary use case that requires real-time serving. Recommendations, churn, and lead scoring typically run daily or hourly batches.

4. Graph update pipeline

Graphs change as new transactions, users, and interactions arrive. Your pipeline needs to handle incremental graph updates without rebuilding the entire graph. For custom GNNs, this is the hardest engineering challenge. For platform and foundation model approaches, updates are handled by the platform.
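The incremental-update idea can be sketched with an adjacency map: new events append edges rather than triggering a rebuild (a real pipeline also handles deletions, timestamps, and distributed storage):

```python
# Incremental graph updates: each new event is an O(1) edge append into an
# adjacency map, instead of rebuilding the whole graph.
from collections import defaultdict

adjacency = defaultdict(set)

def add_edge(src, dst):
    # Undirected for simplicity; production graphs are often typed/directed.
    adjacency[src].add(dst)
    adjacency[dst].add(src)

# Initial build from existing rows.
for src, dst in [("C1", "O1"), ("C2", "O2")]:
    add_edge(src, dst)

# A new order arrives: one incremental update, no full rebuild.
add_edge("C1", "O3")
```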

Production results across industries

The following results come from published case studies and benchmark results, not synthetic examples.

E-commerce and marketplaces

DoorDash: 1.8% engagement lift on recommendations across 30 million users using a heterogeneous graph of customers, restaurants, menu items, and delivery interactions. Pinterest: graph ML powers content recommendations for 450 million monthly active users, with the graph containing over 18 billion pins and 200 million boards.

Financial services

Visa: graph-based fraud detection processes billions of transactions, identifying fraud rings that transaction-level models miss. On the RelBench credit card fraud benchmark, GNNs achieve 78.4 AUROC compared to 71.2 for LightGBM with manual features, a 7.2-point improvement that translates to millions in recovered fraud losses.

B2B SaaS

Snowflake: 3.2x lift in expansion revenue prediction by modeling the graph of accounts, users, queries, datasets, and feature usage. Databricks: 5.4x conversion lift on lead scoring by incorporating the relationship graph between companies, contacts, product usage events, and support interactions.

Healthcare

On the RelBench clinical trial benchmark (15 tables, 2.3 million rows), GNNs predict adverse drug events with 12 points higher AUROC than flat models, by learning from the graph of patients, conditions, medications, and treatment protocols.

Common objections and honest answers

"Our data scientists do not know graph ML"

Foundation models remove this barrier. KumoRFM requires zero graph ML expertise. You write a prediction query in PQL, which looks like SQL with a PREDICT clause. If your team can write SQL, they can use a relational foundation model.

"We tried a knowledge graph and it did not work"

Knowledge graphs and graph ML are different things. Knowledge graphs are symbolic (RDF triples, SPARQL queries, ontology engineering). Graph ML is statistical (learned embeddings, message passing, gradient descent). The failure of a knowledge graph project says nothing about graph ML's viability.

"We cannot afford GPU infrastructure"

Custom GNN training requires GPUs. Foundation model inference does not require you to own GPUs because the model runs on managed infrastructure. KumoRFM's zero-shot predictions run in seconds without any training step.

"Our graph is too large"

GNN scaling has improved dramatically. Mini-batch training with neighbor sampling allows training on graphs with billions of edges using a single GPU. Pinterest trains on a graph with 18 billion pins. The RelBench benchmark includes datasets with 41 million rows. If your data fits in a relational database, it fits in a graph.
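Neighbor sampling is the key trick: instead of expanding a node's full neighborhood, which explodes combinatorially on billion-edge graphs, each training step samples a fixed fan-out per hop. A minimal sketch (fan-outs and graph are illustrative):

```python
# Fixed fan-out neighbor sampling for mini-batch GNN training: bound the
# subgraph size per seed node regardless of how dense the full graph is.
import random

def sample_neighbors(adjacency, seed, fanouts, rng):
    frontier = [seed]
    sampled = {seed}
    for fanout in fanouts:  # one fan-out per message-passing hop
        next_frontier = []
        for node in frontier:
            nbrs = adjacency.get(node, [])
            picked = rng.sample(nbrs, min(fanout, len(nbrs)))
            next_frontier.extend(picked)
            sampled.update(picked)
        frontier = next_frontier
    return sampled

adjacency = {"A": ["B", "C", "D"], "B": ["E"], "C": ["F", "G"]}
rng = random.Random(0)
subgraph = sample_neighbors(adjacency, "A", fanouts=[2, 1], rng=rng)
```

With fan-outs of [2, 1], the sampled subgraph is bounded at 5 nodes no matter how many neighbors "A" has in the full graph, which is what makes single-GPU training on billion-edge graphs tractable.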

Getting started: a 30-day evaluation plan

If you are considering graph ML for your organization, here is a practical evaluation plan that takes 30 days and requires no infrastructure investment.

  • Week 1: Identify 3 prediction tasks where your current models underperform or where feature engineering is the bottleneck. Prioritize tasks that depend on multi-table data.
  • Week 2: Run zero-shot predictions using a relational foundation model on all 3 tasks. Compare AUROC, precision, and recall against your production models.
  • Week 3: Fine-tune on the most promising task. Measure the accuracy improvement from fine-tuning and estimate the business impact using your existing conversion models.
  • Week 4: Build the business case. Calculate the TCO of your current pipeline (team time, infrastructure, maintenance) versus the foundation model approach. Include the time-to-value difference: months versus days.

The 30-day evaluation costs nothing beyond team time. If graph ML does not outperform your current approach on any of the 3 tasks, you have a definitive answer in a month. If it does, you have the data to fund a production deployment.

Frequently asked questions

What is graph ML and how does it differ from traditional ML?

Graph ML is a family of machine learning techniques that operate directly on graph-structured data, where entities are nodes and relationships are edges. Traditional ML requires flattening data into rows and columns. Graph ML preserves the relational structure, allowing models to learn from connectivity patterns, multi-hop relationships, and network effects that flat tables cannot represent.

Which companies use graph ML in production?

DoorDash uses graph ML for recommendations across 30 million users, achieving a 1.8% engagement lift. Pinterest uses it for content recommendations serving 450 million monthly active users. Visa uses it for fraud detection across billions of transactions. Snowflake uses it internally for expansion revenue prediction with a 3.2x lift over baseline models.

What data is required for graph ML?

Graph ML requires relational data with defined connections between entities. Any relational database qualifies: customers linked to orders, orders linked to products, products linked to categories. The minimum useful graph has two entity types and one relationship. Production graphs typically have 5 to 50 entity types and millions to billions of edges.

How long does it take to deploy graph ML in production?

Building a custom GNN pipeline from scratch takes 3 to 6 months for a team of 2 to 3 ML engineers. Using a pre-trained relational foundation model like KumoRFM reduces this to days, because the model already understands relational patterns and requires no graph construction, feature engineering, or model training.

Does graph ML replace existing ML systems?

Graph ML supplements or replaces the feature engineering and modeling layers of existing ML systems. Data pipelines, monitoring, and serving infrastructure remain. In practice, most enterprises run graph ML predictions alongside existing models and gradually shift traffic as they validate performance gains on their specific use cases.

See it in action

KumoRFM delivers predictions on relational data in seconds. No feature engineering, no ML pipelines. Try it free.