
04/07/2023

Scalable Graph Learning in Enterprise

Efficient GNN model training using Kubernetes and smart GPU provisioner

Authors: Dong Wang, Manan Shah, Arunkumar Eli

Motivation

Graph neural networks (GNNs) have emerged as a leading solution for machine learning (ML) applications, as many real-world problems and data can be effectively modeled as graphs. The development of tools like PinSage and the widespread adoption of open-source software such as PyG have significantly accelerated the uptake of graph machine learning, further increasing the popularity of GNNs. For a comprehensive overview of suitable use cases and key advantages of graph ML, please refer to this blog post.

However, building a production-grade graph ML platform poses significant challenges for data teams. Running a Software-as-a-Service (SaaS) at scale introduces considerable operational complexity and overhead due to the demands of executing models in the cloud. In addition to intensive compute and storage requirements, a platform must deliver fast response times to numerous concurrent customers while minimizing the cost of expensive hardware usage.

Kumo.AI empowers enterprise users to seamlessly leverage graph ML to develop, evaluate, and deploy state-of-the-art predictions in production within hours instead of months. Regardless of their machine learning experience, developers and analysts can directly leverage out-of-the-box predictions for use cases such as customer acquisition and retention, cross-selling, app notification timing, content personalization, next-best action prediction, entity resolution, and more. Advanced ML practitioners can delve deeper into problem setup and modeling customization based on their domain knowledge and can extract high-quality vector embeddings for downstream applications.

The Kumo platform automates all significant steps in a typical ML pipeline, including target label engineering, feature engineering, model architecture definition, AutoML search space definition and tuning, and model deployment. This is achieved through a declarative interface for defining Predictive Queries optimized for GNN specific modeling approaches. The platform, which features a graphical UI, REST API, and Python SDK interfaces, is available as a managed SaaS offering with no hardware or software to install, configure, manage, or maintain.
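
To give a feel for this declarative style, the snippet below sketches how a prediction target might be expressed; the class and query syntax here are illustrative placeholders rather than the exact Predictive Query interface.

```python
# Illustrative placeholder only: this is not the real Kumo SDK or Predictive
# Query syntax, just a sketch of the declarative idea. From a single target
# declaration like this, the platform derives the target labels, features,
# model architecture, and AutoML search space automatically.
from dataclasses import dataclass

@dataclass
class PredictiveQuery:
    name: str
    query: str  # declarative target definition

pquery = PredictiveQuery(
    name="purchase_volume_30d",
    query="PREDICT COUNT(transactions.*, 0, 30, days) FOR EACH customers.customer_id",
)
print(pquery.query)
```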

Users start by defining a Kumo Graph, representing their business operations with dimension and fact tables linked together based on primary and foreign key relationships. Users can easily ingest these tables through built-in connectors to AWS S3, Redshift, Snowflake, and other common data sources. Kumo provides data validation and visualization functionality to prevent data issues and offers dedicated preprocessing options for different column semantic types (numerical, categorical, timestamp, ID, text, etc.). Kumo Graphs can scale to terabytes in size and tens of billions of rows across numerous linked tables.
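
A minimal sketch of what assembling such a graph could look like from a Python client is shown below; the classes and method names are hypothetical stand-ins, not the actual SDK, and only illustrate the primary/foreign key linking described above.

```python
# Hypothetical sketch of graph assembly over connected tables. These classes
# are not the real Kumo SDK; they illustrate how dimension and fact tables
# are linked through primary and foreign keys.
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class Table:
    name: str                        # e.g. a table ingested from S3, Redshift, or Snowflake
    primary_key: str
    time_column: Optional[str] = None

@dataclass
class Graph:
    tables: Dict[str, Table] = field(default_factory=dict)
    edges: List[Tuple[str, str, str]] = field(default_factory=list)  # (fact_table, fkey, dim_table)

    def add_table(self, table: Table) -> None:
        self.tables[table.name] = table

    def link(self, fact_table: str, fkey: str, dim_table: str) -> None:
        # A foreign key on the fact table points at the dimension table's primary key.
        self.edges.append((fact_table, fkey, dim_table))

g = Graph()
g.add_table(Table("customers", primary_key="customer_id"))
g.add_table(Table("transactions", primary_key="txn_id", time_column="txn_time"))
g.link("transactions", fkey="customer_id", dim_table="customers")
```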

Kumo’s Platform Architecture

Kumo’s platform architecture is designed around a central orchestration engine featuring dynamic task queues, enabling horizontal scalability and the ability to orchestrate concurrent execution of thousands of stateful workflows and activities. The architecture consists of three main components: the control plane, the data plane, and elastic compute clusters.

Figure 1: A high-level representation of the Kumo distributed architecture.

The control plane, located at the top of Figure 1, is composed of a set of microservices responsible for the highly concurrent execution of lightweight, short-duration tasks related to business logic. This includes a metadata manager and security/access controls. The control plane initiates and monitors workflow execution, manages and serves metadata, and captures state information about job results.

The data plane, represented as the cache in Figure 1, provides a cache storage for data, metadata, states, and artifacts. The control plane and the elastic compute clusters both read from and write to the data plane.

The elastic compute clusters are responsible for hosting long-running, compute-intensive workloads for each query executed on the platform. These workloads encompass data ingestion and processing jobs, the construction of Kumo’s internal graph representation, and our auto-scaled Kubernetes (EKS) and Spark (EMR) environments, along with their associated distributed and scalable feature and graph stores. This configuration allows us to achieve on-demand autoscaling of our elastic compute clusters within minutes.

Operationalizing Large-Scale Machine Learning in the Public Cloud

In this article, we’ll focus on the model training and execution part of the Kumo platform, which primarily utilizes the data plane and elastic compute clusters in Figure 1. We’ll walk through the key challenges faced when designing and implementing a scalable, distributed training pipeline able to operate on graphs with billions of nodes and edges, and how our architecture solves them for customers with low costs and high throughput.

The Challenges

The design and development of Kumo’s distributed graph learning architecture were guided by three fundamental design principles, centered on best supporting customer needs. In particular, we strove to achieve:

  1. Scalability to support training hundreds of models in parallel at high throughput, in order to achieve efficient AutoML optimization
  2. Reliable generation of batch prediction outputs for downstream customer use-cases
  3. Swift and dynamic resource provisioning based on customer needs while minimizing idle hardware utilization

Furthermore, we aimed to support the above capabilities in a cost-effective manner. Along with the traditional challenges of scaling infrastructure and training machine learning models on large data, we faced two challenges specific to deploying such platforms in the public cloud:

  1. Operational complexity: As a SaaS company, our model training and inference workloads can vary significantly based on data size, computation cost, and customer training and prediction schedules. Because proper resource sizing cannot be determined and preconfigured in advance, our design must still deliver fast model training and inference for hundreds or thousands of customers.
  2. Cost complexity: Training GNNs demands substantial memory and compute resources, with training time scaling directly with graph size. While GPUs offer high efficiency for training due to their parallel computation capabilities, the instances hosting GPUs are expensive to use in the public cloud. Ensuring cost-effective machine learning solutions for our customers is crucial.

Kumo’s Solution: A Distributed, Dynamic Graph Learning Pipeline

At Kumo, we achieve goals (1)-(3) while reducing operational and cost complexity by carefully designing our cloud and GNN training infrastructure.

Cloud Infrastructure

Our solution at Kumo.AI consists of two major components: the Kubernetes orchestration engine and the Smart Node Provisioner.

  1. Kubernetes Orchestration Engine: Kubernetes forms the backbone of Kumo’s infrastructure, orchestrating training and prediction jobs. It offers benefits such as packaging training code and dependencies as Docker containers, using Kubernetes Jobs for transient trainer pods, and leveraging node scaling mechanisms to avoid idle virtual machines (a minimal sketch of launching such a transient trainer Job follows this list).
  2. Smart Node Provisioner: A key design consideration for an enterprise-scale product is the ability to quickly provision compute resources on demand. The Smart Node Provisioner ensures the efficient allocation of resources based on user requirements.
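
To make the first point concrete, here is a minimal sketch of launching a transient GPU trainer as a Kubernetes Job using the official Kubernetes Python client; the image, namespace, and node labels are placeholders rather than Kumo's actual configuration.

```python
# Sketch: launching a transient trainer pod as a Kubernetes Job via the
# official Python client. Image, namespace, and node selector values are
# placeholders; the resource limit illustrates pinning a single GPU.
from typing import List
from kubernetes import client, config

def launch_trainer_job(job_name: str, image: str, args: List[str]) -> None:
    config.load_kube_config()  # or config.load_incluster_config() when run inside the cluster
    container = client.V1Container(
        name="trainer",
        image=image,
        args=args,
        resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
    )
    pod_spec = client.V1PodSpec(
        restart_policy="Never",
        containers=[container],
        node_selector={"accelerator": "nvidia-gpu"},  # placeholder node label
    )
    job = client.V1Job(
        metadata=client.V1ObjectMeta(name=job_name),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(spec=pod_spec),
            backoff_limit=2,
            ttl_seconds_after_finished=300,  # garbage-collect the finished pod
        ),
    )
    client.BatchV1Api().create_namespaced_job(namespace="training", body=job)

# Example: launch_trainer_job("gnn-trainer-001", "example.com/gnn-trainer:latest",
#                             ["--job-config", "s3://bucket/config.json"])
```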

GNN Training Infrastructure

Kumo’s production-ready, scalable graph neural network pipeline is built upon an elastic, heterogeneous compute architecture powered by the aforementioned cloud infrastructure. It delivers high-throughput training and low-latency inference on large-scale input data, operating at a scale of tens of billions of nodes and over 100 billion edges. Our approach to addressing the complexities associated with scaling graph machine learning involves modularizing the training process; specifically, we separate each component of the graph learning pipeline as a scalable service.

We implement a combination of an in-memory graph store and an on-disk feature store to scale up graph learning at Kumo. In this architecture, the graph store is solely responsible for graph storage and sampling, the feature store handles feature fetching, and the trainer manages the forward/backward pass on each training example (sampled subgraph joined with node and edge features). The in-memory graph store is optimized for rapid access of a node’s neighbors and subgraph sampling, while the on-disk feature store is optimized for high data volume and high feature fetch throughput. The overall interaction between the three modularized components is depicted in Figure 2: the compute node obtains sampled subgraphs from the graph store, performs feature lookups for the nodes in the sampled subgraph in the feature store, and joins these data to produce minibatches for execution in the trainer node.

Figure 2: The interaction between the feature store, graph store, and trainer (compute) node to power a distributed graph learning pipeline.
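
The sketch below mirrors this division of labor in simplified form; it is not Kumo's implementation, and the memmap-backed feature file, sampling logic, and class names are illustrative assumptions.

```python
# Illustrative sketch only: an in-memory graph store for neighbor sampling,
# an on-disk feature store (backed here by a numpy memmap), and a join step
# that assembles minibatches, mirroring the split in Figure 2.
import numpy as np

class InMemoryGraphStore:
    """Adjacency lists held in RAM, optimized for neighbor lookups and sampling."""
    def __init__(self, edges, num_nodes):
        self.adj = [[] for _ in range(num_nodes)]
        for src, dst in edges:
            self.adj[src].append(dst)

    def sample_subgraph(self, seed_nodes, fanout=10):
        rng = np.random.default_rng(0)
        nodes = set(seed_nodes)
        for n in seed_nodes:
            nbrs = self.adj[n]
            if nbrs:
                k = min(fanout, len(nbrs))
                nodes.update(rng.choice(nbrs, size=k, replace=False).tolist())
        return sorted(nodes)

class OnDiskFeatureStore:
    """Memory-mapped feature matrix, optimized for high-volume feature gathers."""
    def __init__(self, path, num_nodes, dim):
        self.feats = np.memmap(path, dtype=np.float32, mode="r", shape=(num_nodes, dim))

    def fetch(self, node_ids):
        return np.asarray(self.feats[node_ids])

def make_minibatch(graph_store, feature_store, seed_nodes):
    """Compute node: sample a subgraph, fetch its features, join into one batch."""
    nodes = graph_store.sample_subgraph(seed_nodes)
    return {"nodes": nodes, "x": feature_store.fetch(nodes)}

# Example wiring (writes a tiny feature file, then assembles one minibatch):
# np.memmap("feats.bin", dtype=np.float32, mode="w+", shape=(100, 16))[:] = 1.0
# gs = InMemoryGraphStore([(i, (i + 1) % 100) for i in range(100)], num_nodes=100)
# fs = OnDiskFeatureStore("feats.bin", num_nodes=100, dim=16)
# batch = make_minibatch(gs, fs, seed_nodes=[0, 1, 2])
```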

This architecture allows us to independently scale each component of our pipeline based on data size and bottlenecks. It is also highly cost-efficient, as each service can utilize instances specialized for its purpose. For instance, our training node is a GPU instance with the memory needed to store the minibatches it processes in each iteration. Our feature store uses an instance with attached SSDs and large disk capacity, and our graph store employs a memory-optimized instance for fast random access in DRAM. Alongside this functionality, Kumo provides MLOps tooling, training orchestration, and model management to support the end-to-end ML workflow.

Scaling Compute via Parallel Training

The separation of the graph structure and features from the compute (accelerator) training node in Figure 2 allows us to scale both components independently. Doing so further enables Kumo to run numerous training jobs in parallel with minimal resource requirements on the compute nodes by sharing feature and graph stores across training jobs. These trainers independently communicate with the feature and graph stores through a shared minibatch fetcher, which fetches and caches the mini-batches requested by trainers.
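
A minimal sketch of such a shared fetcher is shown below, reusing the illustrative make_minibatch helper from the earlier sketch; the LRU cache and key scheme are assumptions, not a description of Kumo's actual caching strategy.

```python
# Sketch of a shared minibatch fetcher: trainers request batches by key, and
# the fetcher caches assembled minibatches so that concurrent training jobs
# do not repeatedly hit the feature and graph stores. The LRU policy and
# key scheme are illustrative assumptions.
from collections import OrderedDict

class MinibatchFetcher:
    def __init__(self, graph_store, feature_store, capacity=256):
        self.graph_store = graph_store
        self.feature_store = feature_store
        self.cache = OrderedDict()   # key -> assembled minibatch
        self.capacity = capacity

    def get(self, seed_nodes):
        key = tuple(seed_nodes)
        if key in self.cache:
            self.cache.move_to_end(key)          # refresh LRU position
            return self.cache[key]
        batch = make_minibatch(self.graph_store, self.feature_store, list(seed_nodes))
        self.cache[key] = batch
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)       # evict least-recently-used entry
        return batch
```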

Parallel training (e.g. as part of AutoML search space exploration) is further enhanced by autoscaling trainers with an intelligent selection of GPU instance types. The Compute Engine employs a lightweight driver that manages a Kubernetes cluster of trainer instances. When the AutoML searcher requests a new job, the driver launches a trainer on-demand, selecting the type of instance based on the training job configuration.
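
As a toy illustration of that selection step, a driver might map a job's estimated resource needs to an instance type roughly as follows; the thresholds and instance names are placeholders, not Kumo's actual selection logic.

```python
# Toy heuristic for picking a GPU instance type from a training job's
# configuration. Thresholds and instance names are placeholders only.
def select_instance_type(estimated_gpu_mem_gb: float, num_gpus: int = 1) -> str:
    if num_gpus > 1 or estimated_gpu_mem_gb > 32:
        return "p4d.24xlarge"    # large multi-GPU training
    if estimated_gpu_mem_gb > 14:
        return "g5.4xlarge"      # single 24 GB GPU
    return "g4dn.xlarge"         # small single-GPU job

# Example: select_instance_type(estimated_gpu_mem_gb=20) -> "g5.4xlarge"
```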

Figure 3: A depiction of an efficient parallel training scheme utilizing a “minibatch fetcher” to limit thrashing of the feature or graph store while maintaining high downstream training throughput.

The resulting pipeline is highly scalable and cost-efficient, reducing expenses by orders of magnitude. Furthermore, it offers significant flexibility, allowing new machine learning approaches to be integrated with minimal additional overhead. Each compute node is also autoscaled on demand to support concurrent users and jobs.

Further Work

Our current solution is robust and can handle hundreds of concurrent training jobs, scaling quickly to meet execution demand. The next phase of the Kumo platform will be designed to expand to thousands of concurrent jobs. To make this happen, we will need to guarantee the availability of compute resources (especially GPUs) across multiple regions and zones. The focus of future development will be to incorporate Karpenter on AWS and expand support in other areas, including:

  1. Multi-Cloud and Multi-Cluster Pattern: we need a flexible job dispatcher that operates across multiple regions and clouds. This introduces an additional layer of complexity when running a job, as Karpenter must now determine the optimal GPU and cluster to use.
  2. Proactive Autoscaling: we plan to enhance our autoscaling system to provide an improved user experience. Our redesign will scale resources well before a machine learning workload is triggered, guaranteeing timely availability while maintaining low utilization costs. Additionally, we will offer real-time visualizations of training progress for increased transparency.
  3. Consistent and Manageable Platform: we need to continue to deliver consistency even as we expand across new use cases, algorithms/models, and users. A well-designed job controller and job monitor will enable our platform to execute machine learning jobs on predictable hardware between iterations. Frequent checkpoints and a failover mechanism will ensure enhanced reliability.

Experience Kumo AI Cloud Today

Kumo AI Cloud is built with security and reliability at its core, providing a trusted platform for businesses to harness the power of AI. By choosing Kumo AI Cloud, you can be confident that you are using a secure, reliable, and high-performance platform designed to meet the needs of today’s AI-driven businesses. Experience the difference with Kumo AI Cloud – request a demo or sign up for a free trial today here.

Special thanks to Ivaylo Bahtchevanov and Hema Raghavan for reviewing this blog.