06/20/2023
Bringing Machine Learning to the Enterprise Lakehouse
Author: Ivaylo Bahtchevanov
Introduction
Kumo.ai presents an entirely new approach to performing machine learning at scale, one that drastically simplifies the process and accelerates time-to-value. The platform automates the end-to-end ML lifecycle, making it as easy to query the future with predictive ML as it has been to query the past with SQL.
With the partnership between Kumo and Databricks, enterprises can harness state-of-the-art machine learning directly where their data is stored. The Databricks Lakehouse Platform brings structured, semi-structured, and unstructured data together to support both data warehousing and AI use cases on a single platform.
Today, the lakehouse stores many types of relational data in the enterprise – logs, event streams, historical analytics, feature stores, and more. Traditionally, though, performing machine learning over relational data has been difficult for a number of reasons.
Challenges with Traditional Machine Learning
Data processing and feature engineering are complex, expensive, and error-prone
Building models using relational data spanning multiple tables requires many joins, which become very costly and time-consuming when processing data at scale.
What’s more, these models require a fixed-size input, meaning the data needs to be processed into a single-table training set. These transformations limit how the data can be represented and discard structural and contextual information. The additional feature engineering can also introduce bias, since every hand-crafted feature imposes further constraints and assumptions on the representation of the data.
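To make this concrete, here is a minimal sketch of the kind of flattening step such pipelines require, using pandas; the users, orders, and events tables and their columns are hypothetical:

```python
import pandas as pd

# Hypothetical raw tables; in practice these span many more tables and rows.
users = pd.read_parquet("users.parquet")    # user_id, signup_date, region
orders = pd.read_parquet("orders.parquet")  # order_id, user_id, amount, ts
events = pd.read_parquet("events.parquet")  # event_id, user_id, type, ts

# A traditional model needs one flat table, so every relationship becomes
# a join plus a hand-picked aggregate feature.
order_feats = orders.groupby("user_id").agg(
    order_count=("order_id", "count"), total_spend=("amount", "sum")
)
event_feats = events.groupby("user_id").agg(
    event_count=("event_id", "count"), last_event=("ts", "max")
)

# The flattened result discards fine-grained structure (which orders, in what
# sequence, tied to which events), exactly the signal a graph can preserve.
training_set = users.join(order_feats, on="user_id").join(event_feats, on="user_id")
```

Every aggregate here is a modeling decision made before training even begins, and each one throws away detail the model never gets to see.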
Pipelines and models are rigid, with limited adaptability to new predictions or use cases
Each new use case or prediction requires an entirely new data structure, dedicated feature engineering, and a dedicated pipeline and model to power the prediction. Creating new data pipelines, setting up the infrastructure, and building new models every time you have a new problem to solve is expensive and time-consuming. This limits the number of experiments a given company can perform and forces teams to adopt more generic models over highly personalized ones.
Operational complexity of the machine learning lifecycle
Models become more complex and expensive over time. People add features and seldom remove old ones, creating significant bloat and increasing the cost of maintaining pipelines. When a data scientist leaves, they often take the knowledge with them, resulting in abandoned pipelines. Moreover, deployment means duct-taping together multiple tools across the lifecycle for feature engineering, feature stores, retraining orchestration, monitoring, experimentation, and other production tooling.
Kumo: A New Way to Learn over Relational Data
Kumo brings to relational data the representation learning approaches that eliminated the need for extensive feature engineering and training set generation in computer vision and natural language processing.
Kumo works directly on raw enterprise relational data – no more building complex ML pipelines, generating training sets, running long feature engineering and processing cycles, or maintaining pipelines over time. By using graph representation learning, Kumo leverages the relational structure of the underlying data to learn directly from your tables, maximizing signal to improve accuracy and performance.
Under the hood, Kumo builds an enterprise graph from the relational data and provides a low-code abstraction that lets users define the ML task and generate ML predictions quickly and seamlessly.
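Kumo's internals are not shown in this post, but as a rough mental model, the sketch below shows how relational tables map onto a heterogeneous graph using the open-source PyTorch Geometric library; the table sizes, names, and features are illustrative assumptions:

```python
import torch
from torch_geometric.data import HeteroData

# Two hypothetical tables: 1,000 users and 5,000 orders, where each order
# row carries a user_id foreign key.
data = HeteroData()
data["user"].x = torch.randn(1000, 16)   # one node per row of the users table
data["order"].x = torch.randn(5000, 8)   # one node per row of the orders table

# Each foreign-key relationship becomes a typed edge set, so no join or
# flattening is needed; the structure itself is the input to the model.
user_of_order = torch.randint(0, 1000, (5000,))  # placeholder FK column
order_ids = torch.arange(5000)
data["user", "placed", "order"].edge_index = torch.stack([user_of_order, order_ids])
```

The key point is that foreign keys become edges rather than joins, so the relational structure is preserved for the model instead of being aggregated away.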
So what does this mean for your business?
- Deliver More Models: Build your graph once by connecting your lakehouse tables, then use it to generate any number of predictions for any number of use cases.
- Better Performance: Kumo leverages the latest approaches and identifies the best model and parameters for your specific problem and corresponding graph. Kumo can also boost existing models by feeding trained embeddings directly into them to improve accuracy (see the sketch after this list).
- Cheaper and Easier: Since Kumo operates directly on your raw tables, you simplify infrastructure and optimize costs – no need for ML pipelines, feature engineering, feature stores, and production tooling.
- Scalable and Performant: Kumo can rapidly train and predict on graphs at massive scale, up to tens of terabytes of data.
- Turn-Key: Kumo is a single platform that can manage your entire ML lifecycle, with the fastest time-to-ROI and payback.
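On the performance point, a common pattern is to treat graph-learned embeddings as additional features for a model you already run. Here is a minimal sketch with scikit-learn, where the embedding matrix is a stand-in for vectors exported from a trained graph model:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Placeholders: `embeddings` stands in for per-customer vectors exported from
# a trained graph model; `tabular` for the features your current model uses.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10_000, 64))
tabular = rng.normal(size=(10_000, 12))
labels = rng.integers(0, 2, size=10_000)

# Concatenate and train the established model unchanged; the embeddings carry
# the relational signal that flat features alone miss.
X = np.hstack([tabular, embeddings])
model = GradientBoostingClassifier().fit(X, labels)
```

The established model and its serving path stay exactly as they are; only its input grows by a few embedding columns.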
Common Applications
Kumo is currently used by enterprises to power their growth and GTM use cases. Build your graph once, and immediately query the future and generate highly accurate predictions for any of the following common applications:
- Optimizing customer loyalty and retention
- Personalizing experiences for users and recommending relevant product and content
- Powering cross-selling and up-selling strategies
- Predicting future purchases/activities and identifying potential high value customers
- Optimizing customer outreach and notifications strategies
- Performing entity resolution for search and retrieval
- Detecting fraud and abuse
Technical Requirements for Applications
To start building ML applications, users need several major components in place:
- Scalable and performant graph construction, which converts relational data from tabular form into an equivalent graph representation suitable for graph learning
- Storage optimized for heterogeneous graphs that supports efficient data loading and mini-batch generation (a sketch of mini-batch loading follows this list)
- MLOps that can support graph data and model management end-to-end
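As a sketch of the second requirement, here is what heterogeneous mini-batch generation looks like with PyTorch Geometric's NeighborLoader, continuing the hypothetical graph from earlier; this illustrates the technique, not Kumo's implementation:

```python
import torch_geometric.transforms as T
from torch_geometric.loader import NeighborLoader

# `data` is the HeteroData graph sketched earlier. Adding reverse edges lets
# the sampler reach orders starting from their users.
data = T.ToUndirected()(data)

# Sampling fixed-size neighborhoods means training never has to materialize
# the full enterprise graph in memory.
loader = NeighborLoader(
    data,
    num_neighbors=[10, 10],   # two-hop sampling, 10 neighbors per hop
    batch_size=128,
    input_nodes="user",       # seed each mini-batch on user nodes
)

for batch in loader:
    ...  # run a GNN forward/backward pass on each sampled subgraph
```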
Bridging the Gap with Databricks
The Databricks Lakehouse Platform provides many of the essential building blocks to quickly build ML applications using Kumo. With the new integration, users can take advantage of the following Databricks features directly under the hood when they use Kumo:
- A scalable and performant compute engine for graph construction (see the Spark sketch below)
- Lakehouse for storing and managing the data artifacts produced by Kumo, including the graph created from the data and the fully trained models
- An MLOps stack to streamline moving from exploration to production with GNN models
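As a rough sketch of the first building block: graph construction on Databricks reduces to scalable joins over lakehouse tables. The table names below are hypothetical, and this is not Kumo's actual integration code:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical Delta tables registered in the lakehouse.
users = spark.read.table("lakehouse.users")    # user_id, ...
orders = spark.read.table("lakehouse.orders")  # order_id, user_id, ...

# Each foreign-key relationship becomes a typed edge list. Spark performs the
# join at lakehouse scale, and the result lands back in Delta for the graph
# learning stage to consume.
edges = orders.join(users, on="user_id", how="inner").select("order_id", "user_id")
edges.write.format("delta").mode("overwrite").saveAsTable("lakehouse.user_order_edges")
```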
By running Kumo inside the Databricks instance, customers also enjoy the following benefits:
- Ease of control and access: Admins can easily view and configure which data gets shared with Kumo. Users can access Kumo via SSO.
- Data security and governance: Running Kumo as a native hosted app in Databricks ensures that data never leaves the Databricks environment. In addition, storing Kumo-produced data artifacts inside the lakehouse enables data lineage tracking.
What this means
With the latest partnership, it will be easier than ever to build new ML applications with zero additional overhead. Whether you're an advanced ML practitioner, an app developer, an analyst, or a line-of-business owner, you can leverage Kumo's low-code capabilities directly from your enterprise lakehouse for any use case.
If you’re interested in learning more, you can read the latest press release.