01/19/2023
Kumo: Amplify Your Machine Learning Workflow
Author: Ivaylo Bahtchevanov
As every data scientist, applied ML engineer, and research scientist knows, it takes a significant number of manual steps to go from a business problem with raw data to a fully operational production model. Kumo is here to help as little or as much as you need. Kumo’s platform is designed to supercharge and amplify the capabilities of ML teams, so they can focus on their most important tasks.
So, what can Kumo do for your team? It can either create and deploy an entire machine learning application, or generate embeddings for existing models. In this blog we’ll focus on how to make the most of Kumo’s embeddings, but below you can see all of the ways we can amplify the data scientist’s workflow.
Let’s take a closer look at each step of this journey, and how Kumo can play a role.
Ingest and Understand the Data
Manual feature engineering is by nature error-prone and time-intensive. It takes considerable effort to understand the problem space and the relevant data, and it often requires manually defining encoders over raw columns to infer their semantic meaning. If done incorrectly, this can introduce bias.
Similarly, if you manually preprocess the raw tables into graph format, you need to build extensive join pipelines. This is where the risk of temporal data leakage is introduced: if a feature aggregates events that happened after a training label’s timestamp, the model effectively gets to see the future. These pipelines also carry significant compute and storage costs.
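To make that leakage risk concrete, here is a minimal pandas sketch (with made-up table and column names) of the point-in-time discipline a manual join pipeline has to enforce:

```python
import pandas as pd

# Illustrative raw tables: events and training labels, both with timestamps.
transactions = pd.DataFrame({
    "user_id": [1, 1, 2],
    "amount": [10.0, 25.0, 40.0],
    "ts": pd.to_datetime(["2022-01-05", "2022-03-01", "2022-02-10"]),
})
labels = pd.DataFrame({
    "user_id": [1, 2],
    "label_ts": pd.to_datetime(["2022-02-01", "2022-03-01"]),
    "churned": [0, 1],
})

# Leaky version: aggregates ALL transactions, including ones that happened
# after the label's timestamp -- the model gets to peek at the future.
leaky_spend = transactions.groupby("user_id")["amount"].sum()

# Point-in-time version: only aggregate events strictly before each label.
joined = labels.merge(transactions, on="user_id")
safe_spend = (
    joined[joined["ts"] < joined["label_ts"]]
    .groupby("user_id")["amount"]
    .sum()
)
```

Every feature in every pipeline needs this treatment, which is exactly why doing it by hand is so error-prone.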
The Kumo platform, powered by a robust understanding of graphs and graph learning, directly leverages the relational structure of the entities in the data to build a single enterprise graph. This graph represents a comprehensive view of the dynamic interactions and relationships between the different entities in the raw data. This means no feature engineering. No manual preprocessing.
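For intuition, here is a rough sketch, not Kumo's internal representation, of how relational tables map onto a single heterogeneous graph, using PyG's HeteroData container (table names and sizes are illustrative):

```python
import torch
from torch_geometric.data import HeteroData

data = HeteroData()

# Each entity table becomes a node type; its columns become node features.
data["user"].x = torch.randn(100, 16)   # 100 users with 16 features each
data["item"].x = torch.randn(500, 32)   # 500 items with 32 features each

# Each interaction (or foreign-key) table becomes an edge type, stored as a
# [2, num_edges] tensor of (source, destination) index pairs.
src = torch.randint(0, 100, (1, 1000))  # purchasing users
dst = torch.randint(0, 500, (1, 1000))  # purchased items
data["user", "purchased", "item"].edge_index = torch.cat([src, dst], dim=0)

print(data)  # one graph spanning every table
```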
Instantly Generate High-Quality Embeddings from a Single Graph
The best embeddings are created with supervised learning approaches that are closely aligned with the downstream prediction task they feed into. Consequently, with traditional ML approaches, you must redefine shallow embeddings for each prediction problem based on its end goal.
The embeddings Kumo provides are generated from the attributes of each node in the graph as well as the surrounding nodes and edges. You can think of the data spanning your entire enterprise as a single graph – by leveraging the entire set of entities, relationships, and interactions, you can create an embedding that captures this information and learns the context around the entire data set, providing more signal than traditional shallow embeddings.
To generate high quality embeddings, just point the API to your tables (once!) and Kumo will do the rest: build an enterprise graph from your data and train state-of-the-art Graph Neural Networks (GNNs) on top of PyG, the leading graph learning framework. GNNs are highly effective at learning relationships across your entire dataset. You can read why in our graph learning blog.
Once you connect your data tables to Kumo, you can directly generate high-quality, fully optimized bespoke embedding vectors for any entity.
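The snippet below is purely illustrative pseudocode of that workflow; connect_table, build_graph, and generate_embeddings are hypothetical names standing in for the real SDK surface, which isn't shown in this post:

```python
import kumo  # hypothetical SDK import, for illustration only

# Hypothetical calls: point Kumo at your tables once...
users = kumo.connect_table(source="snowflake", table="USERS")
transactions = kumo.connect_table(source="snowflake", table="TRANSACTIONS")

# ...let it assemble the enterprise graph...
graph = kumo.build_graph(tables=[users, transactions])

# ...and pull optimized embedding vectors for any entity.
user_embeddings = graph.generate_embeddings(entity="user")
```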
The time-consuming, intensive processes of data preprocessing and feature engineering are entirely abstracted away and performed under the hood. Once Kumo builds the graph, you can either continue using the Kumo interface to write predictive queries and complete the ML process end-to-end, or you can take those optimized embeddings and use them directly in existing pipelines to turbocharge downstream applications.
Another benefit of generating deep embeddings using the graph is the ability to generate embeddings at inference time for new, unseen entities. Because the graph-generated embeddings pull in context from surrounding connections, you can make accurate predictions for entities with little to no historical data. See how Airbnb uses graph learning to overcome the cold-start problem when making predictions and recommendations for hosts that have just joined the platform here.
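To see why this works, here is a minimal PyG sketch (weights and sizes are stand-ins): an inductive encoder such as GraphSAGE computes a node's embedding from its own features plus its neighbors', so a brand-new node with just a few edges gets a meaningful vector without any retraining.

```python
import torch
from torch_geometric.nn import SAGEConv

conv = SAGEConv(in_channels=16, out_channels=8)  # pretend this layer is trained

x = torch.randn(6, 16)                           # 5 known nodes + 1 new node (index 5)
edge_index = torch.tensor([[0, 1, 2, 0, 1],      # source nodes
                           [1, 2, 3, 4, 5]])     # destinations; edge 1 -> 5 links
                                                 # the new node into the graph

embeddings = conv(x, edge_index)                 # embeddings for ALL nodes,
new_node_embedding = embeddings[5]               # including the unseen one
```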
Scaling Your Graph to Production Use Cases
Real-world datasets are very large, with billions of interactions leading to graphs containing terabytes of data. While traditional deep learning models operate on static examples to generate predictions, GNN training can be harder to scale out because the graph must be persisted and accessed throughout the training process. Most existing solutions do not provide vertical scalability for graph and feature storage out-of-the-box, so you would need to implement this yourself. Production pipelines also need to be robust enough to handle edge cases, and an infra team is required to monitor failures.
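To make the scaling challenge concrete: the standard workaround is mini-batch neighbor sampling, where each training step only touches a small sampled subgraph. A minimal PyG sketch with illustrative sizes:

```python
import torch
from torch_geometric.data import Data
from torch_geometric.loader import NeighborLoader

graph = Data(
    x=torch.randn(10_000, 16),                          # node features
    edge_index=torch.randint(0, 10_000, (2, 200_000)),  # random edges
)

loader = NeighborLoader(
    graph,
    num_neighbors=[15, 10],  # sample 15 first-hop and 10 second-hop neighbors
    batch_size=512,          # 512 seed nodes per mini-batch
)

for batch in loader:
    # Each batch is a small subgraph; only it must fit in accelerator memory.
    print(batch.num_nodes)
    break
```

Even with sampling in hand, you still have to store the graph, serve the features, and keep the whole pipeline healthy, which is where the engineering burden really lives.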
Kumo solves all of these problems for you, and in a fraction of the time. The platform utilizes best practices in both graph learning and deployment, modularizing the training process so that each component runs as an independently scalable service. Using the Kumo API, you receive the benefit of full horizontal scaling out-of-the-box.
Kumo converts your entire enterprise graph into embeddings out-of-the-box – this means each entity’s embedding is leveraging the full context of your data in a scalable manner. You can think of this as a “company-to-vec” approach. See how Pinterest used embeddings successfully in their paper and blog.
Lifecycle Management and Dynamic Graph Refresh
If your data is large and constantly evolving, it is important to ensure your embeddings are retrained to reflect the most up-to-date state of the world.
Kumo provides you with lifecycle management of your business graph. The graph will automatically refresh as new data comes in or as underlying tables change. This means you can directly pull the embeddings optimized on the latest, most up-to-date graph.
What’s more, because Kumo can generate quality embeddings at prediction time by leveraging the context of the surrounding subgraph, you can always have up-to-date embeddings on new data without retraining the entire system multiple times a day. The net-net here is that the time to updated models drops dramatically, along with the effort required.
Scaling to Turbocharge Many Downstream Workflows
In a typical ML architecture, defining more than one predictive problem on the same data can require a separate ingestion pipeline and data workflow for each problem.
Using Kumo, you define the graph once and then re-use it for any downstream problem. Kumo builds and maintains a single business graph that captures information across all of your tables, understands how the columns and entities are related to one another, and evolves with your data. You can replace dozens (dozens!) of feature engineering pipelines with one single pipeline that can be reused across many applications.
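As an illustration with random stand-in data, once the per-entity embeddings exist, each new prediction problem reduces to a thin model over the same vectors:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge

user_embeddings = np.random.randn(1000, 64)   # stand-in for graph-derived vectors

churn_labels = np.random.randint(0, 2, 1000)  # task 1: churn classification
ltv_targets = np.random.rand(1000) * 500.0    # task 2: lifetime-value regression

churn_model = LogisticRegression(max_iter=1000).fit(user_embeddings, churn_labels)
ltv_model = Ridge().fit(user_embeddings, ltv_targets)
# One embedding pipeline feeds both models -- and any future task you add.
```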
So what sort of problems can these embeddings solve?
These embeddings are ideal features for any downstream ML task: enriching feature sets for supervised ML pipelines, powering labeling loops, measuring similarity between entities, improving clustering and segmentation, similarity / nearest-neighbor search, and anomaly detection. You can read examples of how Kumo can power personalization and recommendation workflows here, and examples of how Kumo can be useful in predicting fraud and abuse here.
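For instance, similarity search over these embeddings reduces to a nearest-neighbor lookup; here is a minimal scikit-learn sketch on stand-in vectors:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

item_embeddings = np.random.randn(5000, 64)   # stand-in for Kumo-produced vectors

index = NearestNeighbors(n_neighbors=5, metric="cosine").fit(item_embeddings)
distances, neighbor_ids = index.kneighbors(item_embeddings[:1])  # query one item
print(neighbor_ids)  # the 5 most similar items (including the query itself)
```

At production scale you would typically swap the brute-force index for an approximate nearest-neighbor service, but the embeddings themselves stay the same.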
If you’re interested in learning more, you can request a demo here!