09/04/2022
Query the Future
Vanja Josifovski, Jure Leskovec, Hema Raghavan
As the founders of Kumo, we are excited to announce the V1 launch of the Kumo platform, alongside our $18 million Series B fundraise led by Sequoia Capital, with participation from existing and new investors including A Capital, SV Angel, Ron Conway, Michael Ovitz, Frank Slootman, Kevin Hartz, Clement Delangue, and Michael Stoppelman, among others.
Our vision is simple: to make it as easy for you to ‘Query the Future’ as it is to query the past using SQL on historical data today, all enabled by the cutting edge of deep learning innovation.
In this post, we explain in more detail the thinking behind founding Kumo.AI.
Status quo: Predictions as a scarce resource
Today, it is well known that better predictive analytics can drive massive improvements in top- and bottom-line business performance. Done right, it enables companies to be vastly more intelligent and proactive in their decision-making, both at the macro/company level and at the level of each business entity (customer, product, merchant, transaction, etc.). All three of us have seen this in practice at our prior employers: accurate predictions have been among the key drivers of the success of companies like Google, Pinterest, Airbnb, LinkedIn, and others.
Knowing this, companies spend millions of dollars storing terabytes of data in the hopes that they can leverage it for predictive analytics. Still, at the end of it all, they successfully use only a tiny fraction of that data to generate predictions. Countless predictive projects kick off with great intentions but drag on from weeks to months, ultimately never seeing the light of day in production. The reasons for this are varied and include:
- Custom predictive analytics projects require data scientists, especially given the complexity of existing ML platform tooling (for example, just count the number of features you have to learn to use a typical ML platform from one of the major clouds), yet data scientists remain extremely hard to hire and retain
- Training data is a bottleneck: labeling sufficient data is either impossible or still too expensive
- Deploying ML to production robustly requires duct-taping together a dozen tools, such as feature engineering pipelines, feature stores, retraining orchestration pipelines, model monitoring, experimentation tooling, etc. Each individual tool, as well as each point-to-point connection between tools, is error-prone, which makes getting the system to work end to end even harder.
- Every new ML task requires repetitive rework, essentially forcing you to create an entirely new ML pipeline each time, with no easy way to transfer the accumulated work, artifacts, and learnings from one ML pipeline to another
- Vertical AI SaaS offerings help initially, but they are frequently far too restrictive in the input assumptions they impose and the predictions/decisions they can generate, and are ultimately rarely able to scale to sustained, enterprise-wide use.
Thus, companies live in a world where predictions have to be treated as an immensely scarce resource, requiring a massive investment of time and budget, planned months in advance. The result is enormous lost value: only a very small subset of use cases ever incorporate predictions as an input, and only much later in the company's life cycle (when much of the potential impact of predictions might have already passed). For most decision-making, companies rely on (a) backward-looking analytics, which is noisy and error-prone, or (b) waiting until they can take a post-hoc, reactive approach, resulting in many missed opportunities.
Our vision: Going from scarcity to abundance
We believe it is possible to create a world in which the status quo paradigm of painful scarcity is turned on its head and replaced with one of rich prediction abundance, where:
- A single user, in a single sitting, could rapidly create a dozen different production-ready prediction pipelines, throw away three quarters of them because they weren’t quite what she wanted, and formally productionize the remainder
- Analysts complement their traditional backward-looking exploratory data analysis (EDA) with ‘predictive EDA’, making it standard practice to not only uncover existing historical trends in their data but also answer a 360-degree range of questions about how those trends will evolve into the future
- State-of-the-art and robust predictions, even where labeled data is sparse, become such an easily obtainable resource that the only bottleneck to adoption is the organization’s ability to absorb them into downstream workflows
Our vision is to create this future by bringing together several key capabilities:
Combining the power of Graph Neural Networks…
The first is the use of state-of-the-art Graph Neural Network (GNN) algorithms, made to scale out of the box, automatically, on enterprise data.
For context, neural networks are behind some of the most recent advances in AI, but until now they have been limited primarily to images, videos, and text. Their key advantage over traditional ML approaches is that they allow users to (a) learn rich representations directly and optimally from the raw data, without manual (and usually suboptimal) human feature engineering in the middle, and (b) transfer learnings between different use cases, especially from label-rich ones to label-poor ones, by sharing intermediate network layers between different ML tasks.
However, many important natural phenomena are not best described by images, videos, or text, but instead by graphs: a much more flexible data structure defined by a collection of nodes and the edges that connect those nodes. Examples of natural graphs in the real world include social networks, webpages and the links between them, transportation networks, electric grids, and more.
GNNs are an emerging family of algorithms that bring the advantages of deep learning to graphs. This enables ML problems defined on graphs to automatically take advantage of all the benefits of neural networks, such as generating predictions directly from the raw graph node and edge data (including with self-supervised training approaches) without any intervening feature engineering, data flattening, or aggregation; transferring learnings from label-rich use cases to data-sparse ones; and more.
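To make this concrete, here is a minimal sketch (a toy example, not Kumo's implementation) of a two-layer GraphSAGE model built with PyTorch Geometric that learns to classify the nodes of a small graph directly from raw node features and edge structure, with no manual feature engineering in between:

```python
# Minimal sketch: a two-layer GraphSAGE network in PyTorch Geometric (PyG)
# that classifies nodes of a toy graph from raw features and edges alone.
import torch
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.nn import SAGEConv

# Toy graph: 4 nodes with 8 raw features each, connected by 4 directed edges.
x = torch.randn(4, 8)                          # node feature matrix [num_nodes, num_features]
edge_index = torch.tensor([[0, 1, 2, 3],
                           [1, 2, 3, 0]])      # edges as (source, target) index pairs
y = torch.tensor([0, 1, 0, 1])                 # node labels for a binary prediction task
data = Data(x=x, edge_index=edge_index, y=y)

class GNN(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim, num_classes):
        super().__init__()
        # Each SAGEConv layer aggregates information from a node's neighbors,
        # so two layers let every node "see" its two-hop neighborhood.
        self.conv1 = SAGEConv(in_dim, hidden_dim)
        self.conv2 = SAGEConv(hidden_dim, num_classes)

    def forward(self, data):
        h = F.relu(self.conv1(data.x, data.edge_index))
        return self.conv2(h, data.edge_index)

model = GNN(in_dim=8, hidden_dim=16, num_classes=2)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
for _ in range(100):                           # standard supervised training loop
    optimizer.zero_grad()
    loss = F.cross_entropy(model(data), data.y)
    loss.backward()
    optimizer.step()
```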
As a side note, predictions made with graphs also have significant advantages independent of neural networks that can further improve accuracy, such as the ability to:
- Leverage the surrounding graph context for every prediction instead of treating each prediction as independent
- Ensure predictions made on connected nodes and edges are consistent with and inform each other (useful, for example, in crime detection: if multiple adjacent entities each carry a moderate risk of criminal activity, the fact that they interact with one another should further increase that predicted risk)
- …and more!
And finally, most enterprise data, even without any investment of time or effort to prepare and preprocess it, already has an inherent graph structure (think primary and foreign key relationships, the proximity of events in time and space, etc.). Thus, the relational nature of your existing raw enterprise data can be directly leveraged to create the enterprise Graph upon which GNNs are applied!
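As an illustration, here is a minimal sketch (with hypothetical table and column names) of how rows and foreign keys from a relational schema map onto the nodes and edges of a heterogeneous graph, again using PyTorch Geometric:

```python
# Minimal sketch: turning relational tables into a graph with PyG's HeteroData.
# Rows become nodes; primary/foreign key relationships become edges.
import pandas as pd
import torch
from torch_geometric.data import HeteroData

# Hypothetical dimension and fact tables, as exported from a warehouse.
customers = pd.DataFrame({"customer_id": [0, 1, 2], "age": [34, 28, 51]})
products = pd.DataFrame({"product_id": [0, 1], "price": [9.99, 24.50]})
purchases = pd.DataFrame({"customer_id": [0, 0, 2], "product_id": [1, 0, 1]})

data = HeteroData()
# One node type per table, with raw columns as node features.
data["customer"].x = torch.tensor(customers[["age"]].values, dtype=torch.float)
data["product"].x = torch.tensor(products[["price"]].values, dtype=torch.float)
# The fact table's foreign keys define the edges between node types.
edge_index = torch.tensor(purchases[["customer_id", "product_id"]].values.T)
data["customer", "purchased", "product"].edge_index = edge_index
```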
The result is that, from our prior experience, we have seen multiple examples of GNNs applied to enterprise data delivering significant benefits in the wild. For example:
- Vanja (as CTO) and Jure (as CSO) led the application of GNNs to personalization systems at Pinterest, driving a significant lift in the performance of downstream systems ranging from content recommenders to ad targeting.
- At LinkedIn, Hema led the application of GNNs to People You May Know, as well as a variety of other recommendation and ranking problems, also leading to significant improvements.
- Many of Kumo’s engineers have also successfully applied GNNs to personalization, abuse detection, fraud detection, and other mission-critical enterprise problems.
- Through our experience as founders of PyTorch Geometric (PyG), the most popular open-source framework for GNNs, we’ve also seen enterprises as varied as Spotify, AstraZeneca, Syntensor, Nvidia, Intel, and Graphcore successfully apply GNNs to their use cases
…with an end-to-end platform as easy to use as SQL
GNNs, although potentially deeply impactful in the enterprise setting, remain extremely difficult to use effectively in production. Challenges include properly mapping enterprise data to a graph’s nodes and edges in a way that is optimized for predictions, defining training data and data splits correctly (which works differently than in traditional supervised machine learning), choosing the right neural network architecture efficiently, scaling the graph to enterprise-scale datasets, taking advantage of the promise of transfer learning between tasks, and more.
In light of this, we decided to create a fully integrated end-to-end platform, covering all the major steps in the ML lifecycle that typically take a team of data scientists months to work through (data preparation, target engineering and training-example sampling, feature engineering, model architecture search, and deployment), in a way that optimally leverages GNNs (and avoids their quirks!) while simultaneously abstracting away their complexity under the hood.
And at the core of this platform, we created a new language for automatically orchestrating these entire workflows, called ‘Predictive Querying’. We designed it to look similar to SQL or even Excel formulas, with the main difference being that instead of applying the query to historical data, you apply it to future data that hasn’t even landed in your data warehouse yet.
For example, let’s say you have a typical e-commerce retailer’s data model, including dimension tables for customers and products, as well as fact tables for their purchases, views, searches, and other interaction events. In that case, here is how easy it would be for you to:
- Predict 30-day LTV for all of your customers:
entity: customers.customer_id
target: SUM(purchases.price, NEXT 30 DAYS)
- Predict what products each customer will buy over the next 7 days (for personalization):
entity: customers.customer_id
target: DISTINCT(purchases.product_id, NEXT 7 DAYS)
- Predict 90-day churn for currently active customers (at least one view in the last 30 days):
entity: customers.customer_id
temporal_entity_filter: COUNT(views.*, LAST 30 DAYS) >= 1
target: COUNT(purchases.*, NEXT 90 DAYS) = 0
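To make the semantics of these targets concrete, here is a minimal sketch (pandas only, with hypothetical column names, and not the Kumo implementation) of the value that the first query's target, SUM(purchases.price, NEXT 30 DAYS), corresponds to when computed over historical purchase data for a given anchor date:

```python
# Minimal sketch of the label behind SUM(purchases.price, NEXT 30 DAYS),
# computed over historical data. Hypothetical columns: customer_id, price, purchased_at.
import pandas as pd

def ltv_30d_labels(purchases: pd.DataFrame, anchor_date: pd.Timestamp) -> pd.Series:
    """Sum each customer's purchase prices in the 30 days after anchor_date."""
    window = purchases[
        (purchases["purchased_at"] > anchor_date)
        & (purchases["purchased_at"] <= anchor_date + pd.Timedelta(days=30))
    ]
    return window.groupby("customer_id")["price"].sum()

# Example usage with a toy purchases table.
purchases = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "price": [20.0, 15.0, 9.5],
    "purchased_at": pd.to_datetime(["2022-01-05", "2022-02-20", "2022-01-12"]),
})
print(ltv_30d_labels(purchases, pd.Timestamp("2022-01-01")))
# Customer 1 -> 20.0 (only the Jan 5 purchase falls in the window); customer 2 -> 9.5.
```

At prediction time, this is the quantity the platform forecasts for the window that hasn't happened yet; historically, it is the training label.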
This interface enables us to bring you numerous benefits that wouldn’t be possible with traditional ML platform approaches, including:
- Automatically setting up an entire ML pipeline for you from data preprocessing to production deployment (including model quality monitoring and regular retraining), and enabling you to change that entire pipeline with just small edits to the underlying query
- Incorporating best practices specific to GNNs (still not widely known) across every step of the ML workflow, enabling us to ensure you maximize predictive accuracy while minimizing infrastructure costs at all times
- Helping you train multiple queries together at once to obtain higher accuracy across all queries through mutual transfer learning
And most importantly, this means our users can, by setting up just a single data integration (creating a single graph), immediately tackle generating dozens of different predictions across different use cases, all in a single sitting.
Find out more
There’s much more to our platform and its specific functionality for enabling predictive analytics across your enterprise, which we go over on our website. We’re also already working with over a dozen customers, such as Whatnot and Yieldmo, in pilots and seeing great early wins. And of course, if you are interested in testing out what we can do for your enterprise, please reach out to us!