| Problem | Type |
|---|---|
| Age Prediction | Regression |
| Prediction of Movie Ratings | Regression |
| Lifetime Value | Temporal Regression |
| Active Purchasing Customer LTV | Temporal Regression |
| Active Purchasing Customer Churn | Temporal Binary Classification |
| Fraud Detection | Temporal Binary Classification |
| Next Item Category Prediction | Temporal Multiclass Classification |
| Probability of Liking an Item | Non-Temporal Classification |
| Item to Item Similarity | Static Link Prediction |
| Top 25 Most Likely Purchases | Temporal Link Prediction |
| Item Recommendation | Temporal Link Prediction |
| Top 25 “High Value” Purchases | Temporal Classification/Ranking |
- `customers`: containing one row for each of your customers.
- `transactions`: containing one row for each transaction (purchase) made by each customer.
- `articles`: information about the products that the customer purchased.

The target column `age` is a number, so this is a regression task. When the query is executed and a `RandomSplit` is used, Kumo will randomly group the customers into train/val/test splits according to an 80/10/10 ratio. Model training will happen on the train and val splits, while evaluation will happen on the test split.
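For reference, a query of this shape might look like the sketch below. This is an illustrative reconstruction rather than the exact query from the example, and it assumes `customer_id` is the primary key of the `customers` table (as in the later examples).

```
PREDICT customers.age
FOR EACH customers.customer_id
```

Because `age` involves no time-based aggregation, the query is non-temporal and uses the random 80/10/10 split described above.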
- `user`: profile details like userId, age, gender, and other relevant features.
- `movie`: information such as movieId, genre, director, release year, and additional attributes.
- `rating`: contains ratingId, userId, movieId, and rating (1-5). Ratings are explicit if present; if null, they represent prediction opportunities for the model.

The `PREDICT` function refers to the column to be predicted. The `FOR EACH` clause indicates your intention to create predictions for each individual rating entry (combination of user and movie) in the data set.
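As a rough sketch (not necessarily the exact query used in this example), a rating-prediction query could take the following form, predicting the `rating` column for each `ratingId`:

```
PREDICT rating.rating
FOR EACH rating.ratingId
```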
You train your model on the subset of data where users have rated movies, enabling it to learn the relationships between users’ characteristics, their past rating patterns, the attributes of the movies, and the given ratings.
Prediction Output
The end result is a structured table, where each row represents a distinct pairing of a user and a movie, together with a `ratingId`, complemented by a `rating`. This rating is the model’s prediction of how likely a user is to appreciate a movie, informed by the historical data and the inferred preferences from users with similar tastes and movie profiles.
The entity of this query is `CUSTOMERS` and the target is the `SUM` of `TRANSACTIONS` over the next 30 days.
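A minimal sketch of such an LTV query, built from the target expression that appears in the prediction output below, might be:

```
PREDICT SUM(transactions.price, 0, 30)
FOR EACH customers.customer_id
```

Here `SUM(transactions.price, 0, 30)` aggregates each customer's transaction prices over the 30 days following the anchor time.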
Data
This predictive query is running on the following data schema:
- `customers`: containing one row for each of your customers.
- `transactions`: containing one row for each transaction (purchase) made by each customer.
- `articles`: information about the products that the customer purchased.

Because the target is a time-based aggregation (`SUM()`), Kumo interprets this as a temporal predictive query. This means that Kumo will generate a set of train/val/test splits that cover the time range of the target table `transactions`. In this case, the target table has a time range of Sep 20, 2018 to Sep 22, 2020. Training examples will be generated by computing the 30-day spend of users at various points in time during this range.
This table will be used as labels for training a GNN model. Because the target of this query, `SUM()`, is a real-valued number, the task type of this query is regression, and the platform will export the standard evaluation metrics for regression, as described here.
Prediction Output
The batch prediction output is a table with two columns: `customer_id` and the predicted `SUM(transactions.price, 0, 30)`. Predictions will be made for all customers in the `customers` table, assuming an anchor time of September 22, 2020 (unless otherwise specified). This means that the model will predict the total spend of each customer from September 22, 2020 until October 22, 2020.
- `customers`: containing one row for each of your customers.
- `transactions`: containing one row for each transaction (purchase) made by each customer.
- `articles`: information about the products that the customer purchased.

As in the previous example, the default anchor time is September 22, 2020.
- `customers`: containing one row for each of your customers.
- `transactions`: containing one row for each transaction (purchase) made by each customer.
- `articles`: information about the products that the customer purchased.

Because `PREDICT COUNT() = 0` is a true/false prediction per customer, Kumo treats this as a binary classification task. Because `COUNT` is an aggregation over time, this is treated as a temporal query, so it follows the same train table generation procedure that was described in the LTV example.
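A sketch of the churn query, built from the target expression shown in the prediction output below, could look as follows. Note that the actual query also contains a `WHERE` condition restricting predictions to active purchasing customers; its exact form is not shown here.

```
PREDICT COUNT(transactions.*, 0, 90, days) = 0
FOR EACH customers.customer_id
```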
Prediction Output
The batch prediction output is a table with two columns: `customer_id`, and a score containing the probability of `COUNT(transactions.*, 0, 90, days) = 0`. A prediction is made for every single customer that matches the `WHERE` condition at the `anchor_time` of batch prediction, which is September 22, 2020 by default.
- `Users`: profile details like User ID, age, gender, and other relevant features.
- `Transaction`: information such as transaction ID, fraud report ID, transaction type, transaction date, transaction amount, ID of the sender and receiver, and additional attributes.
- `Fraud Reports`: fraud report ID, report timestamp, the label, and other possible information.

The filter `WHERE transaction.type = "bank transfer"` will drop any training samples where the value of column `type` in table `transaction` is not `"bank transfer"`.
The data is split with a `TemporalSplit` according to the time column of each entity (i.e., transaction). When making a prediction, the predictive model will only access data with smaller or equal timestamps.
Prediction Output
The end result is a structured table, where each row represents a transaction without a fraud report and its corresponding predicted fraud label, complemented by a probability of a true and a false label. These scores are the model’s prediction of how likely a transaction is to be fraudulent, informed by the historical transactions and their labels. Transactions that do not meet the condition `transaction.type = "bank transfer"` will not be included.
- `Users`: profile details like User ID, age, gender, and other relevant features.
- `Purchase`: information such as User ID, Item ID, purchase date, purchase value, purchase type, and additional attributes.
- `Items`: item ID, item description, item size, and other relevant features.

The target is the first purchase type (`FIRST`) over the next seven days. When generating the training data, Kumo will first determine which past timestamps to generate the training data for. These are so-called “anchor times”. In this case, it will take a timestamp every seven days from the latest timestamp in the purchase history to the earliest. So, if the last purchase was made on 2019-12-31, the following anchor times will be generated: 2019-12-24, 2019-12-17, 2019-12-10, …
For each pair (user, anchor time), a label will be computed: the first purchase type that this user made in the seven days following the anchor time.
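As an illustration, the query might take roughly the following shape. The snake_case column names (`purchase_type`, `user_id`) are assumptions derived from the schema description above and may differ in your data.

```
PREDICT FIRST(purchase.purchase_type, 0, 7, days)
FOR EACH users.user_id
```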
The `FIRST` aggregation is undefined if a user made no purchases in the seven days following the anchor time; such examples are automatically dropped unless you use the `IS NULL` filter. All future predictions are thus made under the assumption that a user will buy something. This is different from a `COUNT` aggregation, where “no purchases” would simply be given the value 0.

The generated data will by default be split with an `ApproxDateOffsetSplit`, which will put the final seven days of data (i.e., the last anchor time) into the test set, the second-to-last seven days into validation, and the rest into the training set. It is important to be mindful of this when defining the aggregation horizon. Using a very large prediction horizon might result in insufficient data for the split: at least 3 anchor times are required for a healthy train/val/test split.

On the other hand, using a very small aggregation window on a dataset with a long history might result in far too many generated examples, slowing down the training. To prevent this, use the `train_start_offset` parameter in the model plan. For example, `train_start_offset: 300` means that only the last 300 days are used for label generation.
During training, the model will only be able to access rows in the database with a timestamp smaller than the timestamp of the training sample in question, guaranteeing no data leakage.
Prediction Output
When making predictions for the future, you need to specify the batch prediction anchor time, i.e., the point in time from which the prediction is made. By default, this will be the largest timestamp in the purchase table. The predictions are then made for one aggregation window (seven days) from that point onward, one for each entity.
The end result is a structured table, where each row represents a prediction for a user with their corresponding predicted next purchase category, complemented by a probability of each class. These scores are the model’s prediction of how likely the user is to buy from that category next.
This query runs on a table `customer_item_pairs`, which contains the “likes” signal. `customer_item_pairs.likes` would be a binary column with possible values of `true`, `false`, and `null`.
Since `customer_item_pairs` does not contain a time column, this is a non-temporal query. Kumo will therefore generate train/val/test splits by randomly distributing the rows of the `customer_item_pairs` table according to an 80/10/10 ratio. The label of the training table is defined by `customer_item_pairs.likes = 1`, and all rows where `likes IS NULL` are dropped.
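A minimal sketch of this non-temporal query, assuming the target can be written directly as the label definition above, might be:

```
PREDICT customer_item_pairs.likes = 1
FOR EACH customer_item_pairs.uuid
```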
Prediction Output
The batch prediction output is a table with two columns: `customer_item_pairs.uuid`, and a score containing the probability that `customer_item_pairs.likes = 1`. A prediction is made for every single `uuid` where `customer_item_pairs.likes` is null. In simpler terms, it makes a prediction for every single value that is missing.
- `source item` table
- `destination item` table
- `co-purchase` table between different items (e.g., pre-processed via Kumo views based on past purchase data)

The target is an aggregation (`LIST_DISTINCT`) without any window size, which corresponds to a static link prediction task. The `RANK TOP 10` clause dictates that we are interested in exactly 10 similar items per item. The generated data will by default be split with a `RandomOffsetSplit` into 80% training, 10% validation, and 10% test co-purchases.
For static link prediction tasks, two different module types are exposed, controlled via the `handle_new_target_entities` model planner option. By default, Kumo assumes a transductive link prediction setting for best model performance. Inductive link prediction usually results in somewhat lower model performance (since you cannot train corresponding lookup embeddings), but may be inevitable depending on the use case.
In order to emphasize the learning of cold-start items, you may additionally consider discarding the co-purchase table from being used inside the model. This can be achieved via the `max_target_neighbors_per_entity: 0` option.
Prediction Output
The end result is a structured table, where each row represents a missing co-purchase item pair, complemented by a score for this (item, item) pair. These scores are the model’s prediction of how likely the two items are to be co-purchased together. Given 10 recommendations per item, the final output will contain 10 rows for each source item.
- `customers`: containing one row for each of your customers.
- `transactions`: containing one row for each transaction (purchase) made by each customer.
- `articles`: information about the products that the customer purchased.

Because `LIST_DISTINCT` and `RANK` result in a discrete list of articles, Kumo infers that this is a link prediction task. Intuitively, you are predicting which `article_id`s are most likely to have transactions for each user in the next 30 days.
As `LIST_DISTINCT` is an aggregation over time, this predictive query is a temporal task, and follows the same training process as described in the LTV example.
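Putting the pieces together, a sketch of this query might look as follows; the exact form of the 30-day window arguments is assumed to mirror the earlier aggregation examples.

```
PREDICT LIST_DISTINCT(transactions.article_id, 0, 30, days)
RANK TOP 25
FOR EACH customers.customer_id
```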
Prediction Output
The batch prediction output is a table with two columns: `customer_id` and `PREDICTED`, which contains the 25 articles that each customer is most likely to purchase in the next 30 days.
- `customers`: containing one row for each of your customers.
- `transactions`: containing one row for each transaction (purchase) made by each customer.
- `articles`: information about the products that the customer purchased.

The target is the list of distinct items purchased (`LIST_DISTINCT`) over the next seven days. When generating the training data, Kumo will first determine which past timestamps to generate the training data for. These are so-called “anchor times”. In this case, it will take a timestamp every seven days from the latest timestamp in the purchase history to the earliest. So, if the last purchase was made on 2019-12-31, the following anchor times will be generated: 2019-12-24, 2019-12-17, 2019-12-10, … For each pair (user, anchor time), a list of item purchases will be computed. The `RANK TOP 10` clause dictates that we are interested in exactly 10 item recommendations per user.
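A sketch of such an item recommendation query, assuming the same `transactions.article_id` target as in the previous example and a seven-day window, might be:

```
PREDICT LIST_DISTINCT(transactions.article_id, 0, 7, days)
RANK TOP 10
FOR EACH customers.customer_id
```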
The generated data will by default be split with a `DateOffsetSplit()`, which will put the final seven days of data (i.e., the last anchor time) into the test set, the second-to-last seven days into validation, and the rest into the training set. It is important to be mindful of the window size in link prediction tasks. In particular, a small aggregation window may result in too few positives.
For temporal link prediction tasks, two different module types are exposed, controlled via the `handle_new_target_entities` model planner option. By default, Kumo assumes a transductive link prediction setting for best model performance. Inductive link prediction usually results in somewhat lower model performance (since we cannot train corresponding lookup embeddings), but may be inevitable depending on the use case.
Prediction Output
When making predictions for the future, you need to specify the “batch prediction anchor time”—the point in time from which the prediction is made. By default, this will be the largest timestamp in the purchase table. The predictions are then made for one aggregation (7 days) from that point onward, one for each entity.
The end result is a structured table, where each row represents a recommended item for a user, complemented by a score for this (user, item) pair. These scores are the model’s prediction of how likely the user is to purchase the item in the next seven days. Given 10 recommendations per user, the final output will contain 10 rows for each user.
- `customers`: containing one row for each of your customers.
- `transactions`: containing one row for each transaction (purchase) made by each customer.
- `articles`: information about the products that the customer purchased.

The main difference from the previous recommendation queries is the `WHERE` clause within `LIST_DISTINCT`. This means that the training examples will only be created for transactions with `price > 100`.
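A sketch of this query is shown below. The placement of the `WHERE` filter inside `LIST_DISTINCT` is an assumption based on the description above, and the exact syntax may differ.

```
PREDICT LIST_DISTINCT(transactions.article_id WHERE transactions.price > 100, 0, 30, days)
RANK TOP 25
FOR EACH customers.customer_id
```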
Prediction Output
The batch prediction output is a table with two columns: `customer_id` and `PREDICTED`, which contains the 25 articles, with `price > 100`, that each customer is most likely to purchase in the next 30 days.