Predictive Queries are mainly used to generate a training table that gets attached to an underlying graph.

Static queries that don’t involve making predictions over some window of time do not require a time column in the target table(s); in these cases, by default, Kumo generates training/validation/holdout splits by randomly distributing the rows according to an 80/10/10 ratio. While the time column is not required, it is allowed, and you may also distribute the rows according to a specific time range.
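The random 80/10/10 split described above can be illustrated with a minimal Python sketch. This is not Kumo's implementation. The function name `random_split`, the fixed seed, and the use of integer row ids are assumptions made for illustration.

```python
import random

def random_split(row_ids, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle row ids and cut them into train/validation/holdout lists."""
    ids = list(row_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    n_train = int(len(ids) * ratios[0])
    n_val = int(len(ids) * ratios[1])
    return {
        "train": ids[:n_train],
        "validation": ids[n_train:n_train + n_val],
        "holdout": ids[n_train + n_val:],  # remainder goes to holdout
    }

splits = random_split(range(1000))
print({name: len(rows) for name, rows in splits.items()})
```

Because every row lands in exactly one list, the three splits are disjoint by construction.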

In contrast, temporal queries, which predict some aggregation of values over a time window (e.g., “purchases each customer will make over the next 7 days”), are more complex. Data splits need to be non-overlapping and properly ordered in time to prevent data leakage that could invalidate the predictions, and they should also be well-balanced in size. Kumo handles this automatically by generating training/validation/holdout splits based on the time column in your target table.
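As a rough illustration of a time-ordered, non-overlapping split, the sketch below assigns rows to splits using two cutoff timestamps. It is a simplified stand-in, not Kumo's actual logic. The function `temporal_split` and the cutoff values are hypothetical, and in Kumo the boundaries are chosen for you.

```python
from datetime import datetime

def temporal_split(rows, val_start, holdout_start):
    """Assign each (timestamp, payload) row to a split by its timestamp."""
    splits = {"train": [], "validation": [], "holdout": []}
    for ts, payload in rows:
        if ts < val_start:
            splits["train"].append((ts, payload))
        elif ts < holdout_start:
            splits["validation"].append((ts, payload))
        else:
            splits["holdout"].append((ts, payload))
    return splits

# One row per month of 2020; the cutoffs below are arbitrary example values.
rows = [(datetime(2020, m, 1), f"row{m}") for m in range(1, 13)]
splits = temporal_split(rows, datetime(2020, 9, 1), datetime(2020, 11, 1))

# Every training row precedes every validation row, which precedes every holdout row.
latest_train = max(ts for ts, _ in splits["train"])
earliest_val = min(ts for ts, _ in splits["validation"])
latest_val = max(ts for ts, _ in splits["validation"])
earliest_holdout = min(ts for ts, _ in splits["holdout"])
print(latest_train < earliest_val < latest_val < earliest_holdout)
```

Splitting on a timestamp rather than at random is what keeps later rows out of the training split.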

For more information about how to specify the Split you would like to use, refer to the documentation here.

Example: Predicting Customer Purchases Over 30 Days

The following Predictive Query predicts which customers will make no purchases over the next 30 days:

PQL
PREDICT COUNT(TRANSACTIONS.*, 0, 30, days) = 0
FOR EACH CUSTOMERS.CUSTOMER_ID

To generate training examples, Kumo travels back in time and “replays” user behavior at different past time points, computing the target at each point and sampling the results. It chooses the sampling rates and the split methodology automatically, based on your dataset and Predictive Query.

For temporal queries, Kumo ensures:

  • The holdout split occurs strictly later in time than the training split.

  • The training splits are balanced in size.

This process guards against data leakage and removes the errors that manual split setup tends to introduce.
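The “replay” idea for the no-purchase query can be sketched in Python as follows. This is an illustration under stated assumptions, not Kumo's internals. The function `no_purchase_labels`, the monthly anchor times, and the choice of a window that excludes the anchor and includes the end point are all hypothetical.

```python
from datetime import datetime, timedelta

def no_purchase_labels(transactions, customer_ids, anchor_times, horizon_days=30):
    """For each (customer, anchor time), label True when the customer has no
    transactions in the window (anchor, anchor + horizon_days]."""
    horizon = timedelta(days=horizon_days)
    examples = []
    for anchor in anchor_times:
        for cid in customer_ids:
            count = sum(
                1 for tx_cid, ts in transactions
                if tx_cid == cid and anchor < ts <= anchor + horizon
            )
            examples.append((cid, anchor, count == 0))
    return examples

# Toy data: (customer id, transaction timestamp).
transactions = [
    ("c1", datetime(2020, 1, 10)),
    ("c1", datetime(2020, 3, 5)),
    ("c2", datetime(2020, 2, 20)),
]
anchors = [datetime(2020, 1, 1), datetime(2020, 2, 1), datetime(2020, 3, 1)]
examples = no_purchase_labels(transactions, ["c1", "c2"], anchors)
for cid, anchor, label in examples:
    print(cid, anchor.date(), label)
```

Each anchor time yields one labeled example per customer, so a single customer contributes several training rows drawn from different points in their history.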


Example: Predicting Total Sales Over 30 Days

The following Predictive Query predicts the total sales value (the sum of transaction prices) per customer over the next 30 days:

PQL
PREDICT SUM(TRANSACTIONS.PRICE, 0, 30, days)
FOR EACH CUSTOMERS.CUSTOMER_ID

For this query, Kumo generates training/validation/holdout splits based on the time range of the transactions table.

For example, if your dataset spans September 20, 2018, to September 22, 2020, Kumo will:

  • Compute 30-day user spend at various past time points.

  • Automatically generate the appropriate sampling and training split methodology.

This spreads training examples across the full time range of the dataset while keeping the splits in chronological order.
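To make the date-range example concrete, the sketch below picks anchor dates between September 20, 2018 and September 22, 2020 and computes 30-day spend at each one. It is a hedged illustration only. The helpers `anchor_dates` and `spend_in_window`, the 30-day step between anchors, and the toy transactions are assumptions, and Kumo chooses its own time points and sampling.

```python
from datetime import date, timedelta

def anchor_dates(start, end, horizon_days=30, step_days=30):
    """Anchor dates whose full label window ends on or before the last date in the data."""
    anchors = []
    current = start
    while current + timedelta(days=horizon_days) <= end:
        anchors.append(current)
        current += timedelta(days=step_days)
    return anchors

def spend_in_window(transactions, customer_id, anchor, horizon_days=30):
    """Sum prices for one customer in the window (anchor, anchor + horizon_days]."""
    end = anchor + timedelta(days=horizon_days)
    return sum(
        price for cid, day, price in transactions
        if cid == customer_id and anchor < day <= end
    )

# Toy data: (customer id, transaction date, price).
transactions = [
    ("c1", date(2018, 10, 1), 20.0),
    ("c1", date(2018, 10, 15), 5.0),
    ("c1", date(2018, 11, 30), 7.5),
    ("c2", date(2018, 10, 5), 12.0),
]

anchors = anchor_dates(date(2018, 9, 20), date(2020, 9, 22))
targets = [(a, spend_in_window(transactions, "c1", a)) for a in anchors[:3]]
for anchor, spend in targets:
    print(anchor, spend)
```

Anchors too close to the end of the data are skipped because their 30-day window would extend past the last known transaction, which would understate the target.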