Reducing the Time Range of Training Data
By default, Kumo will use all of the data in your graph for training. So, if you have 10 years of event logs, Kumo will train on all 10 years. This certainly increases computation cost at training time, and is often unnecessary during early model development. For example, if the past 2 years of data contain enough training examples, it will be faster to develop the model on 2 years of data rather than 10. To reduce the range of training data, you can use the following options (see the sketch after this list):
- train_start_offset: Controls the number of days of training data to generate.
- TimeRangeSplit: Allows you to control the exact time range of the data in the training/validation/holdout sets.
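As a minimal sketch of what these two options mean, the snippet below illustrates their semantics in plain Python. It is not the Kumo SDK: the option names `train_start_offset` and `TimeRangeSplit` come from this guide, but the anchor date, split windows, and all surrounding code are illustrative assumptions.

```python
from datetime import date, timedelta

# Hypothetical sketch: NOT the Kumo SDK, just an illustration of the
# semantics of the two time-range options named above.

TODAY = date(2024, 1, 1)  # assumed anchor date for this example

# train_start_offset: only generate training examples from the last N days,
# rather than from the full (e.g. 10-year) event history.
train_start_offset = 730  # roughly 2 years
train_start = TODAY - timedelta(days=train_start_offset)
print(f"Training examples generated from {train_start} onward")

# TimeRangeSplit: pin the exact windows used for each split.
time_range_split = {
    "train":      (date(2022, 1, 1), date(2023, 6, 30)),
    "validation": (date(2023, 7, 1), date(2023, 9, 30)),
    "holdout":    (date(2023, 10, 1), date(2023, 12, 31)),
}
for split, (start, end) in time_range_split.items():
    print(f"{split}: {start} -> {end}")
```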
Downsample your input data
If you have particularly large input data, you may want to downsample it further, even before connecting it to Kumo. For example, if you are using Snowflake as a data source, you can create a Snowflake view for each of your input tables, filtering it down to a specific time range (e.g., 1 year of data) or randomly sampling a subset of users. Note that when downsampling your data, you must use a principled sampling approach. For example, you cannot randomly select 10% of rows from all of your tables, as this will lead to many "incomplete" entries in your data, such as users that are missing transaction histories. It is recommended to downsample by the entity of your predictive query, such as "all rows in all tables where `user_id % 10 == 1`".
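As a sketch of this kind of entity-based downsampling, the snippet below uses the snowflake-connector-python package to create filtered views. The connection parameters and the table and column names (`users`, `sessions`, `transactions`, `user_id`, `event_timestamp`) are placeholder assumptions; substitute your own schema.

```python
import snowflake.connector

# Connection parameters are placeholders; fill in your own account details.
conn = snowflake.connector.connect(
    user="YOUR_USER",
    password="YOUR_PASSWORD",
    account="YOUR_ACCOUNT",
    warehouse="YOUR_WAREHOUSE",
    database="YOUR_DATABASE",
    schema="PUBLIC",
)
cur = conn.cursor()

# Downsample by entity: keep ~10% of users, and keep ALL of their rows in
# every table, so no retained user ends up with an incomplete history.
for table in ("users", "sessions"):
    cur.execute(f"""
        CREATE OR REPLACE VIEW {table}_sampled AS
        SELECT * FROM {table}
        WHERE MOD(user_id, 10) = 1
    """)

# For large event tables, also restrict to a recent time range (e.g. 1 year).
cur.execute("""
    CREATE OR REPLACE VIEW transactions_sampled AS
    SELECT * FROM transactions
    WHERE MOD(user_id, 10) = 1
      AND event_timestamp >= DATEADD(year, -1, CURRENT_TIMESTAMP())
""")

cur.close()
conn.close()
```

Connecting Kumo to the `*_sampled` views instead of the base tables keeps every retained user's history complete while cutting row counts roughly tenfold.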