split: (ApproxDateOffsetSplit | DateOffsetSplit | RandomSplit | TemporalSplit | TimeRangeSplit) (Optional)
Example values:
ApproxDateOffsetSplit([0.8, 0.1, 0.1])
DateOffsetSplit([-target_aggregation_end_date, 0])
TemporalSplit([0.8, 0.1, 0.1])
RandomSplit([0.8, 0.1, 0.1])
Specify the split value in the Model Planner input field.
- ApproxDateOffsetSplit: Supports only temporal queries.
- DateOffsetSplit: Supports only temporal queries.
- RandomSplit: Supports all query types; more suitable for static queries.
- TemporalSplit: Supports all query types; more suitable for temporal queries. The entity table needs to have a Create Date column.
- TimeRangeSplit: Supports all query types; more suitable for temporal queries. The entity table needs to have a Create Date column.

Method | Purpose | Details |
---|---|---|
RandomSplit([train_ratio, val_ratio, test_ratio]) | Defines a random split of the training table according to train_ratio, val_ratio, and test_ratio. | Kumo first shuffles all the training data, then selects the first int(len(train_table) * train_ratio) rows as train data. From the remaining rows, it selects the first int(len(train_table) * val_ratio) rows as val data, and from what remains after that, the first int(len(train_table) * test_ratio) rows as test data. |
DateOffsetSplit([val_offset, test_offset], unit) | Defines a date offset split of the training table according to (relative) offsets from the max date in the training table. | Training data with an anchor date greater than or equal to max_date - target_aggregation_end_date + test_offset and a prediction horizon end date less than or equal to max_date is test data. Training data with an anchor date greater than or equal to max_date - target_aggregation_end_date + val_offset and a prediction horizon end date less than or equal to max_date - target_aggregation_end_date + test_offset is val data. The rest is train data. unit defines the time unit of val_offset and test_offset. It defaults to the unit used in the target aggregation and can be set to months or hours if needed. |
TemporalSplit([train_ratio, val_ratio, test_ratio]) | Defines a temporal split of the training table according to train_ratio, val_ratio, and test_ratio. | Kumo first sorts the training table by its time column, then selects the first int(len(train_table) * train_ratio) rows as train data. From the remaining rows, it selects the first int(len(train_table) * val_ratio) rows as val data, and from what remains after that, the first int(len(train_table) * test_ratio) rows as test data. |
TimeRangeSplit([(train_date_start, train_date_end), (val_data_start, val_date_end), (test_date_start, test_date_end)]) | Defines a time range split of the training table according to a list of given start and end times: train_date_start, train_date_end, val_data_start, val_date_end, test_date_start, and test_date_end. Format: YYYY-MM-DD or YYYY-MM-DDTHHMMSS | Kumo uses the three time ranges to specify the exact start and end dates of the train, validation, and test sets. The data-generating procedure (performed separately for each of the three sets) is the same as for TemporalSplit and DateOffsetSplit, except that max_timestamp in the data is ignored in favor of the user-defined end of each interval. Additionally, for target aggregations with a non-zero start offset, Kumo ignores the first offset-worth of data to avoid data leakage. |
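
The row-count arithmetic described for RandomSplit and TemporalSplit can be sketched in plain Python. This is an illustration only, not Kumo's implementation: the function name ratio_split, its parameters, and the list-of-dicts table are all hypothetical, and the sketch assumes that every count is computed from the full table length, as the Details column states.

```python
import random


def ratio_split(rows, ratios, shuffle=False, sort_key=None, seed=0):
    """Hypothetical helper mirroring the ratio-based split described above.

    shuffle=True approximates RandomSplit; passing sort_key approximates
    TemporalSplit (rows ordered by their time column first).
    """
    rows = list(rows)  # copy so the caller's table is left untouched
    n = len(rows)
    if shuffle:
        random.Random(seed).shuffle(rows)
    elif sort_key is not None:
        rows.sort(key=sort_key)

    # Each count uses the full table length, truncated with int().
    n_train, n_val, n_test = (int(n * r) for r in ratios)

    train = rows[:n_train]
    val = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:n_train + n_val + n_test]
    return train, val, test


# A made-up table of 25 rows; "time" stands in for the time column.
table = [{"id": i, "time": 100 - i} for i in range(25)]

t_train, t_val, t_test = ratio_split(
    table, [0.8, 0.1, 0.1], sort_key=lambda r: r["time"]
)
r_train, r_val, r_test = ratio_split(table, [0.8, 0.1, 0.1], shuffle=True)

print(len(t_train), len(t_val), len(t_test))  # 20 2 2
```

Because each count is truncated with int(), the three parts need not cover the whole table: with 25 rows and ratios of 0.8, 0.1, and 0.1, the sketch assigns 20, 2, and 2 rows and leaves one row unassigned. How Kumo treats such leftover rows is not stated here.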