split: (ApproxDateOffsetSplit | DateOffsetSplit | RandomSplit | TemporalSplit | TimeRangeSplit) (Optional)
Description
Kumo generates a training table of entities and labels from your predictive query, which is then used for GNN training. By default, Kumo splits the generated training table into three disjoint sets (train, validation, and test). The default split method depends on the predictive query:
- When the predictive query is temporal node prediction, the default split method is ApproxDateOffsetSplit([0.8, 0.1, 0.1]).
- When the predictive query is temporal link prediction, the default split method is DateOffsetSplit([-target_aggregation_end_date, 0]).
- When the predictive query is static and the entity or the target table has a time column, the default split method is TemporalSplit([0.8, 0.1, 0.1]).
- When the predictive query is static and neither the entity table nor the target table has a time column, the default split method is RandomSplit([0.8, 0.1, 0.1]).
To override the default, specify a split value in the Model Planner input field.
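The default selection rules described above can be summarized as a small decision function. This is only an illustrative sketch: the function default_split and its boolean parameters are hypothetical names introduced here, not part of Kumo's API, and the returned strings simply repeat the defaults listed above.

```python
def default_split(is_temporal: bool, is_link_prediction: bool, has_time_column: bool) -> str:
    """Return the default split described in this section (hypothetical helper).

    is_temporal:        the predictive query is temporal (vs. static)
    is_link_prediction: the query is link prediction (vs. node prediction)
    has_time_column:    the entity table or the target table has a time column
    """
    if is_temporal:
        if is_link_prediction:
            return "DateOffsetSplit([-target_aggregation_end_date, 0])"
        return "ApproxDateOffsetSplit([0.8, 0.1, 0.1])"
    # Static query: prefer a temporal split when a time column is available.
    if has_time_column:
        return "TemporalSplit([0.8, 0.1, 0.1])"
    return "RandomSplit([0.8, 0.1, 0.1])"


if __name__ == "__main__":
    # Static query with no time column falls back to a random split.
    print(default_split(is_temporal=False, is_link_prediction=False, has_time_column=False))
```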
Supported Task Types
- All
Supported Query Types
- ApproxDateOffsetSplit: Supports temporal queries only.
- DateOffsetSplit: Supports temporal queries only.
- RandomSplit: Supports all query types; best suited to static queries.
- TemporalSplit: Supports all query types; best suited to temporal queries. The entity table must have a Create Date column.
- TimeRangeSplit: Supports all query types; best suited to temporal queries. The entity table must have a Create Date column.
Methods
| Method | Purpose | Details | 
|---|---|---|
| RandomSplit([train_ratio, val_ratio, test_ratio]) | Defines a random split of the training table according to train_ratio, val_ratio and test_ratio. | Kumo first shuffles all the training data, then selects the first int(len(train_table) * train_ratio) rows as train data. From the remaining rows, it selects the first int(len(train_table) * val_ratio) rows as val data. From the rows left after removing both train and val data, it selects the first int(len(train_table) * test_ratio) rows as test data. |
| DateOffsetSplit([val_offset, test_offset], unit) | Defines a date offset split of the training table according to the (relative) offsets from the max date in the training table. | Training data with an anchor date greater than or equal to max_date - target_aggregation_end_date + test_offset and a prediction horizon end date less than or equal to max_date becomes test data. Training data with an anchor date greater than or equal to max_date - target_aggregation_end_date + val_offset and a prediction horizon end date less than or equal to max_date - target_aggregation_end_date + test_offset becomes val data. The rest becomes train data. unit defines the time unit of val_offset and test_offset. By default, unit is the same as the one used in the target aggregation; it can be set to months or hours if needed. |
| TemporalSplit([train_ratio, val_ratio, test_ratio]) | Defines a temporal split of the training table according to train_ratio, val_ratio and test_ratio. | Kumo first sorts the training table by its time column, then selects the first int(len(train_table) * train_ratio) rows as train data. From the remaining rows, it selects the first int(len(train_table) * val_ratio) rows as val data. From the rows left after removing both train and val data, it selects the first int(len(train_table) * test_ratio) rows as test data. |
| TimeRangeSplit([(train_date_start, train_date_end), (val_data_start, val_date_end), (test_date_start, test_date_end)]) | Defines a time range split of the training table according to a list of given start and end times: train_date_start, train_date_end, val_data_start, val_date_end, test_date_start and test_date_end. Format: YYYY-MM-DD or YYYY-MM-DDTHHMMSS | Kumo uses the three time ranges as the exact start and end dates of the train, val, and test sets. The data-generating procedure (performed separately for each of the three sets) is the same as for TemporalSplit and DateOffsetSplit, except that max_timestamp in the data is ignored in favor of the user-defined end of each interval. Additionally, for target aggregations with a non-zero start offset, Kumo ignores the first offset-worth of data to avoid data leakage. |
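
The row-count arithmetic described for RandomSplit and TemporalSplit can be illustrated with a short sketch. This is not Kumo's implementation: the functions ratio_split, random_split and temporal_split, the seed argument, and the time_key argument are hypothetical names chosen for this example, and rows are modeled as a plain Python list.

```python
import random
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def ratio_split(rows: Sequence[T], ratios: Sequence[float]) -> Tuple[List[T], List[T], List[T]]:
    """Slice already-ordered rows into consecutive train, val and test chunks."""
    train_ratio, val_ratio, test_ratio = ratios
    n = len(rows)
    # Each size is computed against the full table length, as in the table above.
    n_train = int(n * train_ratio)
    n_val = int(n * val_ratio)
    n_test = int(n * test_ratio)
    train = list(rows[:n_train])
    val = list(rows[n_train:n_train + n_val])
    test = list(rows[n_train + n_val:n_train + n_val + n_test])
    return train, val, test


def random_split(rows: Sequence[T], ratios: Sequence[float], seed: int = 0):
    """Shuffle first, then slice (seed is an assumption added for reproducibility)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return ratio_split(shuffled, ratios)


def temporal_split(rows: Sequence[T], ratios: Sequence[float], time_key: Callable[[T], object]):
    """Sort by the time column first, then slice, so later rows land in val and test."""
    ordered = sorted(rows, key=time_key)
    return ratio_split(ordered, ratios)


if __name__ == "__main__":
    # Ten rows whose value doubles as their timestamp.
    rows = [5, 3, 1, 4, 2, 0, 9, 8, 7, 6]
    train, val, test = temporal_split(rows, [0.8, 0.1, 0.1], time_key=lambda r: r)
    print(len(train), len(val), len(test))
```

Because each size is truncated with int(), a table whose length is not evenly divisible by the ratios can leave a few trailing rows unassigned in this sketch.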