Hash encoder if the column represents a high-cardinality identifier like product_code, Datetime encoder if it represents a Unix timestamp, Numerical encoder if it represents a quantity such as num_visits, or Index encoder if it represents a boolean True/False value. The Kumo AutoML algorithm fully automates this process for you for all possible input data types, including text, numbers, categories, strings, and even arrays.
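The point of the encoder rule above is that the choice depends on what a column means, not on how it is stored: product_code, a Unix timestamp, and num_visits may all be integers yet call for different encoders. The sketch below is only an illustration of that rule and is not Kumo's AutoML logic or API. The mapping, the choose_encoder helper, the meaning labels, and the column names created_at and is_active are all hypothetical; only the four encoder names come from the text.

```python
# Hypothetical lookup from a column's meaning to the encoder named in the text.
# This is an illustrative rule table, not Kumo's actual selection algorithm.
ENCODER_BY_MEANING = {
    "high_cardinality_id": "Hash",
    "unix_timestamp": "Datetime",
    "quantity": "Numerical",
    "boolean": "Index",
}


def choose_encoder(meaning: str) -> str:
    """Return the encoder name for a declared column meaning."""
    try:
        return ENCODER_BY_MEANING[meaning]
    except KeyError:
        raise ValueError(f"no encoder rule for meaning: {meaning!r}") from None


# The first three columns could all be stored as integers, so the storage
# type alone cannot decide the encoder; the declared meaning does.
column_meanings = {
    "product_code": "high_cardinality_id",
    "created_at": "unix_timestamp",  # hypothetical column name
    "num_visits": "quantity",
    "is_active": "boolean",          # hypothetical column name
}

plan = {name: choose_encoder(meaning) for name, meaning in column_meanings.items()}
for name, encoder in plan.items():
    print(f"{name}: {encoder}")
```

Doing this by hand means someone has to declare the meaning of every column, which is the manual step the text says is automated.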
Using the TimeRangeSplit module to specify the exact holdout dataset is common practice for comparing model performance against an existing model trained outside of Kumo. You can also enforce additional constraints required by your organization (e.g., ensuring a sufficiently large gap between the training dataset and the holdout dataset).
refit option, which trains over the entire dataset, at the cost of losing evaluation metrics on the holdout dataset.
0, or enable a more expensive NLP encoding method for a particular text column that you feel is particularly important.
tune_metric option to change the behavior. This is particularly useful in the case of recommendation problems, where you can use the module option to optimize the recommender for different goals (such as diversity vs. recall).
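To make the holdout constraint mentioned earlier concrete (an exact holdout window, with a gap separating it from the training data), here is a minimal sketch in plain Python. It is not the TimeRangeSplit module and does not use any Kumo API. The function split_by_time_range, its parameters, and the sample rows are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def split_by_time_range(rows, holdout_start, holdout_end,
                        gap=timedelta(0), time_key="timestamp"):
    """Hypothetical helper: split rows into (train, holdout) by timestamp.

    holdout: holdout_start <= t < holdout_end
    train:   t < holdout_start - gap
    Rows inside the gap, or at/after holdout_end, are dropped.
    """
    train_cutoff = holdout_start - gap
    train, holdout = [], []
    for row in rows:
        t = row[time_key]
        if holdout_start <= t < holdout_end:
            holdout.append(row)
        elif t < train_cutoff:
            train.append(row)
    return train, holdout


# Sixty days of made-up daily rows starting on 2024-01-01.
rows = [
    {"timestamp": datetime(2024, 1, 1) + timedelta(days=d), "num_visits": d}
    for d in range(60)
]

train, holdout = split_by_time_range(
    rows,
    holdout_start=datetime(2024, 2, 1),
    holdout_end=datetime(2024, 2, 15),
    gap=timedelta(days=7),  # keep a one-week buffer before the holdout window
)
print(len(train), len(holdout))
```

Fixing the holdout window explicitly is what makes a comparison with an externally trained model fair, since both models are scored on the same rows. The gap ensures the last training rows sit at least a week before the first holdout row.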