Description
column_processing configures how Kumo encodes and imputes columns.
na_strategy (Optional)
Use na_strategy to set global N/A handling by semantic type.
Supported Task Types
- All
Example
This example applies na_strategy="separate" to all numerical columns and na_strategy="most_frequent" to all categorical columns.
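To make the two strategies concrete, here is a minimal plain-Python sketch of what they could look like on raw values. It is an illustration only, not Kumo's implementation: the helper names `fill_most_frequent` and `separate_missing` are hypothetical, and representing "separate" for numerical values as a placeholder plus a missing flag is an assumption about one common approach.

```python
from collections import Counter


def fill_most_frequent(values):
    """Replace None with the most common non-missing category."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def separate_missing(values):
    """Pair each number with a flag marking whether it was missing.

    Assumption: missing entries get a 0.0 placeholder and flag=True, so a
    model can learn from the missingness itself.
    """
    return [(0.0, True) if v is None else (float(v), False) for v in values]


genres = fill_most_frequent(["drama", None, "comedy", "drama"])
budgets = separate_missing([10.0, None, 30.0])
```

Here the missing genre becomes "drama" (the most frequent value), while the missing budget keeps a marker instead of being silently replaced.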
Supported Semantic Types
- numerical
- categorical
- ID
- multicategorical
- text
- timestamp
- sequence
encoder_overrides (Optional)
The encoder_overrides field allows you to configure the way Kumo processes your input data and override the encoders that are inferred by Kumo.
Supported Task Types
- All
Example
This example overrides three of the columns in the MOVIES table to change the default behavior.
- For the text column MOVIES.overview, it uses the model "glove.42B" instead of the default "glove.6B". See the “GloVe Argument Combinations” section below on this page for available options, and see the GloVe project for details.
- Kumo ignores the column MOVIES.tag_line, and the column has no impact on modeling, because it is set to Null().
- For the numerical column MOVIES.budget, Kumo scales numerical values to the range [0, 1], because it is set to "minmax" instead of the default "standard".
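The difference between the "minmax" and "standard" scalers can be sketched in a few lines of plain Python. This is a hedged illustration of the textbook formulas, not Kumo's code: the function names are hypothetical, and using the population standard deviation for "standard" is an assumption.

```python
from statistics import mean, pstdev


def minmax_scale(values):
    """Map values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standard_scale(values):
    """Shift to zero mean and scale to unit (population) variance."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


budgets = [10.0, 20.0, 40.0]
scaled = minmax_scale(budgets)
standardized = standard_scale(budgets)
```

With "minmax", the smallest budget maps to 0 and the largest to 1; with "standard", the results are centered around 0 and may be negative.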
Precedence
Kumo applies column_processing overrides in this order:
1. na_strategy global semantic-type overrides.
2. encoder_overrides per-column overrides.
If both apply to the same column, encoder_overrides wins.
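The precedence rule can be pictured as a simple lookup chain. The sketch below is an assumption-laden illustration, not Kumo's resolution logic: `resolve_na_strategy`, its parameters, and the column MOVIES.revenue are hypothetical names introduced only for this example.

```python
def resolve_na_strategy(column, semantic_type, global_by_type, per_column, encoder_default):
    """Pick the N/A strategy for one column.

    Per-column overrides win over global semantic-type settings, which in
    turn win over the encoder's own default.
    """
    if column in per_column:
        return per_column[column]
    if semantic_type in global_by_type:
        return global_by_type[semantic_type]
    return encoder_default


# Hypothetical settings mirroring the two levels described above.
global_by_type = {"numerical": "separate", "categorical": "most_frequent"}
per_column = {"MOVIES.budget": "mean"}

budget = resolve_na_strategy("MOVIES.budget", "numerical", global_by_type, per_column, "mean")
revenue = resolve_na_strategy("MOVIES.revenue", "numerical", global_by_type, per_column, "mean")
```

In this sketch, MOVIES.budget resolves to "mean" because of its per-column override, while MOVIES.revenue falls back to the global numerical setting, "separate".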
Supported Encoders
| Column Type | Encoder | Argument | Default | Supported Value | Description |
|---|---|---|---|---|---|
| Categorical, ID | Index | na_strategy | "separate" | "zero", "separate", "most_frequent" | When set to "zero", embeddings for missing values are represented as zero vectors. When set to "separate", missing values are treated as a distinct category. When set to "most_frequent", missing values are assigned to the most prevalent category. |
| Categorical, ID | Index | min_occ | 1 | positive integer | The minimal count to allow within each category. If a category count is lower than min_occ, Kumo treats the category as N/A. |
| Categorical, ID | Hash | na_strategy | "separate" | "zero", "separate", "most_frequent" | When set to "zero", embeddings for missing values are represented as zero vectors. When set to "separate", missing values are treated as a distinct category. When set to "most_frequent", missing values are assigned to the most prevalent category. |
| Categorical, ID | Hash | num_components | Depends on cardinality of the column | positive integer | The capacity of the hash table. |
| Categorical, ID | Hash | min_occ | Depends on cardinality of the column | positive integer | The minimal count to allow within each category. If a category count is lower than min_occ, Kumo treats the category as N/A. |
| Categorical, ID | Hash | na_strategy | "zero" | "zero", "separate", "most_frequent" | When set to "zero", embeddings for missing values are represented as zero vectors. When set to "separate", missing values are treated as a distinct category. When set to "most_frequent", missing values are assigned to the most prevalent category. |
| Multicategorical | MultiCategorical | min_occ | 1 | positive integer | The minimal count to allow within each category. If a category count is lower than min_occ, Kumo treats the category as N/A. |
| Multicategorical | MultiCategorical | sep | Inferred by Kumo | string | The separator to use. |
| Numerical | Numerical | scaler | None | None, "standard", "minmax", "robust" | When set to None, no transformation is applied to the column values. When set to "standard", the column values are transformed to have zero mean and unit variance. When set to "minmax", the values are scaled to fall within the range [0, 1]. When set to "robust", the feature’s median is subtracted from each value and the result is divided by the interquartile range. |
| Numerical | Numerical | na_strategy | "mean" | "mean", "zero", "separate" | If "mean", N/A values are replaced with the mean value of the column. If "zero", N/A values are replaced with zero. If "separate", missingness is preserved as a dedicated learnable signal. |
| Numerical | MaxLogNumerical | na_strategy | "mean" | "mean", "zero" | If "mean", N/A values are replaced with the mean value of the column. If "zero", N/A values are replaced with zero. |
| Numerical | MinLogNumerical | na_strategy | "mean" | "mean", "zero" | If "mean", N/A values are replaced with the mean value of the column. If "zero", N/A values are replaced with zero. |
| Embedding | NumericalList | na_strategy | "zero" | "zero" | If "zero", N/A values are replaced with zero. |
| Timestamp | Datetime | include_minute | true | true, false | Whether to include minute. |
| Timestamp | Datetime | include_hour | true | true, false | Whether to include hour. |
| Timestamp | Datetime | include_day_of_week | true | true, false | Whether to include day of week. |
| Timestamp | Datetime | include_day_of_month | true | true, false | Whether to include day of month. |
| Timestamp | Datetime | include_day_of_year | true | true, false | Whether to include day of year. |
| Timestamp | Datetime | include_year | true | true, false | Whether to include year. |
| Timestamp | Datetime | num_year_periods | Depends on the difference between the min and max year in the column | positive integer | The number of periods to consider for encoding years, e.g., in case num_year_periods=4, year is encoded as year % i for each i in { 2, 4, 8, 16 }. If set to None, it will be inferred based on dataset statistics. |
| Text | GloVe | model_name | "glove.6B" | "glove.6B", "glove.42B", "glove.840B", "glove_twitter.27B" | The pretrained model name. |
| Text | GloVe | embedding_dim | 50 | 25, 50, 100, 200, 300 | The embedding dimension of the pretrained model. Note that not every model supports every embedding dimension. See the GloVe Argument Combinations table below. |
| Any type | Null | n/a | n/a | n/a | If Null is specified for a column, Kumo ignores the column completely. |
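The num_year_periods description can be illustrated with a short sketch. It assumes the periods are successive powers of two (2, 4, 8, 16 for num_year_periods=4), which generalizes the single example given in the table; the function name `year_period_features` is hypothetical and this is not Kumo's encoder.

```python
def year_period_features(year, num_year_periods):
    """Return year % 2**k for k = 1..num_year_periods.

    Assumption: periods are powers of two, matching the documented
    example {2, 4, 8, 16} for num_year_periods=4.
    """
    return [year % (2 ** k) for k in range(1, num_year_periods + 1)]


features_2023 = year_period_features(2023, 4)  # periods 2, 4, 8, 16
features_2024 = year_period_features(2024, 4)
```

Each extra period lets the encoding distinguish years over a longer cycle, which is presumably why the default depends on the span between the minimum and maximum year in the column.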
GloVe Argument Combinations
| model_name | embedding_dim |
|---|---|
| "glove.6B" | 50, 100, 200, 300 |
| "glove.42B" | 300 |
| "glove.840B" | 300 |
| "glove_twitter.27B" | 25, 50, 100, 200 |