While Kumo automatically infers encoders based on each column’s dtype and stype, you can override the encoder for individual columns via ColumnProcessingPlan. The encoder you specify must be compatible with the column’s semantic type.
Enums
NAStrategy
Strategy for imputing missing values.
| Value | Description |
|---|---|
| ZERO | Fill missing values with zero. |
| MEAN | Fill missing values with the column mean. |
| SEPARATE | Treat missing values as a separate category. |
| MOST_FREQUENT | Fill with the most frequent value. |
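To make the four strategies concrete, here is a minimal plain-Python sketch of what each one does to a small column. It is not Kumo's implementation: the `impute` helper, the use of `None` as the missing marker, and the `"<missing>"` placeholder category are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean

MISSING = None  # stand-in for a missing value in this sketch


def impute(values, strategy):
    """Illustrative only: fill missing entries according to a strategy name."""
    present = [v for v in values if v is not MISSING]
    if strategy == "ZERO":
        fill = 0
    elif strategy == "MEAN":
        fill = mean(present)
    elif strategy == "MOST_FREQUENT":
        fill = Counter(present).most_common(1)[0][0]
    elif strategy == "SEPARATE":
        fill = "<missing>"  # hypothetical placeholder category
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is MISSING else v for v in values]


numbers = [1.0, None, 3.0]
colors = ["red", None, "red", "blue"]

zero_filled = impute(numbers, "ZERO")
mean_filled = impute(numbers, "MEAN")
mode_filled = impute(colors, "MOST_FREQUENT")
separate_filled = impute(colors, "SEPARATE")
```

ZERO and MEAN only make sense for numerical data, while SEPARATE is the natural fit for categorical columns, which is why the sketch applies them to different example columns.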
Scaler
Scaling strategy for numerical features.
| Value | Description |
|---|---|
| STANDARD | Z-score normalization (equivalent to scikit-learn StandardScaler). |
| MINMAX | Min-max scaling to [0, 1] (equivalent to MinMaxScaler). |
| ROBUST | Robust scaling using median and IQR (equivalent to RobustScaler). |
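The three scalers differ only in which statistics they subtract and divide by. The sketch below shows the arithmetic using the Python standard library. The function names are hypothetical, the quartiles use linear interpolation, and constant columns (zero spread) are not handled.

```python
from statistics import mean, median, pstdev, quantiles


def standard_scale(xs):
    # Z-score: subtract the mean, divide by the population standard deviation.
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]


def minmax_scale(xs):
    # Map the observed minimum to 0 and the observed maximum to 1.
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def robust_scale(xs):
    # Subtract the median, divide by the interquartile range (Q3 - Q1).
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")
    med = median(xs)
    return [(x - med) / (q3 - q1) for x in xs]


data = [1.0, 2.0, 3.0, 4.0, 5.0]
standard = standard_scale(data)
minmax = minmax_scale(data)
robust = robust_scale(data)
```

Because ROBUST relies on the median and IQR rather than the mean and extremes, a single outlier shifts its output far less than it shifts STANDARD or MINMAX.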
Encoders
Null
Skips encoding for the column entirely. Supported on all semantic types.
Numerical
Encodes a numerical column with optional scaling and missing value imputation.
The scaling strategy. Kumo infers a suitable scaler if not specified.
The missing value imputation strategy.
MaxLogNumerical
Applies a log transformation after clipping values at the column maximum. Useful for heavy-tailed distributions.
The missing value imputation strategy.
MinLogNumerical
Applies a log transformation after clipping values at the column minimum.
The missing value imputation strategy.
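The descriptions of the two log encoders do not spell out an exact formula, so the following is only one plausible reading, not Kumo's confirmed behavior. It assumes that MinLogNumerical clips from below at the column minimum and applies `log(1 + (x - min))`, and that MaxLogNumerical mirrors this with `log(1 + (max - x))`. The function names are hypothetical.

```python
import math


def min_log_encode(values, col_min):
    # Assumed reading: clip from below at the column minimum, shift so the
    # minimum maps to 0, then apply log(1 + x).
    return [math.log1p(max(v, col_min) - col_min) for v in values]


def max_log_encode(values, col_max):
    # Assumed mirror image: clip from above at the column maximum, measure
    # the distance down from it, then apply log(1 + x).
    return [math.log1p(col_max - min(v, col_max)) for v in values]


# Statistics are taken from a "training" column, then applied to new values.
train = [0.0, 9.0, 99.0]
col_min, col_max = min(train), max(train)

min_log = min_log_encode([-5.0, 0.0, 9.0, 99.0], col_min)
max_log = max_log_encode([99.0, 500.0, 90.0], col_max)
```

Under this reading, the clip guarantees that the argument of the logarithm is never below 1, so values outside the range seen at training time cannot produce undefined results.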
Index
Encodes a categorical or ID column as a learned embedding index. Rare values (those occurring fewer than min_occ times) are mapped to a shared out-of-vocabulary embedding.
Minimum occurrence count for a value to receive its own embedding. Values appearing fewer times are treated as out-of-vocabulary.
The missing value imputation strategy.
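The effect of min_occ can be sketched as a vocabulary-building step: values that occur often enough get their own index, and everything else shares one out-of-vocabulary slot. This is an illustration under assumptions, not Kumo's code. The helper names and the choice of index 0 for the out-of-vocabulary slot are hypothetical.

```python
from collections import Counter

OOV = 0  # shared out-of-vocabulary index in this sketch


def build_vocab(values, min_occ):
    # Keep only values seen at least min_occ times; indices start at 1
    # because 0 is reserved for out-of-vocabulary values.
    counts = Counter(values)
    kept = sorted(v for v, c in counts.items() if c >= min_occ)
    return {v: i + 1 for i, v in enumerate(kept)}


def encode(values, vocab):
    # Rare or unseen values fall back to the shared OOV index.
    return [vocab.get(v, OOV) for v in values]


column = ["a", "b", "a", "c", "b", "a"]
vocab = build_vocab(column, min_occ=2)
indices = encode(column + ["z"], vocab)
```

Here "c" occurs only once and "z" was never seen, so both map to the same index. A higher min_occ shrinks the embedding table at the cost of merging more values together.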
Hash
Encodes a categorical column via feature hashing. Suitable for very high-cardinality columns.
The number of hash buckets (output dimensionality).
The missing value imputation strategy.
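Feature hashing avoids storing a vocabulary by mapping each value directly to one of a fixed number of buckets. The sketch below shows the general technique with a one-hot output. The hash function Kumo uses is not stated here, so the MD5-based `hash_bucket` helper is purely an assumption chosen because it is stable across processes.

```python
import hashlib


def hash_bucket(value, num_buckets):
    # Use a stable digest: Python's built-in hash() is salted per process
    # for strings, so it would not be reproducible between runs.
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def hash_encode(value, num_buckets):
    # One-hot vector whose length equals the number of buckets.
    vec = [0] * num_buckets
    vec[hash_bucket(value, num_buckets)] = 1
    return vec


vec = hash_encode("user_12345", num_buckets=8)
same = hash_encode("user_12345", num_buckets=8)
```

Distinct values can collide in the same bucket. Increasing the number of buckets reduces collisions but enlarges the output.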
MultiCategorical
Encodes a multi-categorical column (a string containing multiple space- or comma-separated categories) as a bag-of-categories embedding.
Minimum occurrence count for a category to receive its own embedding.
The missing value imputation strategy.
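As a rough picture of the bag-of-categories idea, the sketch below splits each cell on spaces or commas, keeps categories that meet a minimum occurrence count, and produces a multi-hot membership vector per row. The helper names are hypothetical, and the multi-hot vector stands in for whatever embedding aggregation Kumo applies internally.

```python
import re
from collections import Counter


def split_categories(cell):
    # Split on any run of commas and/or whitespace, dropping empty tokens.
    return [tok for tok in re.split(r"[,\s]+", cell.strip()) if tok]


def build_category_vocab(cells, min_occ):
    # Count each category at most once per row, then filter by min_occ.
    counts = Counter(tok for cell in cells for tok in set(split_categories(cell)))
    kept = sorted(t for t, c in counts.items() if c >= min_occ)
    return {t: i for i, t in enumerate(kept)}


def multi_hot(cell, vocab):
    # 1 at the position of every kept category present in the cell.
    vec = [0] * len(vocab)
    for tok in split_categories(cell):
        if tok in vocab:
            vec[vocab[tok]] = 1
    return vec


cells = ["action, comedy", "comedy drama", "action,drama,noir"]
category_vocab = build_category_vocab(cells, min_occ=2)
encoded = [multi_hot(c, category_vocab) for c in cells]
```

In this example "noir" appears in only one row, so with a minimum occurrence of 2 it is dropped from the vocabulary and does not contribute to that row's vector.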
GloVe
Encodes a text column using pre-trained GloVe word embeddings.
The GloVe model identifier.
The embedding dimensionality.
The missing value imputation strategy.
NumericalList
Encodes a sequence or embedding column (a list of floats) with optional scaling.
The scaling strategy applied to each element.
The missing value imputation strategy.
Datetime
Encodes a timestamp column by decomposing it into cyclical calendar features. Each component is optional and enabled by default.
Include the minute-of-hour component.
Include the hour-of-day component.
Include the day-of-week component.
Include the day-of-month component.
Include the day-of-year component.
Include the year component.
If specified, encodes year as a cyclical feature over this many periods.
The missing value imputation strategy.
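A common way to build cyclical calendar features is to map each component onto a circle with a sine/cosine pair, so that, for example, hour 23 ends up close to hour 0. The sketch below illustrates that general technique. Whether Kumo uses exactly these periods or this sine/cosine form is an assumption, the function names are hypothetical, and the year component is left out because its handling is not detailed here.

```python
import math
from datetime import datetime


def cyclical(value, period):
    # Project a periodic value onto the unit circle.
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def encode_timestamp(ts):
    # Periods below are assumptions for illustration (31-day month,
    # 366-day year); each entry is a (sin, cos) pair.
    return {
        "minute": cyclical(ts.minute, 60),
        "hour": cyclical(ts.hour, 24),
        "day_of_week": cyclical(ts.weekday(), 7),
        "day_of_month": cyclical(ts.day - 1, 31),
        "day_of_year": cyclical(ts.timetuple().tm_yday - 1, 366),
    }


# Monday, 1 January 2024, 06:30
features = encode_timestamp(datetime(2024, 1, 1, 6, 30))
```

Disabling a component in the encoder would correspond to dropping its entry from this dictionary, which can help when a component carries no signal, such as the minute for daily data.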