Kumo automatically infers an encoder for each column from its dtype and stype (semantic type). To override the encoder for an individual column, set it via ColumnProcessingPlan. The encoder you specify must be compatible with the column’s semantic type.

Enums

NAStrategy

Strategy for imputing missing values.
| Value | Description |
| --- | --- |
| ZERO | Fill missing values with zero. |
| MEAN | Fill missing values with the column mean. |
| SEPARATE | Treat missing values as a separate category. |
| MOST_FREQUENT | Fill with the most frequent value. |
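
To make the four strategies concrete, here is a minimal pure-Python sketch of what each one does to a column containing missing values. It is an illustration only, not Kumo's implementation: the `impute` helper, the string strategy names, and the `"<NA>"` sentinel are all hypothetical.

```python
from collections import Counter
from statistics import mean


def impute(values, strategy):
    """Fill None entries in `values` according to `strategy` (illustrative only)."""
    present = [v for v in values if v is not None]
    if strategy == "ZERO":
        fill = 0
    elif strategy == "MEAN":
        fill = mean(present)
    elif strategy == "MOST_FREQUENT":
        fill = Counter(present).most_common(1)[0][0]
    elif strategy == "SEPARATE":
        fill = "<NA>"  # hypothetical sentinel standing in for a separate category
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


print(impute([1.0, None, 3.0], "MEAN"))                 # [1.0, 2.0, 3.0]
print(impute(["a", None, "b", "a"], "MOST_FREQUENT"))   # ['a', 'a', 'b', 'a']
```

MEAN only makes sense for numerical data, while SEPARATE is the natural choice for categorical columns, which matches the defaults listed for the encoders on this page.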

Scaler

Scaling strategy for numerical features.
| Value | Description |
| --- | --- |
| STANDARD | Z-score normalization (equivalent to scikit-learn StandardScaler). |
| MINMAX | Min-max scaling to [0, 1] (equivalent to MinMaxScaler). |
| ROBUST | Robust scaling using median and IQR (equivalent to RobustScaler). |
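
The three scalers differ mainly in how they react to outliers. The sketch below reimplements the underlying formulas with the Python standard library so the difference is visible on a small column with one extreme value. The function names are hypothetical, and the quantile convention (linear interpolation) is an assumption chosen to mirror scikit-learn's defaults rather than a statement about Kumo's internals.

```python
from statistics import mean, median, pstdev, quantiles


def standard_scale(xs):
    # Z-score: subtract the mean, divide by the population standard deviation.
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]


def minmax_scale(xs):
    # Map the column minimum to 0 and the maximum to 1.
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def robust_scale(xs):
    # Center on the median and divide by the interquartile range (Q3 - Q1).
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")
    med = median(xs)
    return [(x - med) / (q3 - q1) for x in xs]


xs = [1.0, 2.0, 3.0, 4.0, 100.0]  # one large outlier
print(robust_scale(xs))  # [-1.0, -0.5, 0.0, 0.5, 48.5]
print(minmax_scale(xs))  # the four small values are squeezed close to 0
```

With robust scaling the four typical values keep a usable spread, whereas min-max scaling compresses them into a narrow band near zero because the outlier defines the range.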

Encoders

Null

Skips encoding for the column entirely. Supported on all semantic types.
```python
from kumoai.encoder import Null

encoder = Null()
```

Numerical

Encodes a numerical column with optional scaling and missing value imputation.
```python
from kumoai.encoder import Numerical, Scaler, NAStrategy

encoder = Numerical(scaler=Scaler.STANDARD, na_strategy=NAStrategy.MEAN)
```

Parameters:

- scaler (Optional[Scaler], default: None): The scaling strategy. Kumo infers a suitable scaler if not specified.
- na_strategy (NAStrategy, default: NAStrategy.MEAN): The missing value imputation strategy.

MaxLogNumerical

Applies a log transformation after clipping values at the column maximum. Useful for heavy-tailed distributions.
```python
from kumoai.encoder import MaxLogNumerical, NAStrategy

encoder = MaxLogNumerical(na_strategy=NAStrategy.MEAN)
```

Parameters:

- na_strategy (NAStrategy, default: NAStrategy.MEAN): The missing value imputation strategy.
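
As a rough illustration of why a log transformation helps with heavy-tailed columns, the standard-library sketch below compares the spread of a column before and after applying `log1p`. It shows only the log step. It does not reproduce the clipping behavior described above, and the sample values are made up for the example.

```python
import math

# Hypothetical heavy-tailed column: most values are small, one is very large.
values = [1.0, 10.0, 100.0, 10_000.0]

# log1p(x) = log(1 + x); it stays defined at x = 0 and preserves ordering.
logged = [math.log1p(v) for v in values]

raw_ratio = max(values) / min(values)
log_ratio = max(logged) / min(logged)

print(f"max/min before log: {raw_ratio:.0f}")
print(f"max/min after log:  {log_ratio:.1f}")
```

The ordering of the values is unchanged, but the largest value no longer dominates the range by several orders of magnitude.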

MinLogNumerical

Applies a log transformation after clipping values at the column minimum.
```python
from kumoai.encoder import MinLogNumerical, NAStrategy

encoder = MinLogNumerical(na_strategy=NAStrategy.MEAN)
```

Parameters:

- na_strategy (NAStrategy, default: NAStrategy.MEAN): The missing value imputation strategy.

Index

Encodes a categorical or ID column as a learned embedding index. Values that occur fewer than min_occ times are mapped to a shared out-of-vocabulary embedding.
```python
from kumoai.encoder import Index, NAStrategy

encoder = Index(min_occ=2, na_strategy=NAStrategy.SEPARATE)
```

Parameters:

- min_occ (int, default: 1): Minimum occurrence count for a value to receive its own embedding. Values appearing fewer times are treated as out-of-vocabulary.
- na_strategy (NAStrategy, default: NAStrategy.SEPARATE): The missing value imputation strategy.
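
The effect of min_occ can be sketched in a few lines of plain Python. This is a conceptual illustration of building an index vocabulary with a shared out-of-vocabulary slot, not Kumo's implementation: `build_vocab`, `encode`, and the choice of index 0 for out-of-vocabulary values are hypothetical.

```python
from collections import Counter

OOV = 0  # hypothetical index reserved for out-of-vocabulary values


def build_vocab(values, min_occ=1):
    """Assign an index to every value seen at least `min_occ` times."""
    counts = Counter(values)
    kept = sorted(v for v, c in counts.items() if c >= min_occ)
    return {v: i + 1 for i, v in enumerate(kept)}


def encode(values, vocab):
    """Map each value to its index, falling back to the shared OOV index."""
    return [vocab.get(v, OOV) for v in values]


colors = ["red", "blue", "red", "green", "blue", "red"]
vocab = build_vocab(colors, min_occ=2)
print(vocab)                  # {'blue': 1, 'red': 2}
print(encode(colors, vocab))  # [2, 1, 2, 0, 1, 2]
```

Here "green" appears only once, so with min_occ=2 it shares the out-of-vocabulary index instead of receiving its own embedding row.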

Hash

Encodes a categorical column via feature hashing. Suitable for very high-cardinality columns.
```python
from kumoai.encoder import Hash

encoder = Hash(num_components=128)
```

Parameters:

- num_components (int, required): The number of hash buckets (output dimensionality).
- na_strategy (NAStrategy, default: NAStrategy.SEPARATE): The missing value imputation strategy.
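
Feature hashing avoids storing a vocabulary: each value is hashed and the hash is reduced modulo the number of buckets. The sketch below shows the idea with the standard library. The helper names and the choice of SHA-256 are assumptions made for the example; the hash function Kumo uses is not stated here.

```python
import hashlib


def hash_bucket(value, num_components):
    """Map a string to a stable bucket in [0, num_components)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_components


def hash_encode(value, num_components):
    """Return a one-hot vector of length `num_components` for the value's bucket."""
    vec = [0.0] * num_components
    vec[hash_bucket(value, num_components)] = 1.0
    return vec


vec = hash_encode("user_42", 128)
print(len(vec), sum(vec))  # 128 1.0
```

The output size is fixed by num_components regardless of how many distinct values the column contains. The trade-off is that unrelated values can collide in the same bucket, and collisions become more frequent as num_components shrinks.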

MultiCategorical

Encodes a multi-categorical column (a string containing multiple space- or comma-separated categories) as a bag-of-categories embedding.
```python
from kumoai.encoder import MultiCategorical

encoder = MultiCategorical(min_occ=1)
```

Parameters:

- min_occ (int, default: 1): Minimum occurrence count for a category to receive its own embedding.
- na_strategy (NAStrategy, default: NAStrategy.ZERO): The missing value imputation strategy.
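
To illustrate what "a string containing multiple space- or comma-separated categories" looks like once parsed, the sketch below splits a cell into categories and builds a multi-hot vector over a small vocabulary. It covers only the parsing and bag-of-categories step, not the learned embedding, and the helper names and example vocabulary are hypothetical.

```python
import re


def split_categories(cell):
    """Split a cell on commas and/or whitespace, dropping empty tokens."""
    return [tok for tok in re.split(r"[,\s]+", cell.strip()) if tok]


def multi_hot(cell, vocab):
    """Mark every known category present in the cell; unknown ones are ignored."""
    vec = [0] * len(vocab)
    for tok in split_categories(cell):
        if tok in vocab:
            vec[vocab[tok]] = 1
    return vec


vocab = {"action": 0, "comedy": 1, "drama": 2}
print(split_categories("action, comedy drama"))  # ['action', 'comedy', 'drama']
print(multi_hot("comedy,action", vocab))         # [1, 1, 0]
```

Because the representation is a bag, the order of categories within a cell does not affect the result.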

GloVe

Encodes a text column using pre-trained GloVe word embeddings.
```python
from kumoai.encoder import GloVe

encoder = GloVe(model_name="glove-wiki-gigaword-50", embedding_dim=50)
```

Parameters:

- model_name (str, required): The GloVe model identifier.
- embedding_dim (int, default: 50): The embedding dimensionality.
- na_strategy (NAStrategy, default: NAStrategy.ZERO): The missing value imputation strategy.

NumericalList

Encodes a sequence or embedding column (a list of floats) with optional scaling.
```python
from kumoai.encoder import NumericalList

encoder = NumericalList()
```

Parameters:

- scaler (Optional[Scaler], default: None): The scaling strategy applied to each element.
- na_strategy (NAStrategy, default: NAStrategy.ZERO): The missing value imputation strategy.

Datetime

Encodes a timestamp column by decomposing it into cyclical calendar features. Each component can be switched off individually; all are enabled by default.
```python
from kumoai.encoder import Datetime

encoder = Datetime(include_year=True, include_day_of_week=True)
```

Parameters:

- include_minute (bool, default: True): Include the minute-of-hour component.
- include_hour (bool, default: True): Include the hour-of-day component.
- include_day_of_week (bool, default: True): Include the day-of-week component.
- include_day_of_month (bool, default: True): Include the day-of-month component.
- include_day_of_year (bool, default: True): Include the day-of-year component.
- include_year (bool, default: True): Include the year component.
- num_year_periods (Optional[int], default: None): If specified, encodes year as a cyclical feature over this many periods.
- na_strategy (NAStrategy, default: NAStrategy.ZERO): The missing value imputation strategy.
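
A common way to build cyclical calendar features is to map each component onto a circle with sine and cosine, so that values at the ends of a cycle (for example 23:00 and 00:00) end up close together. The sketch below shows that technique with the standard library. It is an assumption about the general approach, not a description of the exact features Kumo produces, and `cyclical` and `encode_timestamp` are hypothetical helpers.

```python
import math
from datetime import datetime


def cyclical(value, period):
    """Project `value` (0 <= value < period) onto the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def encode_timestamp(ts):
    """Encode a few calendar components of `ts` as (sin, cos) pairs."""
    return {
        "minute": cyclical(ts.minute, 60),
        "hour": cyclical(ts.hour, 24),
        "day_of_week": cyclical(ts.weekday(), 7),  # Monday is 0
    }


features = encode_timestamp(datetime(2024, 1, 1, 6, 30))
for name, (s, c) in features.items():
    print(f"{name}: sin={s:.3f}, cos={c:.3f}")

# 23:00 is closer to 00:00 than 12:00 is, unlike with raw hour numbers.
print(math.dist(cyclical(23, 24), cyclical(0, 24)) < math.dist(cyclical(12, 24), cyclical(0, 24)))
```

Encoding each component as a pair rather than a single number is what removes the artificial jump at the wrap-around point.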