Column Preprocessing
Data Type and Semantic Type
Kumo automatically detects column types for preprocessing, but you can manually adjust them as needed. If a mismatch is detected, Kumo will provide recommendations or alert you with an invalid data type error.
Ensure that the semantic type (Type) aligns with the data type (Data Type) to avoid inconsistencies.
Supported Column Types
Kumo supports preprocessing for the following column types:
- Numerical – Integers and floats where numerical ordering is meaningful (e.g., product price, discount percentage).
- Categorical – Single-token strings or booleans with a limited set of unique values (up to 4,000 by default), such as product type or subscription status.
- Multi-Categorical – Comma-separated lists of categorical values (e.g., restaurant tags: "vegetarian, italian, pickup_only").
- ID – Unique identifiers with no numerical meaning, such as customer IDs or product group numbers.
- Text – Multi-token strings where semantic meaning is important (e.g., product descriptions, reviews).
- Timestamp – Date/time values in a valid format (preferably ISO 8601 or epoch time). For Parquet data, ensure timestamps are correctly cast to a DATE/TIME/TIMESTAMP type.
- Embedding – Lists of equal-length floats, typically representations from AI models.
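As a minimal sketch of preparing a Timestamp column before ingestion, the helpers below normalize raw date strings and epoch seconds to ISO 8601. The input formats in KNOWN_FORMATS, the function names, and the choice to treat timezone-less values as UTC are all assumptions for illustration; they are not part of Kumo.

```python
from datetime import datetime, timezone

# Hypothetical raw formats; replace with whatever your source data uses.
KNOWN_FORMATS = ["%Y-%m-%d %H:%M:%S", "%m/%d/%Y"]


def to_iso8601(raw: str) -> str:
    """Parse a raw date string and return it as an ISO 8601 string."""
    for fmt in KNOWN_FORMATS:
        try:
            parsed = datetime.strptime(raw, fmt)
        except ValueError:
            continue  # try the next known format
        # Assumption: values without a timezone are already in UTC.
        return parsed.replace(tzinfo=timezone.utc).isoformat()
    raise ValueError(f"unrecognized timestamp format: {raw!r}")


def epoch_to_iso8601(seconds: float) -> str:
    """Convert epoch seconds to an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).isoformat()
```

Raising on unrecognized input, rather than passing the value through, makes malformed timestamps visible before the table is uploaded.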
Unsupported Column Types
The following column types are not supported for preprocessing in Kumo. If needed, consider transforming them before ingestion:
- Full URLs – Extract meaningful components (e.g., domain, path) and treat them as categorical values.
- Lat/Long Coordinates – Convert to categorical geographic areas.
- IP Addresses – Remove PII, extract high-level details (e.g., subnet), and treat as categorical elements.
- Phone Numbers – Remove PII, extract relevant components (e.g., area code), and treat as categorical values.
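The transformations suggested above could be sketched as follows, using only the Python standard library. The function names, the rounding-based grid for coordinates, the /24 subnet size, and the 10-digit phone format are illustrative assumptions; choose granularity that suits your data.

```python
import ipaddress
from urllib.parse import urlparse


def url_domain(url: str) -> str:
    """Keep only the (lowercased) host of a full URL."""
    return urlparse(url).hostname or "unknown"


def geo_cell(lat: float, lon: float, precision: int = 1) -> str:
    """Bucket coordinates into a coarse grid cell label by rounding."""
    return f"{round(lat, precision)}_{round(lon, precision)}"


def ip_subnet(ip: str, prefix: int = 24) -> str:
    """Replace an IPv4 address with its enclosing subnet."""
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))


def area_code(phone: str) -> str:
    """Return the area code of a 10-digit number (assumed format)."""
    digits = "".join(ch for ch in phone if ch.isdigit())
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop a leading country code of 1
    return digits[:3] if len(digits) == 10 else "unknown"
```

Each helper returns a short string that can then be declared as a Categorical column, while the original raw value is left out of the ingested table.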
Handling Nested or Complex Data
Kumo does not support nested schemas, arrays, or maps. To use such data, transform it into a string format:
Example: Converting an array to a string
Before: ["TV", "electronics", "promotion"]
After: "TV, electronics, promotion"
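A minimal sketch of this conversion in Python is shown below. array_to_string reproduces the example above; flatten_record is an additional, hypothetical helper that also expands nested maps into separate flat columns, which is one possible approach rather than something Kumo prescribes.

```python
def array_to_string(values) -> str:
    """Join a flat list into a comma-separated string."""
    return ", ".join(str(v) for v in values)


def flatten_record(record: dict, prefix: str = "") -> dict:
    """Flatten nested maps into prefixed columns and arrays into strings."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Nested map: recurse, joining key names with an underscore.
            flat.update(flatten_record(value, prefix=f"{name}_"))
        elif isinstance(value, (list, tuple)):
            flat[name] = array_to_string(value)
        else:
            flat[name] = value
    return flat
```

For instance, a record with a "tags" array and a "dims" map containing "w" and "h" would come out with a string-valued "tags" column plus "dims_w" and "dims_h" columns.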
Column Properties
Primary Key Column
Each row should have a unique Primary key (e.g., user_id). If multiple rows share the same key, only one is retained and the rest are dropped.
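Because the text does not say which of the duplicate rows is kept, it may be safer to deduplicate before ingestion so the choice is explicit. The sketch below keeps the most recent row per key; the updated_at column and the function name are hypothetical, and this is not a description of Kumo's own deduplication.

```python
def dedupe_by_key(rows, key="user_id", order_by="updated_at"):
    """Keep one row per key: the one with the largest order_by value."""
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        if current is None or row[order_by] > current[order_by]:
            latest[row[key]] = row
    return list(latest.values())


rows = [
    {"user_id": 1, "updated_at": "2024-01-01", "plan": "free"},
    {"user_id": 1, "updated_at": "2024-03-01", "plan": "pro"},
    {"user_id": 2, "updated_at": "2024-02-01", "plan": "free"},
]
unique_rows = dedupe_by_key(rows)
```

Here unique_rows holds one row for each user_id, with the later "pro" row kept for user 1.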
Create Date Column
The Create date column represents when a row was created or when the data became valid. This helps define training timelines and ensures predictions use the correct time-based data.
End Date Column
The End date column restricts training and predictions to a specific timeframe.
- For temporal tasks, training will include only data valid within this timeframe.
- For batch predictions, only rows where the Create date is on or before the prediction time and the End date is after the prediction time will be used.
Example Use Case: End Date for Product Availability
If a product goes out of stock on a particular date, set End date to the column tracking this date.
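To picture how a validity window behaves for the product example, the sketch below selects the rows that are valid at a given time. It assumes a row is valid from its Create date up to, but not including, its End date, and that a missing End date means the row is still valid. These boundary rules and the column names are assumptions for illustration, not a statement of Kumo's exact behavior.

```python
from datetime import date


def rows_valid_at(rows, as_of, create_col="create_date", end_col="end_date"):
    """Return rows whose [create, end) window contains as_of."""
    valid = []
    for row in rows:
        end = row.get(end_col)  # None is treated as "no end date yet"
        if row[create_col] <= as_of and (end is None or as_of < end):
            valid.append(row)
    return valid


products = [
    {"sku": "A", "create_date": date(2024, 1, 1), "end_date": date(2024, 5, 1)},
    {"sku": "B", "create_date": date(2024, 2, 1), "end_date": None},
    {"sku": "C", "create_date": date(2024, 7, 1), "end_date": None},
]
available = rows_valid_at(products, as_of=date(2024, 6, 1))
```

With these sample rows, product A is excluded because it went out of stock before the chosen date, and product C is excluded because it had not been created yet.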