Column Preprocessing

Data Type and Semantic Type

Kumo automatically detects column types for preprocessing, but you can manually adjust them as needed. If a mismatch is detected, Kumo will provide recommendations or alert you with an invalid data type error. Ensure that the semantic type (Type) aligns with the data type (Data Type) to avoid inconsistencies.

Supported Column Types

Kumo supports preprocessing for the following column types:

Numerical – Integers and floats where numerical ordering is meaningful (e.g., product price, discount percentage).
Categorical – Single-token strings or booleans with a limited set of unique values (up to 4,000 by default), such as product type or subscription status.
Multi-Categorical – Comma-separated lists of categorical values (e.g., restaurant tags: "vegetarian, italian, pickup_only").
ID – Unique identifiers with no numerical meaning, such as customer IDs or product group numbers.
Text – Multi-token strings where semantic meaning is important (e.g., product descriptions, reviews).
Timestamp – Date/time values in a valid format (preferably ISO 8601 or epoch time). For Parquet data, ensure timestamps are correctly cast to a DATE/TIME/TIMESTAMP type.
Embedding – Lists of equal-length floats, typically representations from AI models.
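The Timestamp entry above prefers ISO 8601 or epoch time. As a minimal sketch of normalizing mixed values before ingestion, the helper below converts either form to an ISO 8601 UTC string. The function name `to_iso8601` is hypothetical and not part of Kumo. It assumes numeric inputs are epoch seconds (not milliseconds) and that strings without a UTC offset are already in UTC.

```python
from datetime import datetime, timezone


def to_iso8601(value):
    """Return an ISO 8601 UTC string for an epoch number or an ISO 8601 string."""
    if isinstance(value, (int, float)):
        # Assumption: numeric values are epoch seconds, not milliseconds.
        dt = datetime.fromtimestamp(value, tz=timezone.utc)
    else:
        dt = datetime.fromisoformat(value)
        if dt.tzinfo is None:
            # Assumption: strings without an offset are already in UTC.
            dt = dt.replace(tzinfo=timezone.utc)
        else:
            # Convert any other offset to UTC so all values are comparable.
            dt = dt.astimezone(timezone.utc)
    return dt.isoformat()


print(to_iso8601(0))
print(to_iso8601("2024-03-01T12:30:00+02:00"))
```

If the source stores epoch milliseconds, divide by 1000 before calling the helper.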

Unsupported Column Types

The following column types are not supported for preprocessing in Kumo. If needed, consider transforming them before ingestion:

Full URLs – Extract meaningful components (e.g., domain, path) and treat them as categorical values.
Lat/Long Coordinates – Convert to categorical geographic areas.
IP Addresses – Remove PII, extract high-level details (e.g., subnet), and treat as categorical elements.
Phone Numbers – Remove PII, extract relevant components (e.g., area code), and treat as categorical values.
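The transformations listed above could be sketched in Python roughly as follows. All four function names are hypothetical helpers, not Kumo APIs. The grid size (`cell_deg`), the /24 prefix for IPv4 addresses, and the North American phone format are illustrative assumptions that should be adapted to the data at hand.

```python
import ipaddress
import math
from urllib.parse import urlsplit


def url_domain(url):
    """Return the lower-cased host of a full URL, or "" if none is present."""
    return (urlsplit(url).hostname or "").lower()


def geo_cell(lat, lon, cell_deg=1.0):
    """Bucket a coordinate into a square grid cell label such as "37_-123"."""
    row = math.floor(lat / cell_deg)
    col = math.floor(lon / cell_deg)
    return f"{row}_{col}"


def ip_subnet(ip, prefix=24):
    """Replace an IPv4 address with its enclosing subnet (assumed /24)."""
    # strict=False lets ip_network mask off the host bits for us.
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))


def phone_area_code(phone):
    """Return the 3-digit area code, assuming a North American number."""
    digits = "".join(ch for ch in phone if ch.isdigit())
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the country code
    return digits[:3] if len(digits) == 10 else ""


print(url_domain("https://Shop.Example.com/products/tv?id=7"))
print(geo_cell(37.77, -122.42))
print(ip_subnet("192.168.10.57"))
print(phone_area_code("+1 (415) 555-0132"))
```

Each helper returns a short string, so the resulting column can be typed as Categorical.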

Handling Nested or Complex Data

Kumo does not support nested schemas, arrays, or maps. To use such data, transform it into a string format.

Example: Converting an array to a string

Before: ["TV", "electronics", "promotion"]
After: "TV, electronics, promotion"
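A minimal Python sketch of the array-to-string conversion, assuming the result should be a comma-separated string. The helper name `flatten_list` is hypothetical. It rejects items that already contain a comma, since those would be split into extra values once the string is read back as a list.

```python
def flatten_list(values, sep=", "):
    """Join a flat list of values into one comma-separated string."""
    items = [str(v).strip() for v in values]
    for item in items:
        if "," in item:
            # A comma inside an item would be read as a value boundary.
            raise ValueError(f"item contains a comma: {item!r}")
    return sep.join(items)


print(flatten_list(["TV", "electronics", "promotion"]))
```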

Column Properties

Primary Key Column

Each row should have a unique primary key (e.g., user_id). If multiple rows share the same key, only one will be retained, and the rest will be dropped.
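Because the text does not say which of the duplicate rows is retained, it can be safer to deduplicate before ingestion so the choice is explicit. The sketch below keeps the first row seen for each key. The function name `dedupe_by_key` and the keep-first rule are assumptions for illustration, not Kumo's documented behavior.

```python
def dedupe_by_key(rows, key):
    """Keep the first row for each primary-key value, preserving input order."""
    first_seen = {}
    for row in rows:
        # setdefault stores the row only if this key has not appeared yet.
        first_seen.setdefault(row[key], row)
    return list(first_seen.values())


users = [
    {"user_id": 1, "plan": "free"},
    {"user_id": 2, "plan": "pro"},
    {"user_id": 1, "plan": "pro"},  # duplicate key, dropped by the sketch
]
print(dedupe_by_key(users, "user_id"))
```

To keep the most recent row instead, sort the rows by a timestamp in descending order before calling the helper.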

Create Date Column

The Create date column represents when a row was created or when the data became valid. This helps define training timelines and ensures predictions use the correct time-based data.

End Date Column

The End date column records when a row stops being valid, which restricts training and predictions to the timeframe in which the row applies.

For temporal tasks, training will include only data valid within this timeframe.
For batch predictions, only rows where the Create date is on or before the prediction time and the End date is after the prediction time will be used.

Example Use Case: End Date for Product Availability

If a product goes out of stock on a particular date, set End date to the column tracking this date.
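To make the validity window concrete, the sketch below selects the rows that would be in effect at a given prediction time. It assumes a row counts as valid from its Create date up to, but not including, its End date, and that a missing End date means the row is still valid. The function name `valid_rows` and the column names `create_date` and `end_date` are hypothetical.

```python
from datetime import datetime


def valid_rows(rows, prediction_time, create_col="create_date", end_col="end_date"):
    """Return rows created on or before prediction_time and not yet ended."""
    kept = []
    for row in rows:
        created = row[create_col]
        ended = row.get(end_col)  # None is treated as "no end date" (assumption)
        if created <= prediction_time and (ended is None or ended > prediction_time):
            kept.append(row)
    return kept


products = [
    {"sku": "A", "create_date": datetime(2024, 1, 1), "end_date": None},
    {"sku": "B", "create_date": datetime(2024, 1, 1), "end_date": datetime(2024, 2, 1)},
    {"sku": "C", "create_date": datetime(2024, 4, 1), "end_date": None},
]
in_effect = valid_rows(products, datetime(2024, 3, 1))
print([p["sku"] for p in in_effect])
```

Here product B went out of stock before the prediction time and product C did not exist yet, so only product A is selected.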
