Kumo supports the ingestion of various data source types by allowing you to configure connectors for the following:
AWS S3 (CSV or Parquet files)
Snowflake - Tables and Views
Databricks - Tables and Views
Google Cloud BigQuery - Tables
You also have the option of uploading a local file (CSV or Parquet files less than 1 GB) for ingestion into Kumo. In this case, you can skip connector creation and create a Kumo table directly by selecting Local File Upload.
In terms of data preprocessing, Kumo automatically preprocesses several data types when creating your Kumo table columns, including:
Numerical: Integers, floats
Categorical: Boolean values, or string values that are typically a single token in length
Text: String values typically multiple tokens in length, where the actual language content of the value has semantic meaning
Multi-categorical: Concatenation of multiple categories under a single string representation
ID: Numerical values used to uniquely identify different entities
Timestamp: Time/date information (for extracting year/month/day/hour/minute when applicable)
Embeddings: Lists of floats, all of equal length, typically the output of another AI model
Column types are automatically detected using heuristics on the distribution of values in each column’s data, and can also be manually configured.
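To make the idea of heuristic type detection concrete, here is a minimal Python sketch that guesses a column type from a sample of values. It is not Kumo's actual detection logic: the function name infer_column_type, the "|" delimiter for multi-categorical values, the three-token threshold for text, and the "all-unique integers look like IDs" rule are all assumptions chosen for illustration.

```python
from datetime import datetime


def _is_timestamp(value):
    """Return True if the string parses as an ISO 8601 date or datetime."""
    try:
        datetime.fromisoformat(value)
        return True
    except ValueError:
        return False


def infer_column_type(values, delimiter="|", text_token_threshold=3):
    """Guess a column type from sample values (illustrative heuristic only)."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return "unknown"

    # Embeddings: every value is a non-empty list of floats, all of equal length.
    if all(isinstance(v, list) and v and all(isinstance(x, float) for x in v)
           for v in non_null):
        lengths = {len(v) for v in non_null}
        return "embedding" if len(lengths) == 1 else "unknown"

    # Booleans are treated as categorical. Check before numbers,
    # because bool is a subclass of int in Python.
    if all(isinstance(v, bool) for v in non_null):
        return "categorical"

    if all(isinstance(v, (int, float)) and not isinstance(v, bool)
           for v in non_null):
        # Assumption: integers that never repeat look like identifiers.
        all_ints = all(isinstance(v, int) for v in non_null)
        all_unique = len(set(non_null)) == len(non_null)
        if all_ints and all_unique and len(non_null) > 1:
            return "ID"
        return "numerical"

    if all(isinstance(v, str) for v in non_null):
        if all(_is_timestamp(v) for v in non_null):
            return "timestamp"
        # Assumption: multi-categorical values are joined with a delimiter.
        if any(delimiter in v for v in non_null):
            return "multi-categorical"
        # Longer strings are assumed to carry free-form language content.
        mean_tokens = sum(len(v.split()) for v in non_null) / len(non_null)
        return "text" if mean_tokens > text_token_threshold else "categorical"

    return "unknown"


if __name__ == "__main__":
    samples = {
        "price": [3.2, 1.0, 3.2],
        "user_id": [101, 102, 103],
        "color": ["red", "blue", "red"],
        "signup": ["2024-01-05", "2024-02-10 13:45:00"],
    }
    for name, column in samples.items():
        print(name, "->", infer_column_type(column))
```

A heuristic like this can misclassify edge cases (for example, a small sample of unique integer counts would be labeled as an ID), which is one reason manual configuration of column types remains useful.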