- AWS S3 - CSV or Parquet files
- Snowflake - Tables and Views
- Databricks - Tables and Views
- Google Cloud BigQuery - Tables
- Numerical: Integers, floats
- Categorical: Boolean or string values, typically a single token in length
- Text: String values, typically multiple tokens in length, whose language content carries semantic meaning
- Multi-categorical: Multiple categories concatenated into a single string representation
- ID: Numerical values used to uniquely identify different entities
- Timestamp: Date and time information (used to extract year, month, day, hour, and minute where applicable)
- Embeddings: Lists of floats, all of equal length, typically the output of another AI model
- Column types are detected automatically using heuristics on the distribution of values in each column, and can also be configured manually.
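
To make the idea of heuristic type detection concrete, here is a minimal sketch of how a column's values might be mapped to the types listed above. It is an illustration only, not the product's actual detection logic: the function name `infer_column_type`, the `|` separator for multi-categorical values, and every threshold are assumptions chosen for the example, and the input is assumed to be already-parsed Python values.

```python
from datetime import datetime


def _is_timestamp(s):
    """Return True if the string parses as an ISO 8601 date or datetime."""
    try:
        datetime.fromisoformat(s)
        return True
    except ValueError:
        return False


def infer_column_type(values, sep="|"):
    """Guess a column type from its values (hypothetical heuristics)."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return "unknown"

    # Embeddings: every value is a list of floats, all of equal length.
    if all(isinstance(v, (list, tuple)) for v in non_null):
        same_length = len({len(v) for v in non_null}) == 1
        all_floats = all(isinstance(x, float) for v in non_null for x in v)
        return "embedding" if same_length and all_floats else "unknown"

    # Booleans are checked before numbers because bool is a subclass of int.
    if all(isinstance(v, bool) for v in non_null):
        return "categorical"

    if all(isinstance(v, (int, float)) for v in non_null):
        # Assumption: an integer column with no repeated values is an ID.
        all_ints = all(isinstance(v, int) for v in non_null)
        if all_ints and len(non_null) > 1 and len(set(non_null)) == len(non_null):
            return "id"
        return "numerical"

    strings = [str(v) for v in non_null]
    if all(_is_timestamp(s) for s in strings):
        return "timestamp"

    # Assumption: at least half of the values containing the separator
    # marks the column as multi-categorical.
    if sum(sep in s for s in strings) * 2 >= len(strings):
        return "multi_categorical"

    # Single-token strings are treated as categories, longer ones as text.
    avg_tokens = sum(len(s.split()) for s in strings) / len(strings)
    return "text" if avg_tokens > 1 else "categorical"
```

For example, `infer_column_type(["red", "blue", "red"])` returns `"categorical"`, while `infer_column_type([101, 102, 103])` returns `"id"`. Rules this simple will misclassify some columns (a unique integer measurement looks like an ID here), which is why a manual override remains useful.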