AWS S3
Kumo supports reading CSV and Apache Parquet files from AWS S3. The connector is configured to read from a specific top-level directory, which can contain either files or second-level directories. Each file or second-level directory represents a single table in Kumo.
Directory Configuration
-
If the top-level directory contains files, Kumo will ignore any subdirectories and ingest only the files as tables.
-
If the top-level directory contains subdirectories, each second-level directory will be treated as a separate table, subject to the following rules:
-
All files within a subdirectory must be of the same type (either CSV or Parquet).
-
All files must share the same column schema (for CSV, this means identical headers).
-
Subdirectories must not contain additional nested directories—only CSV or Parquet files are allowed.
-
The resulting table will combine all rows from all files in the subdirectory.
-
Example Storage Structure
Here’s an example of a second-level storage structure for reference:
For optimal performance, please limit the number of tables from the root directory to 30.
Data Format and Size Considerations
Parquet Files
-
Ensure columns are cast to their proper data types (e.g., timestamp for dates) to streamline data ingestion and validation.
-
Hive partitioned tables are not supported.
-
The total data size across all partitions must not exceed 512MB.
CSV Files
-
A CSV table can have up to 128 partitions.
-
The total data size across all partitions must not exceed 10GB.
Converting your dataset to Parquet is recommended whenever your dataset is large.
Granting Access
To allow Kumo to access your data, update your S3 bucket policy by replacing:
-
<data bucket name>
with your actual S3 bucket name. -
{% $customerId %}
with your company name.
For step-by-step instructions, refer to AWS’s Add Bucket Policy guide or reach out to Kumo Support for assistance.
Once the policy is applied, share the details with Kumo Support to finalize access configuration.