Once you’ve initialized the SDK, the first step to working with your data is defining a connector to your source tables. The Kumo SDK supports connectors to data on Amazon S3 (S3Connector), Snowflake (SnowflakeConnector), and Databricks (DatabricksConnector). Here, we work with data on S3, but equivalent steps apply to the other supported data warehouses. Connecting multiple tables across multiple connectors is supported (for example, you can use S3 and Snowflake together).
If you are using the Kumo Snowpark Container Services edition, only SnowflakeConnector is supported.

Creating a Connector

Creating a connector to a dataset on S3 is as simple as specifying the root directory of your data:
connector = kumo.S3Connector(root_dir="s3://kumo-public-datasets/customerltv_mini/")
Tables can be accessed with Python indexing semantics, or with the table() method:
# Access the 'customer' table by indexing into the connector:
customer_src = connector['customer']

# Access the 'transaction' table by explicitly calling the `.table`
# method on the connector:
transaction_src = connector.table('transaction')

# Create a connector without a root directory, and obtain a table by
# passing the full table path:
stock_src = kumo.S3Connector().table('s3://kumo-public-datasets/customerltv_mini/stock')

Inspecting Source Tables

The tables customer_src, transaction_src and stock_src are objects of type SourceTable, which support basic operations to verify the types and raw data you have connected to Kumo. Some examples include viewing a sample of the source data (as a pandas.DataFrame) or viewing the source columns and their data types:
print(customer_src.head())
>>
    CustomerID
428    16909.0
312    14002.0
306    17101.0
141    13385.0
273    14390.0

print(len(transaction_src.columns))
>> 8
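Because head() returns a pandas.DataFrame, standard pandas inspection applies to the sample. A minimal sketch using a stand-in frame with made-up values (not the real customer table):

```python
import pandas as pd

# Stand-in for the DataFrame returned by customer_src.head()
# (illustrative values, not the real dataset).
sample = pd.DataFrame({"CustomerID": [16909.0, 14002.0, 17101.0]})

# The usual pandas checks work on the sample, e.g. the inferred dtype
# and a count of missing IDs:
print(sample["CustomerID"].dtype)         # float64
print(sample["CustomerID"].isna().sum())  # 0
```

This is a convenient way to sanity-check column types before defining your Kumo tables.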
For tables with semantically meaningful text columns, Kumo supports a language model integration that lets models leverage powerful large language model embeddings, e.g. from OpenAI’s GPT. Please see add_llm() for more details.

Data Transformations

Alongside viewing raw source table data, you can also perform data transformations with your own data platform and use the results directly with the Kumo SDK. For example, with pyspark:
from pyspark.sql.functions import col

root_dir = "s3://kumo-public-datasets/customerltv_mini/"
output_dir = ...  # An output directory that you can write to

# Perform the transformation with Spark (root_dir already ends in "/"):
spark.read.parquet(f"{root_dir}transaction") \
    .withColumn("TotalPrice", col("Quantity") * col("UnitPrice")) \
    .write.mode("overwrite").parquet(f"{output_dir}/transaction_altered/")

# Access the altered table from a connector rooted at the output directory:
assert kumo.S3Connector(root_dir=output_dir).has_table("transaction_altered")
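The derived column itself is simple arithmetic, and the same logic can be sketched in pandas if Spark is unavailable. The rows below are illustrative, not the real transaction table:

```python
import pandas as pd

# Illustrative rows mirroring the 'transaction' table's Quantity and
# UnitPrice columns (made-up values).
transactions = pd.DataFrame({
    "Quantity": [2, 5, 1],
    "UnitPrice": [3.50, 1.20, 9.99],
})

# Same derivation as the Spark job above: TotalPrice = Quantity * UnitPrice.
transactions["TotalPrice"] = transactions["Quantity"] * transactions["UnitPrice"]
print(transactions["TotalPrice"].tolist())  # [7.0, 6.0, 9.99]
```

Writing the transformed frame back out (e.g. to Parquet on S3) makes it available to a connector just like the Spark output.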

Uploading Local Tables

For local files, you can use upload_table() to upload Parquet or CSV files directly to Kumo. Files larger than 1GB are supported by default through automatic partitioning. Once uploaded, tables are accessible via a FileUploadConnector.
from kumoai.connector import upload_table

# Upload local file (supports >1GB automatically)
upload_table(name="my_table", path="/path/to/local/file.parquet")

# Access uploaded table
connector = kumo.FileUploadConnector(file_type="parquet")
my_table_src = connector["my_table"]
Key parameters: name (the table name), path (the local file path), auto_partition (defaults to True for files larger than 1GB), and partition_size_mb (defaults to 250MB).