# Data Requirements

> Data types, semantic types, and requirements for KumoRFM

This guide outlines the data requirements and best practices for working with `KumoRFM` (Kumo Relational Foundation Model). Understanding these requirements is essential for creating high-quality datasets that maximize `KumoRFM`'s predictive capabilities.

## Introduction

`KumoRFM` operates on relational data organized as interconnected tables that form a graph. The process starts with a set of `pandas.DataFrame` objects, which are transformed into `LocalTable` objects and assembled into a `Graph`. Careful data preparation at this stage directly affects model performance and the reliability of predictions.

### Key Terms and Concepts

Before diving into the technical details, it's important to understand the key terms used throughout this guide:

* **pandas DataFrame**\
  A two-dimensional labeled data structure in `pandas`, similar to a spreadsheet or SQL table. Data frames are the starting point for all `KumoRFM` data preparation workflows. A collection of data frames connected by primary key/foreign key relationships defines a relational database.

* **pandas dtype**\
  The data type of a `pandas.Series` or `pandas.DataFrame` column (*e.g.*, `int64`, `float64`, `object`, `bool`). These represent how `pandas` stores and processes the data internally.

* **Kumo Dtype** (`kumoai.Dtype`)\
  KumoRFM's representation of physical data storage types (*e.g.*, `Dtype.int`, `Dtype.string`, `Dtype.float`). These are mapped from `pandas` dtypes and determine how data is processed by the foundation model.

* **Kumo Stype** (`kumoai.Stype`)\
  Semantic types that define how the data should be interpreted by the foundation model (*e.g.*, `Stype.numerical`, `Stype.categorical`, `Stype.ID`). These determine what preprocessing and modeling techniques are applied to each column.

* **LocalTable** (`LocalTable`)\
  A wrapper around a `pandas.DataFrame` that includes metadata such as column types, the primary key, and the time column. Each table can have at most one primary key and at most one time column, but it can contain many foreign keys (primary keys of other tables). A `LocalTable` is the fundamental building block of a `KumoRFM` graph.

* **Graph** (`Graph`)\
  A collection of interconnected `LocalTable` objects representing the relational structure of your data. The `Graph` defines how tables relate to each other through primary/foreign key relationships. How we connect the tables is a modeling decision that is important for the performance of the foundation model.
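
The primary/foreign key relationships described above can be sanity-checked in plain `pandas` before any tables are built. The following is a minimal sketch, not part of the `KumoRFM` API: the helper `check_keys` and the sample data are hypothetical and only illustrate what a healthy key pair looks like (a unique, non-null primary key and foreign keys that resolve to existing rows).

```python
import pandas as pd


def check_keys(parent: pd.DataFrame, pkey: str,
               child: pd.DataFrame, fkey: str) -> dict:
    """Report basic primary/foreign key health for one parent-child pair.

    Hypothetical helper for illustration; not a KumoRFM function.
    """
    pk = parent[pkey]
    fk = child[fkey].dropna()
    return {
        "pkey_has_nulls": bool(pk.isna().any()),
        "pkey_is_unique": bool(pk.is_unique),
        # Fraction of non-null foreign keys that point at an existing parent row.
        "fkey_match_rate": float(fk.isin(pk).mean()) if len(fk) else 1.0,
    }


users = pd.DataFrame({"user_id": [1, 2, 3]})
transactions = pd.DataFrame({
    "transaction_id": [10, 11, 12, 13],
    "user_id": [1, 2, 1, 4],  # user 4 does not exist in `users`
})

report = check_keys(users, "user_id", transactions, "user_id")
print(report)
```

A match rate below 1.0 means some child rows reference a parent that is missing, which is worth investigating before linking the two tables.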

Understanding the distinction between **dtype** (physical storage) and **stype** (semantic meaning) is crucial: a column with `Dtype.string` could have `Stype.categorical` (for category labels) or `Stype.text` (for natural language), leading to completely different preprocessing approaches. The other important modeling decision is the structure of the graph. It affects both the performance of `KumoRFM` on the data and which predictions can be defined with PQL (see [Make Predictions](/rfm/make-predictions)).
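
To make the dtype/stype distinction concrete, the sketch below looks at two columns that share the same physical storage (strings) but call for different semantic types. The function `guess_string_semantics` and its thresholds are assumptions made up for this illustration; it does not reproduce how `KumoRFM` infers stypes.

```python
import pandas as pd


def guess_string_semantics(col: pd.Series, max_categories: int = 20) -> str:
    """Hypothetical heuristic: few distinct, short values look categorical;
    anything else is treated as free text."""
    values = col.dropna().astype(str)
    if values.empty:
        return "unknown"
    # Average number of whitespace-separated words per value.
    avg_words = values.str.split().str.len().mean()
    if values.nunique() <= max_categories and avg_words <= 3:
        return "categorical"
    return "text"


df = pd.DataFrame({
    "plan": ["free", "pro", "free", "enterprise"],
    "review": [
        "Setup was quick and the dashboard is easy to read.",
        "Support took several days to answer my ticket.",
        "Works well, but exports are slow on large tables.",
        "I would recommend it to other small teams.",
    ],
})

# Both columns hold strings, yet they suggest different semantic types.
guesses = {name: guess_string_semantics(df[name]) for name in df.columns}
print(guesses)
```

A heuristic like this is only a starting point; the semantic type of each column should be reviewed against what the column actually means.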

## Data Connectors

`KumoRFM` supports multiple ways to load data into a `Graph`:

| Backend                 | Best For                                   | Entry Point               |
| ----------------------- | ------------------------------------------ | ------------------------- |
| **LocalTable** (pandas) | Small to medium datasets already in memory | `Graph.from_data()`       |
| **SQLite**              | File-based databases, prototyping          | `Graph.from_sqlite()`     |
| **DuckDB**              | In-process analytics, large local files    | `Graph.from_duckdb()`     |
| **Databricks**          | Databricks SQL warehouses (Unity Catalog)  | `Graph.from_databricks()` |
| **Snowflake**           | Enterprise data warehouses                 | `Graph.from_snowflake()`  |
| **RelBench**            | Benchmarking and experimentation           | `Graph.from_relbench()`   |

See the connector-specific pages for [Snowflake](/rfm/connectors/snowflake) and [SQLite](/rfm/connectors/sqlite) for installation and usage details.

## Guide Structure

This guide is organized into focused sections for easy navigation:

* `Data Types & Semantic Types`
* `Table Definitions`
* `Graph Definitions`
* `Best Practices`
* `Snowflake`
* `SQLite`
* `DuckDB`
* `Databricks`
* `RelBench`

## Getting Started

For a complete end-to-end workflow, here's the typical process:

```python
import pandas as pd
import kumoai.rfm as rfm

# 1. Prepare your pandas DataFrames with proper dtypes
df_users = pd.DataFrame({
    'user_id': pd.Series([1, 2, 3], dtype='int64'),
    'name': pd.Series(['Alice', 'Bob', 'Charlie'], dtype='string'),
    'age': pd.Series([25, 30, 35], dtype='int32')
})

df_transactions = pd.DataFrame({
    'transaction_id': pd.Series([1, 2, 3], dtype='int64'),
    'user_id': pd.Series([1, 2, 1], dtype='int64'),
    'amount': pd.Series([100.0, 250.0, 75.0], dtype='float64'),
    'timestamp': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03'])
})

# 2. Create graph (automatically creates tables and infers metadata)
graph = rfm.Graph.from_data({
    'users': df_users,
    'transactions': df_transactions
})

# 3. Validate the graph
graph.validate()

# 4. Use with KumoRFM
model = rfm.KumoRFM(graph)

model.predict("PREDICT users.age FOR users.user_id=1")
```

This example demonstrates the core workflow. For detailed explanations of each step, refer to the sections listed under Guide Structure.
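
Step 1 of the example assumes the data frames already carry sensible dtypes. Data loaded from CSV files or loosely typed sources often arrives with every column stored as strings, so a small coercion pass can help before the graph is created. The sketch below uses only standard `pandas` functions; the column names mirror the `transactions` table above, and the raw values are made up for illustration.

```python
import pandas as pd

# Raw input where every column was read as a string.
raw = pd.DataFrame({
    "transaction_id": ["1", "2", "3"],
    "user_id": ["1", "2", "1"],
    "amount": ["100.0", "250.0", "75.0"],
    "timestamp": ["2023-01-01", "2023-01-02", "2023-01-03"],
})

# Coerce each column to the dtype that matches its meaning.
clean = raw.assign(
    transaction_id=pd.to_numeric(raw["transaction_id"]).astype("int64"),
    user_id=pd.to_numeric(raw["user_id"]).astype("int64"),
    amount=pd.to_numeric(raw["amount"]).astype("float64"),
    timestamp=pd.to_datetime(raw["timestamp"]),
)

print(clean.dtypes)
```

The resulting `clean` data frame can then be used in place of `df_transactions` in the workflow above.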
