
# Best Practices

> Best practices for data preparation with KumoRFM

This section outlines best practices for preparing high-quality datasets for `KumoRFM`, drawing on `Table` and `Graph` design patterns.

## Table Structure and Entity Design

1. **One entity or event per table:**

   ```python
   # Good: Separate tables for different entities
   users = df[['user_id', 'user_name', 'user_email']]
   transactions = df[['transaction_id', 'user_id', 'amount', 'timestamp']]

   # Avoid: Mixing entities in one table
   mixed_data = df[['user_id', 'user_name', 'transaction_id', 'amount']]
   ```

2. **Single time column per table:**

   ```python
   # Good: Split multiple timestamps into separate tables, one time column each
   policies = df[['policy_id', 'user_id', 'start_date', 'policy_type']]
   claims = df[['claim_id', 'policy_id', 'claim_date', 'claim_amount']]

   # Avoid: Multiple time columns in one table
   mixed_times = df[['policy_id', 'start_date', 'claim_date', 'end_date']]
   ```

3. **Handle many-to-many relationships with junction tables:**

   ```python
   # Good: Junction table pattern
   users = df[['user_id', 'user_name']]
   skills = df[['skill_id', 'skill_name']]
   user_skills = df[['user_skill_id', 'user_id', 'skill_id', 'proficiency']]
   ```
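
When the source data arrives as a single denormalized export, the table rules above can be applied with plain pandas. The following is a minimal sketch, assuming a hypothetical `raw` frame with one row per transaction and repeated user fields; the column names are illustrative only.

```python
import pandas as pd

# Hypothetical denormalized export: one row per transaction, user fields repeated.
raw = pd.DataFrame({
    "user_id": [1, 1, 2],
    "user_name": ["Alice", "Alice", "Bob"],
    "transaction_id": [10, 11, 12],
    "amount": [100.0, 50.0, 200.0],
    "timestamp": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03"]),
})

# Entity table: one row per user, so drop the repeated user rows.
users = (
    raw[["user_id", "user_name"]]
    .drop_duplicates(subset="user_id")
    .reset_index(drop=True)
)

# Event table: one row per transaction, keeping only the foreign key to users
# and a single time column.
transactions = raw[["transaction_id", "user_id", "amount", "timestamp"]].copy()
```

Each resulting table describes exactly one entity or event, and the event table carries a single time column.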

## Data Preparation

1. **Set dtypes at the `pandas.DataFrame` level before creating tables:**

   ```python
   import pandas as pd
   # Assumed import path for the `rfm` alias; adjust to your SDK version
   import kumoai.experimental.rfm as rfm

   # Good: Set proper pandas dtypes first
   df['user_id'] = df['user_id'].astype('int64')
   df['category'] = df['category'].astype('string')
   df['timestamp'] = pd.to_datetime(df['timestamp'])
   table = rfm.LocalTable(df, "my_table")
   ```

2. **Use consistent naming conventions:**

   ```python
   # Good: Consistent foreign key naming
   users.user_id, transactions.user_id, profiles.user_id

   # Avoid: Inconsistent naming
   users.id, transactions.uid, profiles.customer_id
   ```

3. **Ensure unique primary keys:**

   ```python
   # Validate that the primary key has no missing values and is unique
   assert df['user_id'].notna().all(), "user_id contains missing values"
   assert df['user_id'].nunique() == len(df), "user_id contains duplicates"
   # Otherwise, Kumo will automatically drop duplicates internally!
   ```
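
For pipelines where a failing `assert` is too abrupt, the same checks can be collected into a report. This is a minimal sketch; `check_primary_key` is a hypothetical helper written for illustration and is not part of the Kumo SDK.

```python
import pandas as pd


def check_primary_key(df: pd.DataFrame, column: str) -> list[str]:
    """Return a list of problems found when using `column` as a primary key."""
    problems = []
    null_count = int(df[column].isna().sum())
    if null_count:
        problems.append(f"{null_count} null value(s) in '{column}'")
    # Count repeated values among the non-null entries only.
    dup_count = int(df[column].dropna().duplicated().sum())
    if dup_count:
        problems.append(f"{dup_count} duplicate value(s) in '{column}'")
    return problems


# Example frame with one missing and one repeated key.
users = pd.DataFrame({"user_id": [1, 2, 2, None], "name": ["a", "b", "c", "d"]})
issues = check_primary_key(users, "user_id")
for issue in issues:
    print(issue)
```

An empty list means the column is safe to use as a primary key.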

## Semantic Type Assignment

1. **Choose meaningful semantic types:**

   ```python
   # IDs should use ID stype
   table['user_id'].stype = 'ID'

   # Text descriptions should use text stype
   table['description'].stype = 'text'

   # Limited categories should use categorical stype
   table['status'].stype = 'categorical'

   # Numerical measurements should use numerical stype
   table['amount'].stype = 'numerical'

   # Important: Validate metadata before proceeding:
   print(table.metadata)
   ```
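
Before assigning semantic types by hand, it can help to produce a first guess per column and review it. The sketch below is a rough heuristic only: `suggest_stype` and its `max_categories` threshold are hypothetical, it does not reproduce how Kumo infers types, and time columns are deliberately left out.

```python
from typing import Optional

import pandas as pd


def suggest_stype(series: pd.Series, max_categories: int = 20) -> Optional[str]:
    """Hypothetical helper: guess a semantic type to review (not a Kumo API)."""
    name = str(series.name)
    if name == "id" or name.endswith("_id"):
        return "ID"
    if pd.api.types.is_datetime64_any_dtype(series):
        return None  # time columns are out of scope for this sketch
    if pd.api.types.is_bool_dtype(series):
        return "categorical"
    if pd.api.types.is_numeric_dtype(series):
        return "numerical"
    # Few distinct values suggests categories; many suggests free text.
    if series.nunique(dropna=True) <= max_categories:
        return "categorical"
    return "text"


df = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "status": ["active", "active", "churned", "active"],
    "amount": [9.5, 20.0, 3.25, 7.0],
    "description": ["first order", "gift for a friend", "refund requested", "bulk purchase"],
})

# A small threshold is used here only because the example frame is tiny.
suggestions = {col: suggest_stype(df[col], max_categories=3) for col in df.columns}
print(suggestions)
```

Treat the output as a checklist for review: numeric codes that are really categories, for example, still need a manual override.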

## Graph Construction

1. **Design for meaningful relationships:**

   ```python
   # Good: Meaningful entity relationships
   graph = rfm.Graph.from_data({
       'users': users_df,           # Entity table
       'transactions': transactions_df,  # Event table linked to users
       'products': products_df      # Entity table linked via transactions
   })
   ```

2. **Ensure prediction-ready structure:**

   ```python
   # Consider what PQL queries you want to run
   # Example: "PREDICT COUNT(transactions.*, 0, 30, days) FOR users.user_id=1"
   # Requires: users -> transactions relationship via user_id
   # Requires: a time column on the transactions table
   ```
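
The two requirements listed for the example query can be checked in pandas before building the graph. This is a minimal sketch on illustrative data: it looks for foreign keys that do not resolve to a user and confirms the time column is a real datetime dtype.

```python
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2]})
transactions = pd.DataFrame({
    "transaction_id": [1, 2, 3],
    "user_id": [1, 1, 3],  # user 3 does not exist in `users`
    "timestamp": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03"]),
})

# Foreign keys that do not resolve to a row in the entity table.
orphans = transactions.loc[~transactions["user_id"].isin(users["user_id"])]

# The event table needs a datetime column, not date strings.
has_time_column = pd.api.types.is_datetime64_any_dtype(transactions["timestamp"])

print(f"orphaned transactions: {len(orphans)}")
print(f"time column is datetime: {has_time_column}")
```

Orphaned rows are worth fixing or dropping deliberately, since they cannot be linked to any entity in the graph.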

## Common Data Modeling Patterns

**Entity-Event Pattern**

```python
import pandas as pd

# Users (entity) with transactions (events)
users = pd.DataFrame({'user_id': [1, 2], 'name': ['Alice', 'Bob']})
transactions = pd.DataFrame({
    'transaction_id': [1, 2, 3],
    'user_id': [1, 1, 2],
    'amount': [100, 50, 200],
    # Parse timestamps so the event table has a proper datetime column
    'timestamp': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03'])
})
```

**Hierarchical Entities**

```python
# Multiple levels: company -> department -> employee
companies = pd.DataFrame({'company_id': [1, 2], 'company_name': ['ACME', 'TechCorp']})
departments = pd.DataFrame({'dept_id': [1, 2], 'company_id': [1, 1], 'dept_name': ['Engineering', 'Sales']})
employees = pd.DataFrame({'emp_id': [1, 2, 3], 'dept_id': [1, 1, 2], 'emp_name': ['Alice', 'Bob', 'Charlie']})
```
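
One way to sanity-check a hierarchy like this is to resolve every level with pandas merges. The sketch below reuses the frames above and relies on `validate="many_to_one"`, which makes pandas raise an error if a parent key is duplicated.

```python
import pandas as pd

companies = pd.DataFrame({"company_id": [1, 2], "company_name": ["ACME", "TechCorp"]})
departments = pd.DataFrame({"dept_id": [1, 2], "company_id": [1, 1], "dept_name": ["Engineering", "Sales"]})
employees = pd.DataFrame({"emp_id": [1, 2, 3], "dept_id": [1, 1, 2], "emp_name": ["Alice", "Bob", "Charlie"]})

# Walk up the hierarchy: employee -> department -> company.
resolved = (
    employees
    .merge(departments, on="dept_id", how="left", validate="many_to_one")
    .merge(companies, on="company_id", how="left", validate="many_to_one")
)

# Employees whose chain of foreign keys does not reach a company.
unresolved = int(resolved["company_name"].isna().sum())
print(f"employees without a company: {unresolved}")
```

A non-zero count points to a broken link at some level of the hierarchy.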

**Junction Table for Many-to-Many**

```python
# Products and categories with many-to-many relationship
products = pd.DataFrame({'product_id': [1, 2], 'product_name': ['Laptop', 'Mouse']})
categories = pd.DataFrame({'category_id': [1, 2], 'category_name': ['Electronics', 'Accessories']})
product_categories = pd.DataFrame({
    'product_category_id': [1, 2, 3],
    'product_id': [1, 1, 2],
    'category_id': [1, 2, 2]
})
```
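
To confirm the junction table encodes the intended relationship, it can be joined back to both entity tables. This is a minimal sketch on the frames above; it also checks that no product and category pair is listed twice.

```python
import pandas as pd

products = pd.DataFrame({"product_id": [1, 2], "product_name": ["Laptop", "Mouse"]})
categories = pd.DataFrame({"category_id": [1, 2], "category_name": ["Electronics", "Accessories"]})
product_categories = pd.DataFrame({
    "product_category_id": [1, 2, 3],
    "product_id": [1, 1, 2],
    "category_id": [1, 2, 2],
})

# Each (product, category) pair should appear at most once in the junction table.
has_duplicate_pairs = bool(
    product_categories.duplicated(subset=["product_id", "category_id"]).any()
)

# Resolve both foreign keys to readable names.
named = (
    product_categories
    .merge(products, on="product_id", validate="many_to_one")
    .merge(categories, on="category_id", validate="many_to_one")
)
categories_per_product = (
    named.groupby("product_name")["category_name"].apply(list).to_dict()
)
print(categories_per_product)
```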

## Summary

Following these best practices will help ensure your KumoRFM datasets are well-structured, validated, and optimized for performance:

**Table Design:**

* One entity or event per table
* Single time column per table
* Unique primary keys with consistent naming
* Junction tables for many-to-many relationships

**Data Preparation:**

* Set proper pandas dtypes before creating tables
* Use meaningful semantic types (ID, categorical, text, numerical)
* Validate metadata and semantic types before proceeding

**Graph Structure:**

* Design meaningful entity relationships
* Consider PQL query requirements in your structure
* Ensure the graph forms a single connected component
* Validate the graph before running queries
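
The connected-component item can be checked without any SDK calls by treating tables as nodes and foreign-key links as edges. The sketch below is a plain breadth-first search; `is_single_component` and the table and link lists are hypothetical and written only for illustration.

```python
from collections import deque


def is_single_component(tables, links):
    """Return True if every table is reachable from every other via links.

    tables: iterable of table names
    links:  iterable of (table_a, table_b) pairs, one per foreign-key link
    """
    neighbors = {name: set() for name in tables}
    if not neighbors:
        return True
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Breadth-first search from an arbitrary starting table.
    start = next(iter(neighbors))
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbors[current] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return len(seen) == len(neighbors)


tables = ["users", "transactions", "products"]
links = [("transactions", "users"), ("transactions", "products")]
print(is_single_component(tables, links))
```

If the check fails, at least one table has no link to the rest of the graph and is unreachable from the other tables.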

These patterns will help you create robust, queryable datasets that work effectively with KumoRFM's predictive capabilities.
