This section outlines the best practices for preparing high-quality datasets for KumoRFM, incorporating key insights from Table and Graph design patterns.
Table Structure and Entity Design
- One entity or event per table:
# Good: Separate tables for different entities
users = df[['user_id', 'user_name', 'user_email']]
transactions = df[['transaction_id', 'user_id', 'amount', 'timestamp']]
# Avoid: Mixing entities in one table
mixed_data = df[['user_id', 'user_name', 'transaction_id', 'amount']]
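When the source is a single denormalized export, selecting the user columns alone still leaves one row per transaction. The following is a minimal sketch of the split, assuming a hypothetical flat `df` whose columns and values are invented for illustration:

```python
import pandas as pd

# Hypothetical denormalized export: one row per transaction,
# with the user fields repeated on every row.
df = pd.DataFrame({
    'user_id': [1, 1, 2],
    'user_name': ['Alice', 'Alice', 'Bob'],
    'user_email': ['alice@example.com', 'alice@example.com', 'bob@example.com'],
    'transaction_id': [10, 11, 12],
    'amount': [100.0, 50.0, 200.0],
    'timestamp': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03']),
})

# Entity table: keep exactly one row per user.
users = (
    df[['user_id', 'user_name', 'user_email']]
    .drop_duplicates(subset='user_id')
    .reset_index(drop=True)
)

# Event table: one row per transaction, with user_id kept as the foreign key.
transactions = df[['transaction_id', 'user_id', 'amount', 'timestamp']].copy()
```

Deduplicating on `user_id` keeps the first occurrence of each user; if user attributes can differ between rows, decide explicitly which version to keep.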
- Single time column per table:
# Good: Split tables with multiple timestamps
policies = df[['policy_id', 'user_id', 'start_date', 'policy_type']]
claims = df[['claim_id', 'policy_id', 'claim_date', 'claim_amount']]
# Avoid: Multiple time columns in one table
mixed_times = df[['policy_id', 'start_date', 'claim_date', 'end_date']]
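One way to spot tables that still carry more than one time column is to list their datetime-typed columns. This is a rough sketch only: `datetime_columns` is a hypothetical helper, not part of the Kumo SDK, and the sample data is invented.

```python
import pandas as pd

def datetime_columns(frame: pd.DataFrame) -> list:
    """Return the names of all datetime-typed columns in `frame` (hypothetical helper)."""
    return list(frame.select_dtypes(include=['datetime', 'datetimetz']).columns)

# Hypothetical combined frame with two time columns.
df = pd.DataFrame({
    'policy_id': [1, 2],
    'user_id': [10, 11],
    'policy_type': ['auto', 'home'],
    'start_date': pd.to_datetime(['2023-01-01', '2023-02-01']),
    'claim_id': [100, 101],
    'claim_date': pd.to_datetime(['2023-03-01', '2023-04-01']),
    'claim_amount': [500.0, 750.0],
})

# Split so that each table keeps a single time column.
policies = df[['policy_id', 'user_id', 'start_date', 'policy_type']].drop_duplicates(subset='policy_id')
claims = df[['claim_id', 'policy_id', 'claim_date', 'claim_amount']]

for name, frame in {'combined': df, 'policies': policies, 'claims': claims}.items():
    cols = datetime_columns(frame)
    status = 'ok' if len(cols) <= 1 else 'split this table'
    print(f"{name}: {cols} -> {status}")
```

The check only sees columns that already have a datetime dtype, so it is most useful after timestamps have been parsed with `pd.to_datetime`.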
- Handle many-to-many relationships with junction tables:
# Good: Junction table pattern
users = df[['user_id', 'user_name']]
skills = df[['skill_id', 'skill_name']]
user_skills = df[['user_skill_id', 'user_id', 'skill_id', 'proficiency']]
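If the many-to-many relationship arrives as a list-valued column, the junction table can be derived with `explode`. This is a sketch under that assumption; the `raw` frame and the surrogate keys generated below are illustrative and not taken from the original data.

```python
import pandas as pd

# Hypothetical input: each user row carries a list of skill names.
raw = pd.DataFrame({
    'user_id': [1, 2],
    'user_name': ['Alice', 'Bob'],
    'skills': [['python', 'sql'], ['sql']],
})

users = raw[['user_id', 'user_name']]

# One row per (user, skill) pair.
pairs = (
    raw[['user_id', 'skills']]
    .explode('skills')
    .rename(columns={'skills': 'skill_name'})
)

# Entity table for skills, with a generated surrogate key.
skills = pairs[['skill_name']].drop_duplicates().reset_index(drop=True)
skills.insert(0, 'skill_id', list(range(1, len(skills) + 1)))

# Junction table: its own primary key plus the two foreign keys.
user_skills = pairs.merge(skills, on='skill_name')[['user_id', 'skill_id']].reset_index(drop=True)
user_skills.insert(0, 'user_skill_id', list(range(1, len(user_skills) + 1)))
```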
Data Preparation
- Modify dtypes at `pandas.DataFrame` level before creating tables:
import pandas as pd
import kumoai.experimental.rfm as rfm
# Good: Set proper pandas dtypes first
df['user_id'] = df['user_id'].astype('int64')
df['category'] = df['category'].astype('string')
df['timestamp'] = pd.to_datetime(df['timestamp'])
table = rfm.LocalTable(df, "my_table")
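Raw exports often contain values that do not convert cleanly. The sketch below uses an invented frame and standard pandas calls only: it coerces unparseable timestamps to `NaT` so they can be counted before any table is created.

```python
import pandas as pd

# Hypothetical raw frame in which every column arrived as strings.
df = pd.DataFrame({
    'user_id': ['1', '2', '3'],
    'category': ['a', 'b', 'a'],
    'timestamp': ['2023-01-01', '2023-01-02', 'not a date'],
})

df['user_id'] = df['user_id'].astype('int64')
df['category'] = df['category'].astype('string')

# errors='coerce' turns unparseable values into NaT instead of raising,
# so bad rows can be counted and handled explicitly.
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
bad_timestamps = int(df['timestamp'].isna().sum())

print(f"{bad_timestamps} row(s) with unparseable timestamps")
print(df.dtypes)
```

Whether to drop, repair, or keep the affected rows depends on the dataset; the point is to make the decision before the table is built.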
- Use consistent naming conventions:
# Good: Consistent foreign key naming
users.user_id, transactions.user_id, profiles.user_id
# Avoid: Inconsistent naming
users.id, transactions.uid, profiles.customer_id
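Inconsistent key names can be aligned with a plain `rename` before tables are created. A minimal sketch, assuming three hypothetical frames that use the inconsistent names shown above:

```python
import pandas as pd

# Hypothetical frames that name the same key three different ways.
users = pd.DataFrame({'id': [1, 2], 'name': ['Alice', 'Bob']})
transactions = pd.DataFrame({'uid': [1, 2], 'amount': [10.0, 20.0]})
profiles = pd.DataFrame({'customer_id': [1, 2], 'bio': ['a', 'b']})

# Rename each key column to the shared name `user_id`.
users = users.rename(columns={'id': 'user_id'})
transactions = transactions.rename(columns={'uid': 'user_id'})
profiles = profiles.rename(columns={'customer_id': 'user_id'})

# Confirm every table now exposes the same key name.
for name, frame in {'users': users, 'transactions': transactions, 'profiles': profiles}.items():
    print(name, 'user_id' in frame.columns)
```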
- Ensure unique primary keys:
# Validate uniqueness
assert df['user_id'].nunique() == len(df)
assert df['user_id'].notna().all()
# Note: if the key is not unique, Kumo drops duplicate rows internally,
# so it is safer to resolve duplicates yourself first.
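A bare `assert` says that something is wrong but not which rows. The sketch below shows one way to surface the offending rows and deduplicate explicitly; `primary_key_problems` is a hypothetical helper and the data is invented.

```python
import pandas as pd

def primary_key_problems(frame: pd.DataFrame, key: str) -> pd.DataFrame:
    """Return rows whose `key` is missing or duplicated (hypothetical helper)."""
    missing = frame[key].isna()
    duplicated = frame[key].duplicated(keep=False) & ~missing
    return frame[missing | duplicated]

# Hypothetical table with one duplicated key and one missing key.
df = pd.DataFrame({
    'user_id': [1, 2, 2, None],
    'name': ['Alice', 'Bob', 'Bobby', 'Eve'],
})

problems = primary_key_problems(df, 'user_id')
print(problems)

# Resolve explicitly: drop missing keys, then keep the first row per key.
clean = df.dropna(subset=['user_id']).drop_duplicates(subset='user_id', keep='first')
```

Keeping the first row is only one possible policy; keeping the most recent row or merging attributes may suit other datasets better.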
Semantic Type Assignment
- Choose meaningful semantic types:
# IDs should use ID stype
table['user_id'].stype = 'ID'
# Text descriptions should use text stype
table['description'].stype = 'text'
# Limited categories should use categorical stype
table['status'].stype = 'categorical'
# Numerical measurements should use numerical stype
table['amount'].stype = 'numerical'
# Important: validate the resulting metadata before proceeding
print(table.metadata)
Graph Construction
- Design for meaningful relationships:
# Good: Meaningful entity relationships
graph = rfm.Graph.from_data({
'users': users_df, # Entity table
'transactions': transactions_df, # Event table linked to users
'products': products_df # Entity table linked via transactions
})
- Ensure prediction-ready structure:
# Consider what PQL queries you want to run
# Example: "PREDICT COUNT(transactions.*, 0, 30, days) FOR users.user_id=1"
# Requires: users -> transactions relationship via user_id
# Requires: a timestamp column in the transactions table
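Both requirements can be checked in pandas before a graph is built. This is a sketch with invented frames and standard pandas calls only; it does not call the Kumo SDK.

```python
import pandas as pd

users = pd.DataFrame({'user_id': [1, 2]})
transactions = pd.DataFrame({
    'transaction_id': [1, 2, 3],
    'user_id': [1, 1, 3],  # user 3 does not exist in `users`
    'timestamp': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03']),
})

# Requirement 1: every transactions.user_id must resolve to a row in users.
orphans = transactions[~transactions['user_id'].isin(users['user_id'])]

# Requirement 2: the event table needs a real datetime column.
has_time_column = pd.api.types.is_datetime64_any_dtype(transactions['timestamp'])

print(f"orphaned transactions: {len(orphans)}")
print(f"timestamp is datetime: {has_time_column}")
```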
Common Data Modeling Patterns
Entity-Event Pattern
# Users (entity) with transactions (events)
users = pd.DataFrame({'user_id': [1, 2], 'name': ['Alice', 'Bob']})
transactions = pd.DataFrame({
'transaction_id': [1, 2, 3],
'user_id': [1, 1, 2],
'amount': [100, 50, 200],
'timestamp': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03'])
})
Hierarchical Entities
# Multiple levels: company -> department -> employee
companies = pd.DataFrame({'company_id': [1, 2], 'company_name': ['ACME', 'TechCorp']})
departments = pd.DataFrame({'dept_id': [1, 2], 'company_id': [1, 1], 'dept_name': ['Engineering', 'Sales']})
employees = pd.DataFrame({'emp_id': [1, 2, 3], 'dept_id': [1, 1, 2], 'emp_name': ['Alice', 'Bob', 'Charlie']})
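A quick way to confirm that the hierarchy is intact is to walk the keys from the lowest level up with left joins and look for rows that fail to resolve. A minimal sketch using the same three frames:

```python
import pandas as pd

companies = pd.DataFrame({'company_id': [1, 2], 'company_name': ['ACME', 'TechCorp']})
departments = pd.DataFrame({'dept_id': [1, 2], 'company_id': [1, 1], 'dept_name': ['Engineering', 'Sales']})
employees = pd.DataFrame({'emp_id': [1, 2, 3], 'dept_id': [1, 1, 2], 'emp_name': ['Alice', 'Bob', 'Charlie']})

# Left joins keep every employee; a missing parent shows up as NaN.
employee_paths = (
    employees
    .merge(departments, on='dept_id', how='left')
    .merge(companies, on='company_id', how='left')
)

# Employees whose department or company could not be resolved.
unresolved = employee_paths[employee_paths['company_name'].isna()]
print(employee_paths[['emp_name', 'dept_name', 'company_name']])
print(f"unresolved employees: {len(unresolved)}")
```

The joined frame is for validation only; the three tables should still be passed to the graph separately.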
Junction Table for Many-to-Many
# Products and categories with many-to-many relationship
products = pd.DataFrame({'product_id': [1, 2], 'product_name': ['Laptop', 'Mouse']})
categories = pd.DataFrame({'category_id': [1, 2], 'category_name': ['Electronics', 'Accessories']})
product_categories = pd.DataFrame({
'product_category_id': [1, 2, 3],
'product_id': [1, 1, 2],
'category_id': [1, 2, 2]
})
Summary
Following these best practices will help ensure your KumoRFM datasets are well-structured, validated, and optimized for performance:
Table Design:
- One entity or event per table
- Single time column per table
- Unique primary keys with consistent naming
- Junction tables for many-to-many relationships
Data Preparation:
- Set proper pandas dtypes before creating tables
- Use meaningful semantic types (ID, categorical, text, numerical)
- Validate metadata and semantic types before proceeding
Graph Structure:
- Design meaningful entity relationships
- Consider PQL query requirements in your structure
- Ensure the graph forms a single connected component
- Test with a validation workflow
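The connected-component requirement can be checked independently of any SDK by treating tables as nodes and foreign-key links as undirected edges. The sketch below uses a breadth-first search; `is_single_component` and the table and link lists are hypothetical, written for illustration.

```python
from collections import defaultdict, deque

def is_single_component(tables, links):
    """Return True if every table is reachable from the first one via `links`.

    `tables` is a list of table names; `links` is a list of
    (table_a, table_b) pairs, treated as undirected edges.
    """
    if not tables:
        return True
    neighbors = defaultdict(set)
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)

    start = tables[0]
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors[node] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return set(tables) <= seen

tables = ['users', 'transactions', 'products', 'reviews']
connected = is_single_component(tables, [('users', 'transactions'), ('transactions', 'products'), ('products', 'reviews')])
disconnected = is_single_component(tables, [('users', 'transactions'), ('products', 'reviews')])
print(connected, disconnected)
```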
These patterns will help you create robust, queryable datasets that work effectively with KumoRFM’s predictive capabilities.