Graph Definitions

Graph objects represent the relational structure between your tables. A good graph depends on well-prepared tables underneath: correct dtypes, stypes, primary keys, and time columns in each individual table are essential.

Graph Structure and Metadata

A Graph holds two types of information:

Tables: The collection of LocalTable objects containing your data
Edges: The relational metadata defining how tables connect through primary/foreign key relationships

The edges are the crucial metadata that transforms individual tables into a connected relational structure, enabling KumoRFM to understand and leverage relationships in your data.
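
To make the two kinds of information concrete, here is a minimal sketch of a graph as a set of table names plus a list of edges, where each edge is a (source table, foreign key, destination table) triple. `EdgeSketch` and `GraphSketch` are hypothetical stand-ins written for illustration; they are not the kumoai classes and say nothing about how those are implemented.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EdgeSketch:
    """Hypothetical stand-in for an edge: a foreign key pointing at a table."""
    src_table: str  # table that holds the foreign key column
    fkey: str       # name of the foreign key column in src_table
    dst_table: str  # table whose primary key the foreign key references


@dataclass
class GraphSketch:
    """Hypothetical stand-in for a graph: table names plus edges."""
    tables: set[str] = field(default_factory=set)
    edges: list[EdgeSketch] = field(default_factory=list)

    def link(self, src_table: str, fkey: str, dst_table: str) -> None:
        # Both endpoints must be known tables before they can be connected.
        for name in (src_table, dst_table):
            if name not in self.tables:
                raise KeyError(f"unknown table: {name}")
        self.edges.append(EdgeSketch(src_table, fkey, dst_table))


sketch = GraphSketch(tables={"users", "transactions"})
sketch.link("transactions", "user_id", "users")
print(sketch.edges)
```

Without the edge, the two tables are just two unrelated collections of rows; the triple is what records that each transaction belongs to a user.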

Graph Construction Methods

Graph provides several factory methods for different data sources:

Graph.from_data() — from pandas DataFrames (see below)
Graph.from_sqlite() — from a SQLite database (see SQLite Connector)
Graph.from_snowflake() — from a Snowflake warehouse (see Snowflake Connector)
Graph.from_relbench() — from RelBench benchmark datasets (see RelBench)

From pandas DataFrames, you can construct a graph in two ways:

import kumoai.rfm as rfm

# Method 1: Utility function (recommended for most cases)
# Automatically creates tables from data frames, infers metadata, and finds links
graph = rfm.Graph.from_data({
    'users': df_users,
    'products': df_products,
    'transactions': df_transactions
})

# Method 2: Manual construction from pre-configured table objects
tables = [users_table, products_table, transactions_table]
graph = rfm.Graph(tables=tables)
graph.infer_links()  # or define links manually

The utility function Graph.from_data() is often preferred because it:

Creates LocalTable objects from your data frames
Calls infer_metadata() on each table (see Table Definitions)
Automatically infers links between tables based on column names

Link Inference and Naming Conventions

Link inference is based on column names, making consistent naming conventions crucial for automatic graph construction:

# For example, these column patterns create automatic links:
# transactions.user_id -> users.user_id (or users.id)
# orders.product_id -> products.product_id (or products.id)
# reviews.customer_id -> customers.customer_id (or customers.id)

# View inferred edges
for edge in graph.edges:
    print(f"{edge.src_table}.{edge.fkey} -> {edge.dst_table}")

Best practice: Use one foreign key name per relationship (e.g., always use user_id, rather than mixing user_id, uid, and customer_id for the same relationship).
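
To show why naming matters, the following toy heuristic matches foreign key columns to tables purely by name. It is an illustrative assumption, not the algorithm KumoRFM uses: the function `infer_links_by_name`, the `_id` suffix rule, and the singular/plural table-name matching are all hypothetical.

```python
def infer_links_by_name(columns, primary_keys):
    """Toy name-based link inference.

    columns: dict mapping table name -> list of column names
    primary_keys: dict mapping table name -> primary key column name
    Returns a sorted list of (src_table, fkey, dst_table) triples.
    """
    links = []
    for src, cols in columns.items():
        for col in cols:
            # Only consider "<stem>_id" columns that are not the table's own key.
            if not col.endswith("_id") or col == primary_keys.get(src):
                continue
            stem = col[: -len("_id")]  # "user_id" -> "user"
            for dst, pkey in primary_keys.items():
                if dst == src:
                    continue
                # Match "user" against a table named "user" or "users"
                # whose primary key is "user_id" or plain "id".
                if dst in (stem, stem + "s") and pkey in (col, "id"):
                    links.append((src, col, dst))
    return sorted(links)


columns = {
    "users": ["user_id", "name"],
    "products": ["id", "title"],
    "transactions": ["transaction_id", "user_id", "product_id", "uid"],
}
primary_keys = {"users": "user_id", "products": "id", "transactions": "transaction_id"}

links = infer_links_by_name(columns, primary_keys)
print(links)
```

In this sketch, `transactions.user_id` and `transactions.product_id` are matched, while the inconsistently named `uid` column is silently ignored. That is the failure mode a consistent naming convention avoids.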

Manual Link Management

If you cannot rename columns to follow consistent patterns, you can add links manually:

# Add specific edge
graph.link(src_table="transactions", fkey="user_id", dst_table="users")

# Remove edge
graph.unlink(src_table="transactions", fkey="user_id", dst_table="users")

What Makes a Good Graph

A good Graph should have:

Well-prepared tables: Tables have correct metadata and are split up according to best practices (see Table Definitions)
Meaningful links: Edges represent real relationships between tables, not just technical connections
Well-defined entities: Each table represents either a single entity or a single event, not a mix of both
Prediction-ready structure: The graph structure limits which queries can be defined with PQL (see Make Predictions), so confirm that the PQL queries you want to run are possible with your graph

Working around the limitations

Multiple entities in a single table: Tables that mix data from multiple entities should be split for better graph structure. Think about each table as representing a single entity type or event. Here’s an example:

import pandas as pd

# Original table mixing transaction, bank, and user data
mixed_data = pd.DataFrame({
    'transaction_id': [1, 2, 3],
    'bank_id': [101, 102, 101],
    'user_id': [201, 202, 203],
    'transaction_amount': [100.0, 250.0, 75.0],
    'transaction_type': ['deposit', 'withdrawal', 'transfer'],
    'bank_name': ['Chase', 'Wells Fargo', 'Chase'],
    'bank_routing': ['123456', '789012', '123456'],
    'user_name': ['Alice', 'Bob', 'Charlie'],
    'user_email': ['alice@email.com', 'bob@email.com', 'charlie@email.com']
})

# Split into three entity-focused tables

# 1. Transactions table (transaction-specific data)
transactions = mixed_data[['transaction_id', 'bank_id', 'user_id', 'transaction_amount', 'transaction_type']].copy()

# 2. Banks table (bank-specific data)
banks = mixed_data[['bank_id', 'bank_name', 'bank_routing']].drop_duplicates()

# 3. Users table (user-specific data)
users = mixed_data[['user_id', 'user_name', 'user_email']].drop_duplicates()

# Create graph with proper entity relationships
graph = rfm.Graph.from_data({
    'transactions': transactions,
    'banks': banks,
    'users': users
})
# Result: transactions.bank_id -> banks.bank_id and transactions.user_id -> users.user_id
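
One caveat when splitting this way: `drop_duplicates()` only removes rows that are identical in every column. If the same `bank_id` appears with two different spellings of `bank_name`, the resulting table still contains that ID twice and cannot serve as a primary key. A minimal sketch of a check for this follows; the helper `conflicting_keys` and the sample data are hypothetical, and keeping the first row per key is just one possible resolution.

```python
import pandas as pd


def conflicting_keys(df: pd.DataFrame, key: str) -> pd.DataFrame:
    """Return rows whose key value still repeats after dropping identical rows."""
    deduped = df.drop_duplicates()
    return deduped[deduped.duplicated(subset=key, keep=False)]


banks_raw = pd.DataFrame({
    "bank_id": [101, 102, 101],
    "bank_name": ["Chase", "Wells Fargo", "JPMorgan Chase"],  # 101 spelled two ways
})

conflicts = conflicting_keys(banks_raw, "bank_id")
print(conflicts)

# One possible resolution (an assumption, not a rule): keep the first row per key.
banks_clean = banks_raw.drop_duplicates(subset="bank_id", keep="first")
print(banks_clean["bank_id"].is_unique)
```

Which row to keep is a data-quality decision; the point of the check is to surface the conflict before the table is used as the destination of a link.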

Many-to-many relationships: KumoRFM only supports primary-foreign key relationships (one-to-many). Many-to-many relationships require a junction table to break them into two one-to-many relationships:

# Problem: Table with many-to-many data stored as lists/comma-separated values
user_skills_combined = pd.DataFrame({
    'user_id': [1, 2, 3],
    'user_name': ['Alice', 'Bob', 'Charlie'],
    'skills': [['Python', 'SQL'], ['SQL', 'Machine Learning'], ['Python', 'Machine Learning']],
    'proficiency_levels': [['expert', 'beginner'], ['intermediate', 'advanced'], ['expert', 'expert']]
})

# This structure cannot create proper foreign key relationships in KumoRFM

# Solution: Normalize into three tables with junction table

# 1. Users table (entity table)
users = user_skills_combined[['user_id', 'user_name']].copy()

# 2. Skills table (entity table)
# Collect unique skill names; sorting keeps the skill_id assignment
# deterministic across runs (plain set ordering is not)
unique_skills = sorted({
    skill
    for skill_list in user_skills_combined['skills']
    for skill in skill_list
})

skills = pd.DataFrame({
    'skill_id': range(1, len(unique_skills) + 1),
    'skill_name': unique_skills
})

# 3. Junction table (breaks many-to-many into two one-to-many)
# Map each skill name to its ID once, instead of filtering the table per row
skill_ids = dict(zip(skills['skill_name'], skills['skill_id']))
user_skills_records = []
for _, row in user_skills_combined.iterrows():
    for skill, proficiency in zip(row['skills'], row['proficiency_levels']):
        skill_id = skill_ids[skill]
        user_skills_records.append({
            'user_skill_id': len(user_skills_records) + 1,
            'user_id': row['user_id'],
            'skill_id': skill_id,
            'proficiency_level': proficiency
        })

user_skills = pd.DataFrame(user_skills_records)

# Create graph with proper one-to-many relationships
graph = rfm.Graph.from_data({
    'users': users,
    'skills': skills,
    'user_skills': user_skills
})
# Result: user_skills.user_id -> users.user_id and user_skills.skill_id -> skills.skill_id

This normalization allows proper foreign key relationships and stores relationship-specific attributes (like proficiency_level) in the junction table.
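
For larger tables, the same normalization can be written without Python-level loops. The sketch below is an alternative formulation, assuming pandas 1.3 or later (needed for exploding several columns at once) and assuming the two list columns have equal lengths in every row. The variable names are illustrative.

```python
import pandas as pd

combined = pd.DataFrame({
    "user_id": [1, 2, 3],
    "skills": [["Python", "SQL"], ["SQL", "Machine Learning"], ["Python", "Machine Learning"]],
    "proficiency_levels": [["expert", "beginner"], ["intermediate", "advanced"], ["expert", "expert"]],
})

# One row per (user, skill) pair; both list columns are exploded in lockstep.
long_form = (
    combined.explode(["skills", "proficiency_levels"], ignore_index=True)
    .rename(columns={"skills": "skill_name", "proficiency_levels": "proficiency_level"})
)

# Skills entity table with deterministic IDs (sorted by name).
skills_tbl = pd.DataFrame({"skill_name": sorted(long_form["skill_name"].unique())})
skills_tbl["skill_id"] = range(1, len(skills_tbl) + 1)

# Junction table: swap the skill name for its ID and add a surrogate key.
junction = long_form.merge(skills_tbl, on="skill_name")[
    ["user_id", "skill_id", "proficiency_level"]
]
junction.insert(0, "user_skill_id", range(1, len(junction) + 1))

print(skills_tbl)
print(junction)
```

The resulting `skills_tbl` and `junction` frames play the same roles as `skills` and `user_skills` above and can be passed to `Graph.from_data()` in the same way.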

Graph Utilities

Visualizing the graph:

graph.visualize()

This displays an interactive visualization of the graph structure showing tables, columns, and edges, which is useful for verifying that links were inferred correctly.

Validating the graph:

graph.validate()

This checks that the graph meets all requirements for use with KumoRFM, including valid primary keys, consistent foreign key types, and proper edge definitions. Always validate before running predictions.
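
When validation fails, it can help to inspect the underlying data frames directly. The sketch below approximates the kinds of checks named above (primary key validity, matching key dtypes, foreign key values that exist in the destination) in plain pandas. It is an assumption about what is worth checking, not a description of what `graph.validate()` does internally, and `check_link` is a hypothetical helper.

```python
import pandas as pd


def check_link(src: pd.DataFrame, fkey: str, dst: pd.DataFrame, pkey: str) -> list[str]:
    """Return a list of human-readable problems for one foreign key link."""
    problems = []
    if dst[pkey].isna().any():
        problems.append(f"{pkey}: primary key contains missing values")
    if not dst[pkey].is_unique:
        problems.append(f"{pkey}: primary key is not unique")
    if src[fkey].dtype != dst[pkey].dtype:
        problems.append(f"{fkey}: dtype {src[fkey].dtype} != {dst[pkey].dtype}")
    else:
        # Only compare values once the dtypes agree; ignore missing foreign keys.
        orphans = ~src[fkey].dropna().isin(dst[pkey])
        if orphans.any():
            problems.append(f"{fkey}: {int(orphans.sum())} value(s) missing from {pkey}")
    return problems


users_df = pd.DataFrame({"user_id": [1, 2, 3]})
tx_df = pd.DataFrame({"transaction_id": [10, 11, 12], "user_id": ["1", "2", "9"]})

# Foreign key stored as strings, primary key as integers: a dtype mismatch.
before = check_link(tx_df, "user_id", users_df, "user_id")
print(before)

# After casting the foreign key to the primary key's dtype, only the orphan remains.
tx_df["user_id"] = tx_df["user_id"].astype(users_df["user_id"].dtype)
after = check_link(tx_df, "user_id", users_df, "user_id")
print(after)
```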
