1. Loading Data
KumoRFM interacts with `pandas.DataFrame` objects. You can ingest data into memory from various sources, such as local files, cloud data warehouses, or REST APIs. There is no hard limit on data size, but all DataFrames must fit into memory for processing.
Some examples:
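As a minimal sketch of the loading step, the snippet below builds two DataFrames with plain pandas: one from CSV text and one from a list of records, such as a decoded JSON response. The table names, columns, and values are made up for illustration. In practice `pd.read_csv` would point at a real file path, and `pd.read_parquet` or `pd.read_sql` would cover Parquet files and SQL sources.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a local file such as "users.csv".
users_csv = io.StringIO(
    "user_id,age,signup_date\n"
    "1,25,2025-07-11\n"
    "2,31,2023-02-12\n"
)
users_df = pd.read_csv(users_csv)

# Hypothetical records, e.g. the decoded JSON body of a REST API response.
order_records = [
    {"order_id": 10, "user_id": 1, "price": 3.14},
    {"order_id": 11, "user_id": 2, "price": 25.0},
]
orders_df = pd.DataFrame.from_records(order_records)

# Everything has to fit in memory, so checking the footprint early can help.
total_bytes = sum(
    df.memory_usage(deep=True).sum() for df in (users_df, orders_df)
)
print(f"users: {users_df.shape}, orders: {orders_df.shape}, bytes: {total_bytes}")
```

Whatever the source, the result is an ordinary DataFrame, which is all the later steps need.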
2. Creating LocalTable
Once loaded, you create `LocalTable` objects on top of the DataFrames. A `LocalTable` acts as a lightweight abstraction of a DataFrame, providing additional integration. It defines four critical properties:

`stype` (Semantic Type):
- A `stype` determines how the column will be encoded downstream.
- Correctly setting each column's `stype` is critical for model performance. For instance, if you want to perform missing value imputation, the semantic type determines whether it is treated as a regression task (`stype="numerical"`) or a classification task (`stype="categorical"`).

The table below lists the semantic types:

| Type | Explanation | Example |
|---|---|---|
| `"numerical"` | Numerical values (e.g., price, age) | `25`, `3.14`, `-10` |
| `"categorical"` | Discrete categories with limited cardinality | Color: `"red"`, `"blue"`, `"green"` (one cell may only have one category) |
| `"multicategorical"` | Multiple categories in a single cell | `"Action \| Drama \| Comedy"`, `"Action \| Thriller"` |
| `"ID"` | An identifier, e.g., primary keys or foreign keys | `user_id: 123`, `product_id: PRD-8729453` |
| `"text"` | Natural language text | Descriptions |
| `"timestamp"` | Specific point in time | `"2025-07-11"`, `"2023-02-12 09:47:58"` |
| `"sequence"` | Custom embeddings or sequential data | `[0.25, -0.75, 0.50, ...]` |

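To make the mapping from columns to semantic types more concrete, here is a small pandas-only sketch that suggests a first-pass type per column. The helper `guess_stype` and its `max_categories` threshold are hypothetical and are not KumoRFM's own inference logic. The helper covers only four of the types in the table and is meant as a starting point for manual review.

```python
import pandas as pd


def guess_stype(series: pd.Series, max_categories: int = 10) -> str:
    """Hypothetical heuristic for a first-pass semantic type.

    This is NOT KumoRFM's own inference logic; it only covers a few of the
    types in the table and its output should be reviewed by hand.
    """
    if pd.api.types.is_datetime64_any_dtype(series):
        return "timestamp"
    if pd.api.types.is_bool_dtype(series):
        return "categorical"
    if pd.api.types.is_numeric_dtype(series):
        return "numerical"
    # Non-numeric column: few distinct values suggests categories,
    # many distinct values suggests free text.
    if series.nunique(dropna=True) <= max_categories:
        return "categorical"
    return "text"


# Made-up example data.
df = pd.DataFrame(
    {
        "age": [25, 31, 40, 19],
        "color": ["red", "blue", "red", "blue"],
        "review": ["great fit", "too small", "arrived late", "would buy again"],
        "created_at": pd.to_datetime(
            ["2025-07-11", "2025-07-12", "2025-07-13", "2025-07-14"]
        ),
    }
)

suggested = {col: guess_stype(df[col], max_categories=2) for col in df.columns}
print(suggested)
```

A dtype-based heuristic like this would label an integer `user_id` column as `"numerical"`, which is one reason each column's `stype` is worth checking by hand.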
`primary_key`:
- The primary key is a unique identifier of each row in a table (e.g., `user_id` is the primary key in the `users` table).
- If there are duplicated primary keys, the system will only keep the first one.
- A primary key can be used to link tables through a primary key to foreign key relationship.
  - In the `users` table: `user_id` is the primary key.
  - In the `orders` table: `order_id` is the primary key, and `user_id` is a foreign key that points back to the `users` table.
  - These tables can be linked via `user_id` (see Section 3 on how to link).
- A primary key does not need to link to other tables. For example, in the `orders` table, the primary key (`order_id`) is not used for linking, but it still serves its main purpose: to uniquely identify each row in the table.
- `primary_key` can only be assigned to columns holding integers, floating point values or strings.
- Each table can have at most one `primary_key` column.

The primary key serves two purposes: (1) when creating a graph, it is the reference point for linking other tables; (2) when making predictions, it identifies the entity to generate predictions for. For instance, if you want to predict user outcomes, you need a table with `user_id` as the primary key.

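Because duplicated primary keys are reduced to their first occurrence, it can be useful to inspect duplicates before handing a DataFrame over. The sketch below uses plain pandas on made-up data. `drop_duplicates(keep="first")` mirrors the keep-the-first rule described above, but it is only a way to preview the effect, not a statement about how KumoRFM implements it.

```python
import pandas as pd

# Hypothetical users table in which user_id 2 appears twice.
users_df = pd.DataFrame(
    {
        "user_id": [1, 2, 2, 3],
        "name": ["Ana", "Ben", "Ben (duplicate)", "Cy"],
    }
)

# Rows whose primary key value occurs more than once.
duplicated_rows = users_df[users_df["user_id"].duplicated(keep=False)]

# Keep the first occurrence of each key, matching the rule described above.
deduped_users = users_df.drop_duplicates(subset="user_id", keep="first")

print(duplicated_rows)
print(deduped_users["user_id"].is_unique)
```

Reviewing `duplicated_rows` first makes it an explicit decision which row survives, rather than leaving it to row order.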
`time_column`:
- Indicates the timestamp column that records when the event occurred.
- Time column data must be parseable via `pandas.to_datetime`.
- Each table can have at most one `time_column` column.

`end_time_column`:
- Indicates the timestamp column that records when the event should be dropped from consideration (e.g., when a user becomes inactive).
- End time column data must be parseable via `pandas.to_datetime`.
- Each table can have at most one `end_time_column` column.

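Since both time columns must be parseable by `pandas.to_datetime`, a quick pre-check can surface problematic values early. The following is a sketch on made-up data. The helper `unparseable_values` and the `subscriptions` columns are hypothetical, and the check skips missing values rather than assuming how they should be treated.

```python
import pandas as pd


def unparseable_values(series: pd.Series) -> pd.Series:
    """Return the non-null entries that pandas.to_datetime cannot parse."""
    parsed = pd.to_datetime(series, errors="coerce")
    return series[parsed.isna() & series.notna()]


# Hypothetical subscriptions table with a start and an end timestamp.
subscriptions_df = pd.DataFrame(
    {
        "subscription_id": [1, 2, 3],
        "started_at": [
            "2023-02-12 09:47:58",
            "2024-01-05 12:00:00",
            "2025-07-11 08:30:00",
        ],
        "ended_at": ["2023-08-12 09:47:58", "not a date", None],
    }
)

bad_start = unparseable_values(subscriptions_df["started_at"])
bad_end = unparseable_values(subscriptions_df["ended_at"])

print("unparseable start values:", bad_start.tolist())
print("unparseable end values:", bad_end.tolist())
```

The example keeps one timestamp format per column; columns that mix formats may need an explicit `format` argument to `pandas.to_datetime`, depending on the pandas version.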
3. Connecting Tables to Form a Graph
After creating your tables, the next step is to link them into a `LocalGraph`. A good guiding principle is to start simple: begin with just the minimal set of tables needed to support the prediction task you care about. Focus on the core entities and relationships essential to prediction. For example, suppose your goal is to predict a user's future orders (how much they will purchase). At a minimum, your graph only needs two tables:
- `users`: representing each user
- `orders`: representing the orders placed by those users

You can later expand the graph with additional tables, such as an `items` table, so that RFM can take item information into account.
Example: Building a Customer–Transaction Graph
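The two-table structure described above can be sketched with plain pandas before any graph object is created. The snippet below is not the `LocalGraph` API. It writes the tables, primary keys, and the single `orders.user_id` to `users.user_id` link down as ordinary Python data (all names and values are made up) and checks that the link is consistent. How KumoRFM treats foreign keys without a matching primary key is not covered here.

```python
import pandas as pd

# Hypothetical customer and transaction data.
users_df = pd.DataFrame({"user_id": [1, 2, 3], "age": [25, 31, 40]})
orders_df = pd.DataFrame(
    {
        "order_id": [10, 11, 12, 13],
        "user_id": [1, 1, 2, 9],  # user 9 does not exist in users_df
        "price": [3.14, 25.0, 10.0, 7.5],
    }
)

tables = {"users": users_df, "orders": orders_df}
primary_keys = {"users": "user_id", "orders": "order_id"}
# Each link: (source table, foreign key column, destination table).
links = [("orders", "user_id", "users")]

# Each primary key column should uniquely identify its rows.
for name, pkey in primary_keys.items():
    assert tables[name][pkey].is_unique, f"{name}.{pkey} has duplicates"

# For each link, collect foreign key values with no matching primary key.
unmatched = {}
for src, fkey, dst in links:
    known_keys = tables[dst][primary_keys[dst]]
    mask = ~tables[src][fkey].isin(known_keys)
    unmatched[(src, fkey, dst)] = sorted(
        tables[src].loc[mask, fkey].unique().tolist()
    )

print(unmatched)
```

Adding a third table later, such as `items`, would mean one more entry in `tables` and `primary_keys` plus one more tuple in `links`.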
4. Initiating the Model
You are now ready to plug your graph into KumoRFM to make predictions! This is a one-time setup: once it is in place, you can generate a variety of predictions from it and power many business use cases.