KumoRFM provides an evaluation mode that automatically measures prediction quality by performing a train/test split on context examples and computing relevant metrics.

Running an Evaluation

Use KumoRFM.evaluate() with the same PQL syntax as `KumoRFM.predict()`:
metrics = model.evaluate(
    "PREDICT COUNT(orders.*, 0, 30, days) > 0 FOR users.user_id=1",
    run_mode="fast",
)
print(metrics)
The evaluation collects context examples, splits them into in-context (training) and test sets, generates predictions for the test set, and computes metrics comparing predictions to actual outcomes.
You can also use the EVALUATE keyword in the query string directly:
metrics = model.evaluate(
    "EVALUATE PREDICT COUNT(orders.*, 0, 30, days) FOR users.user_id=1"
)
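The collect, split, predict, and score flow described above can be illustrated without KumoRFM at all. The following is a minimal sketch in plain Python, not the library's implementation: `holdout_accuracy`, `majority_label`, the random split, and the toy examples are all hypothetical stand-ins, and how KumoRFM actually chooses its split is not covered here.

```python
import random


def holdout_accuracy(examples, predict_fn, test_fraction=0.2, seed=0):
    """Split labeled examples, predict the held-out part, return accuracy.

    `examples` is a list of (features, label) pairs. `predict_fn` receives the
    in-context examples plus one feature value and returns a predicted label.
    The random split is an assumption made for this illustration only.
    """
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test, context = shuffled[:n_test], shuffled[n_test:]
    # Compare each prediction with the known outcome of the test example.
    correct = sum(predict_fn(context, x) == y for x, y in test)
    return correct / len(test)


def majority_label(context, _features):
    """Hypothetical stand-in predictor: always return the most common label."""
    labels = [y for _, y in context]
    return max(set(labels), key=labels.count)


# Toy data: 40 examples, three quarters of them labeled True.
examples = [(i, i % 4 != 0) for i in range(40)]
acc = holdout_accuracy(examples, majority_label)
print(f"held-out accuracy: {acc:.2f}")
```

The point of the sketch is that the test labels are never shown to the predictor, so the score reflects how well predictions generalize from the in-context examples.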

Available Metrics

The metrics returned depend on the detected task type:
| Task Type | Supported Metrics |
| --- | --- |
| Binary Classification | acc, precision, recall, f1, auroc, auprc, ap |
| Multi-Class Classification | acc, precision, recall, f1, mrr |
| Regression / Forecasting | mae, mape, mse, rmse, smape, r2 |
| Temporal Link Prediction | map@k, ndcg@k, mrr@k, precision@k, recall@k, f1@k, hit_ratio@k |
You can specify which metrics to compute:
metrics = model.evaluate(
    "PREDICT SUM(orders.price, 0, 30, days) FOR items.item_id=42",
    metrics=["mae", "rmse", "r2"],
)
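For reference, the three regression metrics requested above have standard textbook definitions, which can be sketched in plain Python. This is an illustration of the formulas only; `regression_metrics` is a hypothetical helper, the sample values are made up, and KumoRFM's own computation may differ in details such as edge-case handling.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute mae, rmse, and r2 from their textbook definitions.

    Assumes both lists have the same non-zero length and that y_true is not
    constant (otherwise r2 is undefined).
    """
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    return {"mae": mae, "rmse": math.sqrt(mse), "r2": 1.0 - ss_res / ss_tot}


# Made-up targets and predictions, for illustration only.
scores = regression_metrics([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
print(scores)
```

mae and rmse are in the units of the target (here, summed order price), while r2 is unitless, which is why it is common to report one of each.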

Evaluation Parameters

The KumoRFM.evaluate() method accepts the same parameters as KumoRFM.predict(), plus:
  • metrics: A list of metric names to compute. If not specified, all applicable metrics for the task type are computed.
The run_mode, anchor_time, num_hops, and other parameters work identically to KumoRFM.predict(). See configuration for details on run modes.

Evaluation with TaskTable

For advanced use cases, you can construct a TaskTable explicitly and use `KumoRFM.evaluate_task()`:
from kumoai.rfm import TaskTable

task = TaskTable(
    task_type="binary_classification",
    context_df=context_dataframe,
    pred_df=prediction_dataframe,
    entity_table_name="users",
    entity_column="user_id",
    target_column="target",
    time_column="timestamp",
)

metrics = model.evaluate_task(task)
This gives you full control over the train/test split and context construction.
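One way to control the split yourself is to cut a labeled table at a point in time, so that every context example precedes every prediction example. The sketch below uses only pandas; the toy rows and the cutoff date are made up, and the column names simply mirror the `entity_column`, `target_column`, and `time_column` values in the example above. The exact schema that TaskTable expects is an assumption here, so check the API reference before relying on it.

```python
import pandas as pd

# Toy labeled examples (illustrative values only).
labeled = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "target": [True, False, True, True, False, True],
    "timestamp": pd.to_datetime([
        "2024-01-05", "2024-01-12", "2024-02-03",
        "2024-02-20", "2024-03-08", "2024-03-25",
    ]),
})

# Hypothetical cutoff: rows before it become context, the rest are held out.
cutoff = pd.Timestamp("2024-03-01")
context_dataframe = labeled[labeled["timestamp"] < cutoff].reset_index(drop=True)
prediction_dataframe = labeled[labeled["timestamp"] >= cutoff].reset_index(drop=True)

print(len(context_dataframe), "context rows,", len(prediction_dataframe), "test rows")
```

A time-based cutoff avoids leaking future information into the context, which a random row split would not guarantee.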

Interpreting Results

The evaluation returns a pandas.DataFrame with metric and value columns:
>>> metrics = model.evaluate(query)
>>> print(metrics)
  metric  value
0    mae   12.5
1   rmse   15.3
2     r2   0.82
Higher values are better for r2, acc, auroc, auprc, ap, precision, recall, and f1. Lower values are better for mae, mape, mse, rmse, and smape. The ranking metrics (mrr and the @k metrics) are also higher-is-better.
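Because the result is an ordinary DataFrame, standard pandas operations are enough to look up a score or compare two runs. The sketch below rebuilds the sample output shown above by hand rather than calling the model; `is_improvement` and the two direction sets are hypothetical helpers written for this example, not part of the KumoRFM API.

```python
import pandas as pd

# Rebuild the sample output shown above (no model call needed for the sketch).
metrics = pd.DataFrame({
    "metric": ["mae", "rmse", "r2"],
    "value": [12.5, 15.3, 0.82],
})

# Turn the two-column frame into a {metric: value} lookup.
scores = metrics.set_index("metric")["value"].to_dict()

HIGHER_IS_BETTER = {"r2", "acc", "auroc", "auprc", "ap", "precision", "recall", "f1"}
LOWER_IS_BETTER = {"mae", "mape", "mse", "rmse", "smape"}


def is_improvement(metric, old, new):
    """Return True if `new` is better than `old` for the given metric."""
    if metric in HIGHER_IS_BETTER:
        return new > old
    if metric in LOWER_IS_BETTER:
        return new < old
    raise ValueError(f"unknown metric direction: {metric}")


print(scores["rmse"])
print(is_improvement("mae", old=scores["mae"], new=10.0))
```

Encoding the direction of each metric once keeps run-to-run comparisons from silently treating a lower r2 or a higher mae as progress.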