KumoRFM provides an evaluation mode that automatically measures prediction quality by performing a train/test split on context examples and computing relevant metrics.

Running an Evaluation

Use `KumoRFM.evaluate()` with the same PQL syntax as `KumoRFM.predict()`:
metrics = model.evaluate(
    "PREDICT COUNT(orders.*, 0, 30, days) > 0 FOR users.user_id=1",
    run_mode="FAST",
)
print(metrics)
The evaluation collects context examples, splits them into in-context (training) and test sets, generates predictions for the test set, and computes metrics comparing predictions to actual outcomes. You can also use the EVALUATE keyword in the query string directly:
metrics = model.evaluate(
    "EVALUATE PREDICT COUNT(orders.*, 0, 30, days) FOR users.user_id=1"
)
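The split itself is handled internally by KumoRFM. Conceptually it resembles a temporal holdout, where the most recent examples become the test set; the sketch below is purely illustrative (the example data and `temporal_holdout` helper are not part of the KumoRFM API):

```python
from datetime import date

# Illustrative context examples: (entity_id, anchor_time, actual_outcome).
# KumoRFM gathers these from the graph automatically; they are shown here
# only to make the in-context/test split concrete.
examples = [
    (1, date(2024, 1, 1), 0),
    (2, date(2024, 2, 1), 1),
    (3, date(2024, 3, 1), 1),
    (4, date(2024, 4, 1), 0),
    (5, date(2024, 5, 1), 1),
]

def temporal_holdout(examples, test_fraction=0.2):
    """Hold out the most recent examples as the test set."""
    ordered = sorted(examples, key=lambda e: e[1])
    split = int(len(ordered) * (1 - test_fraction))
    return ordered[:split], ordered[split:]

in_context, test_set = temporal_holdout(examples)
# Predictions are generated for test_set and compared to its known outcomes.
```

With five examples and a 20% test fraction, the four oldest rows stay in context and the newest row is held out for scoring.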

Available Metrics

The metrics returned depend on the detected task type:
| Task Type | Supported Metrics |
| --- | --- |
| Binary Classification | accuracy, precision, recall, f1, mrr, auc |
| Multi-Class Classification | acc, precision, recall, f1, mrr |
| Regression / Forecasting | mae, mape, mse, rmse, smape, r2 |
You can specify which metrics to compute:
metrics = model.evaluate(
    "PREDICT SUM(orders.price, 0, 30, days) FOR items.item_id=42",
    metrics=["mae", "rmse", "r2"],
)
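The regression metrics follow their standard definitions. A minimal pure-Python sketch of three of them (for reference only; this is not the library's internal implementation):

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error: average of |actual - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the mean squared residual."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [10.0, 20.0, 30.0]
y_pred = [12.0, 18.0, 33.0]
print({"mae": mae(y_true, y_pred), "rmse": rmse(y_true, y_pred), "r2": r2(y_true, y_pred)})
```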

Evaluation Parameters

The `KumoRFM.evaluate()` method accepts the same parameters as `KumoRFM.predict()`, plus:
  • metrics: A list of metric names to compute. If not specified, all applicable metrics for the task type are computed.
The `run_mode`, `anchor_time`, `num_hops`, and other parameters work identically to `KumoRFM.predict()`. See configuration for details on run modes.

Evaluation with TaskTable

For advanced use cases, you can construct a TaskTable explicitly and use `KumoRFM.evaluate_task()`:
task = TaskTable(
    task_type="binary_classification",
    context_df=context_dataframe,
    pred_df=prediction_dataframe,
    entity_table_name="users",
    entity_column="user_id",
    target_column="target",
    time_column="timestamp",
)

metrics = model.evaluate_task(task)
This gives you full control over the train/test split and context construction.
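The shape of the two dataframes matters: `context_df` carries known outcomes in the target column alongside the time column, while `pred_df` holds the entities to score. A hypothetical construction with pandas (all column values below are made up for illustration; only the column layout is the point):

```python
import pandas as pd

# Context examples with known outcomes: one row per (entity, anchor time).
context_dataframe = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "timestamp": pd.to_datetime(
        ["2024-01-01", "2024-02-01", "2024-03-01", "2024-04-01"]
    ),
    "target": [0, 1, 1, 0],
})

# Entities to predict for; no target column, since outcomes are held out.
prediction_dataframe = pd.DataFrame({
    "user_id": [5, 6],
    "timestamp": pd.to_datetime(["2024-05-01", "2024-05-01"]),
})
```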

Interpreting Results

The evaluation returns a dictionary mapping metric names to values:
>>> metrics = model.evaluate(query)
>>> print(metrics)
{'mae': 12.5, 'rmse': 15.3, 'r2': 0.82}
Higher values are better for r2, accuracy, precision, recall, f1, mrr, and auc. Lower values are better for mae, mape, mse, rmse, and smape.
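When comparing two evaluation runs (for example, after changing `run_mode`), the direction of each metric determines the winner. A small helper sketch (illustrative, not part of the KumoRFM API):

```python
# Metrics where larger values indicate better predictions; the remaining
# metrics in this guide (mae, mape, mse, rmse, smape) are lower-is-better.
HIGHER_IS_BETTER = {"r2", "accuracy", "acc", "precision", "recall", "f1", "auc", "mrr"}

def better_run(name, value_a, value_b):
    """Return 'a' or 'b' depending on which run wins on this metric."""
    if name in HIGHER_IS_BETTER:
        return "a" if value_a >= value_b else "b"
    return "a" if value_a <= value_b else "b"

run_a = {"mae": 12.5, "r2": 0.82}
run_b = {"mae": 11.9, "r2": 0.79}
winners = {m: better_run(m, run_a[m], run_b[m]) for m in run_a}
print(winners)  # {'mae': 'b', 'r2': 'a'}
```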