What columns should I select in a table?
For best results, make sure that every table column you select for Kumo ingestion meets the following criteria:
- Clean: remove fake or synthetic data, predictions from other ML models, data whose column definition has changed repeatedly over time (especially if a given attribute ID can point to different things at different times), and data known to be otherwise unreliable or frequently inaccurate.
- Relevant and Mutually Exclusive: compute cost grows with graph size (i.e., the combined size of all tables in the graph). To keep training costs down, remove columns that carry similar or duplicated information, irrelevant information, or other extraneous data.
- Complete: the column should cover the full history for the timeframe in question (e.g., the whole record of purchases/interactions rather than only a user’s first or last purchase, or a subscriber’s most recent interaction). If this makes the data set too large, you can provide Kumo with a compressed version that captures changes in aggregate metrics over time (e.g., per day/week/month).
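As an illustration of the compressed version described under "Complete", the sketch below collapses a raw purchase log into one row per user per day before ingestion. It is a minimal example using only the Python standard library, not a Kumo API; the function name `aggregate_daily`, the `(user_id, timestamp, amount)` event layout, and the output column names are all assumptions made for this example.

```python
from collections import defaultdict
from datetime import datetime


def aggregate_daily(events):
    """Collapse raw (user_id, ISO timestamp, amount) events into one row per user per day.

    Hypothetical helper for illustration; the column names are assumptions.
    """
    totals = defaultdict(lambda: {"purchase_count": 0, "total_amount": 0.0})
    for user_id, timestamp, amount in events:
        # Truncate the timestamp to its calendar day to form the aggregation bucket.
        day = datetime.fromisoformat(timestamp).date().isoformat()
        bucket = totals[(user_id, day)]
        bucket["purchase_count"] += 1
        bucket["total_amount"] += amount
    # Emit rows in a stable (user_id, day) order.
    return [
        {"user_id": user_id, "day": day, **metrics}
        for (user_id, day), metrics in sorted(totals.items())
    ]


# Made-up sample data: four raw events become three daily rows.
events = [
    ("u1", "2024-01-01T09:15:00", 20.0),
    ("u1", "2024-01-01T17:40:00", 5.5),
    ("u1", "2024-01-02T08:05:00", 12.0),
    ("u2", "2024-01-01T11:30:00", 7.25),
]
daily = aggregate_daily(events)
for row in daily:
    print(row)
```

Switching to weekly or monthly buckets only requires changing how `day` is derived. The full history is still represented, just at a coarser granularity.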
Using the wrong or unnecessary columns can lead to both degraded model performance and increased training costs.
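One rough way to look for removal candidates before ingestion is to scan a sample of the table for columns that never vary or that exactly duplicate another column. The sketch below is a simple heuristic written with the Python standard library, not part of Kumo; the function name `find_redundant_columns` and the sample rows are hypothetical, and it assumes every row has the same keys and hashable values. Flagged columns still need a human decision, since a constant column in a sample may vary in the full data.

```python
def find_redundant_columns(rows):
    """Return (constant, duplicates) for a list of dict rows that share the same keys.

    constant:   columns holding a single distinct value across all rows
    duplicates: (column, earlier_column) pairs whose values match row for row
    """
    columns = list(rows[0].keys())
    # Represent each column as a tuple of its values so columns can be compared.
    values = {col: tuple(row[col] for row in rows) for col in columns}

    constant = [col for col in columns if len(set(values[col])) == 1]

    first_seen = {}  # column contents -> first column name with those contents
    duplicates = []
    for col in columns:
        if values[col] in first_seen:
            duplicates.append((col, first_seen[values[col]]))
        else:
            first_seen[values[col]] = col
    return constant, duplicates


# Made-up sample: "source" never varies and "country_code" repeats "country".
rows = [
    {"user_id": 1, "country": "US", "country_code": "US", "source": "web"},
    {"user_id": 2, "country": "DE", "country_code": "DE", "source": "web"},
    {"user_id": 3, "country": "US", "country_code": "US", "source": "web"},
]
constant, duplicates = find_redundant_columns(rows)
print("constant columns:", constant)
print("duplicated columns:", duplicates)
```

This only catches exact duplicates. Columns that carry similar but not identical information (for example, a price in two currencies) need a correlation check or domain knowledge instead.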