Ask any VP of Data Science what slows down their ML pipeline and they will say feature engineering. Ask them why, and most will blame tooling, process, or talent. Not enough tooling. Too much coordination overhead. Not enough senior engineers.
They are wrong about the cause. Feature engineering is slow because it is a combinatorial problem disguised as a software engineering problem. No amount of better tools or faster SQL will fix it, because the bottleneck is not execution. It is the impossibility of exploring a feature space that grows exponentially with your database schema.
feature_space_explosion
| Database complexity | Tables | Columns | First-order features | Pairwise interactions | Multi-hop features |
|---|---|---|---|---|---|
| Simple | 3 | 30 | 720 | 258,840 | ~2,000 |
| Moderate | 5 | 50 | 1,200 | 719,400 | ~8,000 |
| Enterprise | 10 | 100 | 2,400 | 2,878,800 | ~50,000 |
| Large enterprise | 15 | 200 | 4,800 | 11,517,600 | ~200,000 |
| Mega-scale | 25 | 500 | 12,000 | 71,994,000 | ~1,000,000+ |
The feature space grows superlinearly with schema complexity. Going from 5 to 15 tables does not 3x the features. It multiplies the pairwise interactions by 16, from roughly 719,000 to 11.5 million. Humans explore 50-200 of these.
human_exploration_rate
| Task | Features built | Feature space | Coverage | Time spent |
|---|---|---|---|---|
| Churn prediction | 127 | 1,200+ | 10.6% | 14.1 hours |
| Fraud detection | 203 | 2,400+ | 8.5% | 18.7 hours |
| CLV prediction | 89 | 1,200+ | 7.4% | 11.2 hours |
| Demand forecasting | 156 | 4,800+ | 3.3% | 16.8 hours |
| Credit risk | 184 | 2,400+ | 7.7% | 15.3 hours |
Highlighted: demand forecasting has the worst coverage (3.3%) because the feature space includes cross-table signals from products, stores, campaigns, suppliers, and customer segments. The most predictive patterns sit in the 96.7% nobody explores.
The math nobody talks about
Take a simple enterprise database. Five tables connected by foreign keys: customers, orders, products, support_tickets, and invoices. Each table has about 10 columns that could be useful for prediction.
To build features for a churn prediction model, a data scientist needs to decide four things for each feature:
- Which table to pull from (5 options)
- Which column to use (10 per table, so 50 total)
- Which aggregation to apply: sum, count, average, max, min, or distinct count (6 options)
- Which time window: 7 days, 30 days, 90 days, or all time (4 options)
The math is straightforward. For first-order features (single column, single aggregation, single window):
50 columns x 6 aggregations x 4 time windows = 1,200 possible features
That is just the first order. Add second-order interactions (ratios between features, differences, products):
1,200 x 1,199 / 2 = 719,400 possible pairwise interactions
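These counts are easy to verify by brute force. A minimal sketch in Python (table and column names are illustrative):

```python
from itertools import product
from math import comb

# Illustrative schema: 5 tables x 10 usable columns = 50 candidate columns
columns = [f"table{t}.col{c}" for t in range(5) for c in range(10)]
aggregations = ["sum", "count", "avg", "max", "min", "count_distinct"]
windows = ["7d", "30d", "90d", "all_time"]

# First-order features: one column, one aggregation, one time window
first_order = list(product(columns, aggregations, windows))
print(len(first_order))           # 50 * 6 * 4 = 1,200

# Pairwise interactions: unordered pairs of first-order features
print(comb(len(first_order), 2))  # 1,200 * 1,199 / 2 = 719,400
```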
Now add multi-hop features that span more than one join. A feature like "average return rate of products this customer has purchased" requires joining customers to orders to products to returns. A feature like "churn rate of customers who bought similar products" requires joining through products back to other customers.
With 5 tables and up to 3 hops, the number of possible join paths alone is in the hundreds. Multiply by columns, aggregations, and time windows, and the total feature space reaches tens of thousands of possible features.
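The join-path count is just as mechanical to check. Here is a sketch that enumerates paths through a hypothetical foreign-key graph for the five tables above (the adjacency is an assumption; real schemas have more edges):

```python
# Hypothetical foreign-key adjacency for the 5-table schema
schema = {
    "customers": ["orders", "support_tickets", "invoices"],
    "orders": ["customers", "products", "invoices"],
    "products": ["orders"],
    "support_tickets": ["customers"],
    "invoices": ["customers", "orders"],
}

def join_paths(path, max_hops):
    """Enumerate join paths of up to `max_hops` edges starting at path[0].
    Revisits are allowed: customers -> orders -> products -> orders is valid."""
    if len(path) - 1 == max_hops:
        return [path]
    found = [path]
    for neighbor in schema[path[-1]]:
        found += join_paths(path + [neighbor], max_hops)
    return found

paths = [p for p in join_paths(["customers"], 3) if len(p) > 1]
print(len(paths))  # 24 paths anchored at customers alone, with only 5 tables
```

Anchor paths at every table, add the extra foreign keys a real schema carries, then multiply by columns, aggregations, and windows, and the tens of thousands follow.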
Why better tools do not help
The instinctive response to "feature engineering is slow" is to build better tools. Faster SQL editors. Feature stores with templates. Low-code platforms. Libraries like Featuretools that auto-generate features.
These tools address the execution time of feature engineering. They make it faster to write and deploy the features you have already decided to build. But execution is not the bottleneck.
The real bottleneck: deciding what to build
A senior data scientist does not spend 12.3 hours per task because SQL is hard. They spend 12.3 hours because they are searching a combinatorial space. They write a feature. They test it. It does not help. They write another. They read the schema. They wonder if there is a multi-hop signal they are missing. They try a different time window. They discover that their aggregation destroyed a temporal pattern. They start over.
This is not a process that gets meaningfully faster with better tooling. It is a search problem, and the search space grows exponentially with database complexity.
Featuretools and automated generation
Libraries like Featuretools try to solve this by generating all possible features programmatically. This is closer to the right idea, but it creates two new problems:
- Feature explosion. Generating all first-order features across 5 tables with all aggregations and time windows produces 1,200+ features, most of which are noise (see the sketch after this list). You need another round of feature selection (using statistical tests or model-based importance) to prune them down. This shifts the bottleneck from generation to selection.
- Fixed aggregation patterns. Automated feature generators apply pre-programmed transformations: sum, count, average, max, min. They cannot learn new types of relationships. They cannot discover that the sequence of purchases matters, not just the count. They cannot learn that the interaction between ticket severity and invoice timing predicts churn.
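A minimal sketch of that generate-then-prune loop, assuming Featuretools' current EntitySet/dfs interface (the toy data is illustrative):

```python
import featuretools as ft
import pandas as pd

# Toy tables; a real schema would have ~10 usable columns per table
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "signup_date": pd.to_datetime(["2024-01-05", "2024-02-10"]),
})
orders = pd.DataFrame({
    "order_id": [10, 11, 12],
    "customer_id": [1, 1, 2],
    "amount": [89.0, 67.0, 34.0],
    "order_date": pd.to_datetime(["2024-03-01", "2024-03-03", "2024-03-02"]),
})

es = ft.EntitySet(id="shop")
es = es.add_dataframe(dataframe_name="customers", dataframe=customers,
                      index="customer_id", time_index="signup_date")
es = es.add_dataframe(dataframe_name="orders", dataframe=orders,
                      index="order_id", time_index="order_date")
es = es.add_relationship("customers", "customer_id", "orders", "customer_id")

# Deep feature synthesis enumerates fixed primitives: the aggregations are
# pre-programmed, and every generated feature still needs downstream selection
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["sum", "count", "mean", "max", "min"],
    max_depth=2,
)
print(len(feature_defs))  # dozens of features even from this toy schema
```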
The three types of signals humans miss
The combinatorial explosion is the structural reason feature engineering is slow. But within that huge feature space, there are three specific categories of signals that are disproportionately predictive and disproportionately missed.
1. Multi-hop relationships
The most valuable patterns often span three or four joins. Consider churn prediction:
- A customer's risk depends on the return rates of the products they bought (3 hops: customer → orders → products → returns)
- A customer's risk depends on the satisfaction of other customers who bought the same products (4 hops: customer → orders → products → orders → customer satisfaction)
- A customer's risk depends on whether the brands they prefer have been declining in quality (4 hops: customer → orders → products → brand → quality metrics)
Nobody writes these features. Not because the SQL is hard, but because the join paths are not intuitive. You have to think "what if the behavior of other people who bought the same things predicts this person's churn?" That is a graph reasoning problem, and humans are bad at graph reasoning over more than 2 hops.
2. Temporal sequences
Aggregation destroys temporal patterns. "5 orders in the last 30 days" tells you nothing about the trajectory. Were those orders evenly spaced? All in the first week? Accelerating? Decelerating?
orders (raw relational data)
| order_id | customer_id | date | amount | week |
|---|---|---|---|---|
| O-401 | C-9901 | Mar 1 | $89 | Week 1 |
| O-402 | C-9901 | Mar 3 | $67 | Week 1 |
| O-403 | C-9901 | Mar 5 | $45 | Week 1 |
| O-404 | C-9901 | Mar 14 | $22 | Week 2 |
| O-405 | C-9901 | Mar 28 | $12 | Week 4 |
| O-501 | C-9903 | Mar 2 | $34 | Week 1 |
| O-502 | C-9903 | Mar 9 | $45 | Week 2 |
| O-503 | C-9903 | Mar 16 | $52 | Week 3 |
| O-504 | C-9903 | Mar 23 | $68 | Week 4 |
| O-505 | C-9903 | Mar 30 | $81 | Week 5 |
Highlighted: C-9901 placed 3 orders in week 1 then trailed off. Amounts declined from $89 to $12. C-9903 placed 1 order per week with amounts growing from $34 to $81. Both have 5 orders in 30 days.
flat_feature_table (what the model sees)
| customer_id | orders_30d | avg_order_value | total_spend | days_since_last | reality |
|---|---|---|---|---|---|
| C-9901 | 5 | $47.00 | $235 | 2 | Disengaging (churn risk) |
| C-9903 | 5 | $56.00 | $280 | 0 | Accelerating (growth) |
Both customers show orders_30d = 5 and similar averages. The flat table cannot represent 'front-loaded and declining' vs 'evenly-spaced and growing.' The temporal trajectory is the signal. The count is noise.
The temporal sequence is highly predictive. A customer whose order frequency is declining by 15% per week is very different from one whose frequency is stable, even if the 30-day aggregate looks identical.
To capture this manually, you need to build features like order_frequency_week_1, order_frequency_week_2, order_frequency_week_3, order_frequency_week_4, and then a frequency_trend derived from the slope. For every metric, across every time granularity. This is where the 878 lines of code come from.
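A minimal sketch of those hand-built trajectory features for the two customers above (order data taken from the table; np.polyfit fits the least-squares slope):

```python
import numpy as np

# (week placed, order amount) from the orders table above
c9901_weeks, c9901_amounts = [1, 1, 1, 2, 4], [89, 67, 45, 22, 12]
c9903_weeks, c9903_amounts = [1, 2, 3, 4, 5], [34, 45, 52, 68, 81]

def trajectory_features(weeks, amounts, n_weeks=5):
    # order_frequency_week_1 .. order_frequency_week_5
    freq = [weeks.count(w) for w in range(1, n_weeks + 1)]
    # frequency_trend and amount_trend: slopes of least-squares fits
    freq_trend = float(np.polyfit(range(1, n_weeks + 1), freq, 1)[0])
    amount_trend = float(np.polyfit(range(len(amounts)), amounts, 1)[0])
    return freq, round(freq_trend, 1), round(amount_trend, 1)

print(trajectory_features(c9901_weeks, c9901_amounts))
# ([3, 1, 0, 1, 0], -0.6, -19.9)  front-loaded and declining
print(trajectory_features(c9903_weeks, c9903_amounts))
# ([1, 1, 1, 1, 1], 0.0, 11.7)    evenly spaced and growing
```

The two customers are identical on every aggregate the flat table holds; the slopes are what separate them.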
3. Cross-table interactions
The most predictive signals are often interactions between features from different tables. A customer with high usage but declining invoice amounts (tables: usage_events + invoices) is different from one with high usage and stable invoices. A customer who filed a support ticket right after a product downgrade (tables: support_tickets + subscriptions) has a different risk profile than one who filed a ticket for a different reason.
These cross-table interactions exist in the relational structure of the database. But after flattening, they only exist if someone thought to build them. And the number of possible interactions is the square of the number of base features.
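A sketch of one such interaction, assuming per-customer aggregates have already been pulled from hypothetical usage_events and invoices tables:

```python
import pandas as pd

# Per-customer aggregates from two different tables (values illustrative)
usage = pd.DataFrame({"customer_id": ["C-1", "C-2"],
                      "events_30d": [940, 910]})            # from usage_events
billing = pd.DataFrame({"customer_id": ["C-1", "C-2"],
                        "invoice_trend": [-0.18, 0.02]})    # from invoices

df = usage.merge(billing, on="customer_id")

# The signal lives in the combination: heavy usage with shrinking invoices
# suggests a plan downgrade; neither table exposes this on its own
df["usage_billing_divergence"] = df["events_30d"] * df["invoice_trend"]
print(df)
# C-1: high usage, declining billing -> downgrade/churn risk
# C-2: high usage, stable billing    -> healthy account
```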
What humans explore
- 50-200 features per task (4-17% of first-order space)
- Mostly single-table aggregates
- 1-2 hop join paths
- Pre-defined time windows (7d, 30d, 90d)
- Manual interaction features (if any)
What the model sees
- Full combinatorial feature space
- Cross-table patterns across all tables
- Multi-hop paths up to 4+ hops
- Continuous temporal dynamics, not fixed windows
- Learned interactions at every layer
Why the problem gets worse over time
The combinatorial explosion has two growth dimensions, and enterprises are expanding on both.
Database complexity is increasing
Modern enterprise databases are getting more connected, not less. Event streaming adds high-frequency tables. Product catalogs get more granular. Customer interaction data spans email, chat, web, mobile, and in-store. A database that had 5 tables and 50 columns in 2015 might have 15 tables and 200 columns in 2026.
The feature space scales superlinearly with schema complexity. Going from 5 tables to 15 tables does not triple the feature space. It increases it by an order of magnitude because of the combinatorial growth in join paths and interactions.
Prediction tasks are multiplying
Ten years ago, a data science team might maintain 3 to 5 models. Today, the same team is expected to deliver predictions for churn, upsell, cross-sell, fraud, credit risk, demand forecasting, personalization, campaign targeting, price optimization, and more. Each task requires its own round of feature engineering.
If each task takes 12.3 hours of feature engineering and a team maintains 20 tasks, that is 246 hours, or roughly 6 weeks of a senior data scientist's time, just on feature engineering. Before any modeling, evaluation, or deployment.
multi-hop_signals_missed
| Hop depth | Example signal | Tables traversed | Ever built manually |
|---|---|---|---|
| 1 hop | Customer order count last 30d | customers → orders | Always |
| 2 hops | Return rate of the customer's own orders | customers → orders → returns | Sometimes |
| 3 hops | Return rate of purchased products across all buyers | customers → orders → products → returns | Rarely |
| 4 hops | Brand quality trend for preferred brands | customers → orders → products → brands → quality | Almost never |
Highlighted: 3-hop and 4-hop signals are disproportionately predictive but almost never built by humans. The join paths are not intuitive. You have to think 'what if other people who bought the same things predict this person's behavior?' That is graph reasoning.
PQL Query
PREDICT churn FOR EACH customers.customer_id WITHIN 30 days
One line explores the entire combinatorial feature space. The foundation model automatically discovers multi-hop patterns, temporal sequences, and cross-table interactions that 12.3 hours of manual feature engineering misses.
Output
| customer_id | churn_prob | top_multi_hop_signal | hops |
|---|---|---|---|
| C-9901 | 0.91 | Products purchased have rising return rates from other buyers | 3 |
| C-9902 | 0.73 | Preferred brand declining in quality metrics | 4 |
| C-9903 | 0.22 | Similar-product buyers are stable | 4 |
| C-9904 | 0.08 | Product and brand health strong across network | 4 |
The structural solution
The solution to a combinatorial problem is not faster enumeration. It is a different approach that does not require enumeration at all.
Relational Deep Learning (ICML 2024) showed that you can represent a relational database as a temporal heterogeneous graph and train a graph neural network directly on that structure. The GNN explores the relational space through learned message-passing functions, effectively searching the combinatorial feature space during training.
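For intuition, here is a minimal sketch of that idea, assuming PyTorch Geometric (node and edge types are illustrative, and this is not KumoRFM's actual architecture):

```python
import torch
from torch_geometric.nn import HeteroConv, SAGEConv

# Each table becomes a node type; each foreign key becomes an edge type,
# with a reverse edge so messages flow in both directions
edge_types = [
    ("customers", "places", "orders"),
    ("orders", "placed_by", "customers"),
    ("orders", "contains", "products"),
    ("products", "contained_in", "orders"),
]

class RelationalGNN(torch.nn.Module):
    """k layers of message passing cover join paths up to k hops,
    with learned aggregations instead of hand-picked sum/avg/max."""
    def __init__(self, hidden_dim=64, num_layers=2):
        super().__init__()
        self.layers = torch.nn.ModuleList([
            HeteroConv(
                {et: SAGEConv((-1, -1), hidden_dim) for et in edge_types},
                aggr="sum",
            )
            for _ in range(num_layers)
        ])

    def forward(self, x_dict, edge_index_dict):
        for conv in self.layers:
            x_dict = {t: h.relu() for t, h in conv(x_dict, edge_index_dict).items()}
        return x_dict  # one embedding per row, e.g. x_dict["customers"]
```

The churn label supervises the customer embeddings directly; no feature table is ever materialized.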
KumoRFM goes further. As a foundation model pre-trained on thousands of databases, it has already learned which types of relational patterns are predictive. It does not need to search the feature space for your database from scratch. It recognizes recency effects, frequency patterns, temporal dynamics, and multi-hop graph structures because it has seen them thousands of times before.
The result: the time per prediction task drops from 12.3 hours to under 1 second. Not because the tool is faster, but because the problem it solves is different. It is not "generate features faster." It is "learn from relational data without generating features."
What this means for your team
If your data science team is spending 80% of their time on feature engineering, you are not under-tooled. You are applying a linear process to an exponential problem. Better SQL editors and feature stores will give you an incremental improvement. Eliminating the feature engineering step gives you a structural one.
The combinatorial explosion is not going away. Your databases are getting more complex, and your business wants more prediction tasks. The only viable long-term strategy is to stop enumerating features and start using models that learn from relational structure directly.
That is not a sales pitch. It is the math.