Ask any VP of Data Science what slows down their ML pipeline and they will say feature engineering. Ask them why, and most will blame tooling, process, or talent. Not enough tooling. Too much coordination overhead. Not enough senior engineers.
They are wrong about the cause. Feature engineering is slow because it is a combinatorial problem disguised as a software engineering problem. No amount of better tools or faster SQL will fix it, because the bottleneck is not execution. It is the impossibility of exploring a feature space that grows exponentially with your database schema.
feature_space_explosion
| Database complexity | Tables | Columns | First-order features | Pairwise interactions | Multi-hop features |
|---|---|---|---|---|---|
| Simple | 3 | 30 | 720 | 258,840 | ~2,000 |
| Moderate | 5 | 50 | 1,200 | 719,400 | ~8,000 |
| Enterprise | 10 | 100 | 2,400 | 2,878,800 | ~50,000 |
| Large enterprise | 15 | 200 | 4,800 | 11,517,600 | ~200,000 |
| Mega-scale | 25 | 500 | 12,000 | 71,994,000 | ~1,000,000+ |
The feature space grows superlinearly with schema complexity. Going from 5 to 15 tables does not 3x the features. It multiplies the pairwise interactions by 16, from roughly 719,000 to 11.5 million. Humans explore 50-200 of these.
human_exploration_rate
| Task | Features built | Feature space | Coverage | Time spent |
|---|---|---|---|---|
| Churn prediction | 127 | 1,200+ | 10.6% | 14.1 hours |
| Fraud detection | 203 | 2,400+ | 8.5% | 18.7 hours |
| CLV prediction | 89 | 1,200+ | 7.4% | 11.2 hours |
| Demand forecasting | 156 | 4,800+ | 3.3% | 16.8 hours |
| Credit risk | 184 | 2,400+ | 7.7% | 15.3 hours |
Highlighted: demand forecasting has the worst coverage (3.3%) because the feature space includes cross-table signals from products, stores, campaigns, suppliers, and customer segments. The most predictive patterns sit in the 96.7% nobody explores.
The math nobody talks about
Take a simple enterprise database. Five tables connected by foreign keys: customers, orders, products, support_tickets, and invoices. Each table has about 10 columns that could be useful for prediction.
To build features for a churn prediction model, a data scientist needs to decide four things for each feature:
- Which table to pull from (5 options)
- Which column to use (10 per table, so 50 total)
- Which aggregation to apply: sum, count, average, max, min, or distinct count (6 options)
- Which time window: 7 days, 30 days, 90 days, or all time (4 options)
The math is straightforward. For first-order features (single column, single aggregation, single window):
50 columns x 6 aggregations x 4 time windows = 1,200 possible features
That is just the first order. Add second-order interactions (ratios between features, differences, products):
1,200 x 1,199 / 2 = 719,400 possible pairwise interactions
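These counts are easy to verify by brute force. A minimal sketch in Python (table and column names are illustrative):

```python
from itertools import product
from math import comb

# Illustrative schema: 5 tables x 10 usable columns = 50 candidate columns
columns = [f"table{t}.col{c}" for t in range(5) for c in range(10)]
aggregations = ["sum", "count", "avg", "max", "min", "count_distinct"]
windows = ["7d", "30d", "90d", "all_time"]

# First-order features: one column, one aggregation, one time window
first_order = list(product(columns, aggregations, windows))
print(len(first_order))           # 50 * 6 * 4 = 1,200

# Pairwise interactions: unordered pairs of first-order features
print(comb(len(first_order), 2))  # 1,200 * 1,199 / 2 = 719,400
```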
Now add multi-hop features that span more than one join. A feature like "average return rate of products this customer has purchased" requires joining customers to orders to products to returns. A feature like "churn rate of customers who bought similar products" requires joining through products back to other customers.
With 5 tables and up to 3 hops, the number of possible join paths alone is in the hundreds. Multiply by columns, aggregations, and time windows, and the total feature space reaches tens of thousands of possible features.
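The join-path count is just as mechanical to check. Here is a sketch that enumerates paths through a hypothetical foreign-key graph for the five tables above (the adjacency is an assumption; real schemas have more edges):

```python
# Hypothetical foreign-key adjacency for the 5-table schema
schema = {
    "customers": ["orders", "support_tickets", "invoices"],
    "orders": ["customers", "products", "invoices"],
    "products": ["orders"],
    "support_tickets": ["customers"],
    "invoices": ["customers", "orders"],
}

def join_paths(path, max_hops):
    """Enumerate join paths of up to `max_hops` edges starting at path[0].
    Revisits are allowed: customers -> orders -> products -> orders is valid."""
    if len(path) - 1 == max_hops:
        return [path]
    found = [path]
    for neighbor in schema[path[-1]]:
        found += join_paths(path + [neighbor], max_hops)
    return found

paths = [p for p in join_paths(["customers"], 3) if len(p) > 1]
print(len(paths))  # 24 paths anchored at customers alone, with only 5 tables
```

Anchor paths at every table, add the extra foreign keys a real schema carries, then multiply by columns, aggregations, and windows, and the tens of thousands follow.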
Why better tools do not help
The instinctive response to "feature engineering is slow" is to build better tools. Faster SQL editors. Feature stores with templates. Low-code platforms. Libraries like Featuretools that auto-generate features.
These tools address the execution time of feature engineering. They make it faster to write and deploy the features you have already decided to build. But execution is not the bottleneck.
The real bottleneck: deciding what to build
A senior data scientist does not spend 12.3 hours per task because SQL is hard. They spend 12.3 hours because they are searching a combinatorial space. They write a feature. They test it. It does not help. They write another. They read the schema. They wonder if there is a multi-hop signal they are missing. They try a different time window. They discover that their aggregation destroyed a temporal pattern. They start over.
This is not a process that gets meaningfully faster with better tooling. It is a search problem, and the search space grows exponentially with database complexity.
Featuretools and automated generation
Libraries like Featuretools try to solve this by generating all possible features programmatically. This is closer to the right idea, but it creates two new problems:
- Feature explosion. Generating all first-order features across 5 tables with all aggregations and time windows produces 1,200+ features, most of which are noise (see the sketch after this list). You need another round of feature selection (using statistical tests or model-based importance) to prune them down. This shifts the bottleneck from generation to selection.
- Fixed aggregation patterns. Automated feature generators apply pre-programmed transformations: sum, count, average, max, min. They cannot learn new types of relationships. They cannot discover that the sequence of purchases matters, not just the count. They cannot learn that the interaction between ticket severity and invoice timing predicts churn.
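A minimal sketch of that generate-then-prune loop, assuming Featuretools' current EntitySet/dfs interface (the toy data is illustrative):

```python
import featuretools as ft
import pandas as pd

# Toy tables; a real schema would have ~10 usable columns per table
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "signup_date": pd.to_datetime(["2024-01-05", "2024-02-10"]),
})
orders = pd.DataFrame({
    "order_id": [10, 11, 12],
    "customer_id": [1, 1, 2],
    "amount": [89.0, 67.0, 34.0],
    "order_date": pd.to_datetime(["2024-03-01", "2024-03-03", "2024-03-02"]),
})

es = ft.EntitySet(id="shop")
es = es.add_dataframe(dataframe_name="customers", dataframe=customers,
                      index="customer_id", time_index="signup_date")
es = es.add_dataframe(dataframe_name="orders", dataframe=orders,
                      index="order_id", time_index="order_date")
es = es.add_relationship("customers", "customer_id", "orders", "customer_id")

# Deep feature synthesis enumerates fixed primitives: the aggregations are
# pre-programmed, and every generated feature still needs downstream selection
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["sum", "count", "mean", "max", "min"],
    max_depth=2,
)
print(len(feature_defs))  # dozens of features even from this toy schema
```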
The three types of signals humans miss
The combinatorial explosion is the structural reason feature engineering is slow. But within that huge feature space, there are three specific categories of signals that are disproportionately predictive and disproportionately missed.
1. Multi-hop relationships
The most valuable patterns often span three or four joins. Consider churn prediction:
- A customer's risk depends on the return rates of the products they bought (3 hops: customer → orders → products → returns)
- A customer's risk depends on the satisfaction of other customers who bought the same products (4 hops: customer → orders → products → orders → customer satisfaction)
- A customer's risk depends on whether the brands they prefer have been declining in quality (4 hops: customer → orders → products → brand → quality metrics)
Nobody writes these features. Not because the SQL is hard, but because the join paths are not intuitive. You have to think "what if the behavior of other people who bought the same things predicts this person's churn?" That is a graph reasoning problem, and humans are bad at graph reasoning over more than 2 hops.
2. Temporal sequences
Aggregation destroys temporal patterns. "5 orders in the last 30 days" tells you nothing about the trajectory. Were those orders evenly spaced? All in the first week? Accelerating? Decelerating?
orders (raw relational data)
| order_id | customer_id | date | amount | week |
|---|---|---|---|---|
| O-401 | C-9901 | Mar 1 | $89 | Week 1 |
| O-402 | C-9901 | Mar 3 | $67 | Week 1 |
| O-403 | C-9901 | Mar 5 | $45 | Week 1 |
| O-404 | C-9901 | Mar 14 | $22 | Week 2 |
| O-405 | C-9901 | Mar 28 | $12 | Week 4 |
| O-501 | C-9903 | Mar 2 | $34 | Week 1 |
| O-502 | C-9903 | Mar 9 | $45 | Week 2 |
| O-503 | C-9903 | Mar 16 | $52 | Week 3 |
| O-504 | C-9903 | Mar 23 | $68 | Week 4 |
| O-505 | C-9903 | Mar 30 | $81 | Week 5 |
Highlighted: C-9901 placed 3 orders in week 1 then trailed off. Amounts declined from $89 to $12. C-9903 placed 1 order per week with amounts growing from $34 to $81. Both have 5 orders in 30 days.
flat_feature_table (what the model sees)
| customer_id | orders_30d | avg_order_value | total_spend | days_since_last | reality |
|---|---|---|---|---|---|
| C-9901 | 5 | $47.00 | $235 | 2 | Disengaging (churn risk) |
| C-9903 | 5 | $56.00 | $280 | 0 | Accelerating (growth) |
Both customers show orders_30d = 5 and similar averages. The flat table cannot represent 'front-loaded and declining' vs 'evenly-spaced and growing.' The temporal trajectory is the signal. The count is noise.
The temporal sequence is highly predictive. A customer whose order frequency is declining by 15% per week is very different from one whose frequency is stable, even if the 30-day aggregate looks identical.
To capture this manually, you need to build features like order_frequency_week_1, order_frequency_week_2, order_frequency_week_3, order_frequency_week_4, and then a frequency_trend derived from the slope. For every metric, across every time granularity. This is where the 878 lines of code come from.
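A minimal sketch of those hand-built trajectory features for the two customers above (order data taken from the table; np.polyfit fits the least-squares slope):

```python
import numpy as np

# (week placed, order amount) from the orders table above
c9901_weeks, c9901_amounts = [1, 1, 1, 2, 4], [89, 67, 45, 22, 12]
c9903_weeks, c9903_amounts = [1, 2, 3, 4, 5], [34, 45, 52, 68, 81]

def trajectory_features(weeks, amounts, n_weeks=5):
    # order_frequency_week_1 .. order_frequency_week_5
    freq = [weeks.count(w) for w in range(1, n_weeks + 1)]
    # frequency_trend and amount_trend: slopes of least-squares fits
    freq_trend = float(np.polyfit(range(1, n_weeks + 1), freq, 1)[0])
    amount_trend = float(np.polyfit(range(len(amounts)), amounts, 1)[0])
    return freq, round(freq_trend, 1), round(amount_trend, 1)

print(trajectory_features(c9901_weeks, c9901_amounts))
# ([3, 1, 0, 1, 0], -0.6, -19.9)  front-loaded and declining
print(trajectory_features(c9903_weeks, c9903_amounts))
# ([1, 1, 1, 1, 1], 0.0, 11.7)    evenly spaced and growing
```

The two customers are identical on every aggregate the flat table holds; the slopes are what separate them.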
3. Cross-table interactions
The most predictive signals are often interactions between features from different tables. A customer with high usage but declining invoice amounts (tables: usage_events + invoices) is different from one with high usage and stable invoices. A customer who filed a support ticket right after a product downgrade (tables: support_tickets + subscriptions) has a different risk profile than one who filed a ticket for a different reason.
These cross-table interactions exist in the relational structure of the database. But after flattening, they only exist if someone thought to build them. And the number of possible interactions is the square of the number of base features.
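A sketch of one such interaction, assuming per-customer aggregates have already been pulled from hypothetical usage_events and invoices tables:

```python
import pandas as pd

# Per-customer aggregates from two different tables (values illustrative)
usage = pd.DataFrame({"customer_id": ["C-1", "C-2"],
                      "events_30d": [940, 910]})            # from usage_events
billing = pd.DataFrame({"customer_id": ["C-1", "C-2"],
                        "invoice_trend": [-0.18, 0.02]})    # from invoices

df = usage.merge(billing, on="customer_id")

# The signal lives in the combination: heavy usage with shrinking invoices
# suggests a plan downgrade; neither table exposes this on its own
df["usage_billing_divergence"] = df["events_30d"] * df["invoice_trend"]
print(df)
# C-1: high usage, declining billing -> downgrade/churn risk
# C-2: high usage, stable billing    -> healthy account
```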
What humans explore
- 50-200 features per task (4-17% of first-order space)
- Mostly single-table aggregates
- 1-2 hop join paths
- Pre-defined time windows (7d, 30d, 90d)
- Manual interaction features (if any)
What the model sees
- Full combinatorial feature space
- Cross-table patterns across all tables
- Multi-hop paths up to 4+ hops
- Continuous temporal dynamics, not fixed windows
- Learned interactions at every layer
Why the problem gets worse over time
The combinatorial explosion has two growth dimensions, and enterprises are expanding on both.
Database complexity is increasing
Modern enterprise databases are getting more connected, not less. Event streaming adds high-frequency tables. Product catalogs get more granular. Customer interaction data spans email, chat, web, mobile, and in-store. A database that had 5 tables and 50 columns in 2015 might have 15 tables and 200 columns in 2026.
The feature space scales superlinearly with schema complexity. Going from 5 tables to 15 tables does not triple the feature space. It increases it by an order of magnitude because of the combinatorial growth in join paths and interactions.
Prediction tasks are multiplying
Ten years ago, a data science team might maintain 3 to 5 models. Today, the same team is expected to deliver predictions for churn, upsell, cross-sell, fraud, credit risk, demand forecasting, personalization, campaign targeting, price optimization, and more. Each task requires its own round of feature engineering.
If each task takes 12.3 hours of feature engineering and a team maintains 20 tasks, that is 246 hours, or roughly 6 weeks of a senior data scientist's time, just on feature engineering. Before any modeling, evaluation, or deployment.
multi-hop_signals_missed
| Hop depth | Example signal | Tables traversed | Ever built manually |
|---|---|---|---|
| 1 hop | Customer order count last 30d | customers → orders | Always |
| 2 hops | Return rate of the customer's own orders | customers → orders → returns | Sometimes |
| 3 hops | Return rate of purchased products across all buyers | customers → orders → products → returns | Rarely |
| 4 hops | Brand quality trend for preferred brands | customers → orders → products → brands → quality | Almost never |
Highlighted: 3-hop and 4-hop signals are disproportionately predictive but almost never built by humans. The join paths are not intuitive. You have to think 'what if other people who bought the same things predict this person's behavior?' That is graph reasoning.
PQL Query
PREDICT churn FOR EACH customers.customer_id WITHIN 30 days
One line explores the entire combinatorial feature space. The foundation model automatically discovers multi-hop patterns, temporal sequences, and cross-table interactions that 12.3 hours of manual feature engineering misses.
Output
| customer_id | churn_prob | top_multi_hop_signal | hops |
|---|---|---|---|
| C-9901 | 0.91 | Products purchased have rising return rates from other buyers | 3 |
| C-9902 | 0.73 | Preferred brand declining in quality metrics | 4 |
| C-9903 | 0.22 | Similar-product buyers are stable | 4 |
| C-9904 | 0.08 | Product and brand health strong across network | 4 |
The structural solution
The solution to a combinatorial problem is not faster enumeration. It is a different approach that does not require enumeration at all.
Relational Deep Learning (ICML 2024) showed that you can represent a relational database as a temporal heterogeneous graph and train a graph neural network directly on that structure. The GNN explores the relational space through learned message-passing functions, effectively searching the combinatorial feature space during training.
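For intuition, here is a minimal sketch of that idea, assuming PyTorch Geometric (node and edge types are illustrative, and this is not KumoRFM's actual architecture):

```python
import torch
from torch_geometric.nn import HeteroConv, SAGEConv

# Each table becomes a node type; each foreign key becomes an edge type,
# with a reverse edge so messages flow in both directions
edge_types = [
    ("customers", "places", "orders"),
    ("orders", "placed_by", "customers"),
    ("orders", "contains", "products"),
    ("products", "contained_in", "orders"),
]

class RelationalGNN(torch.nn.Module):
    """k layers of message passing cover join paths up to k hops,
    with learned aggregations instead of hand-picked sum/avg/max."""
    def __init__(self, hidden_dim=64, num_layers=2):
        super().__init__()
        self.layers = torch.nn.ModuleList([
            HeteroConv(
                {et: SAGEConv((-1, -1), hidden_dim) for et in edge_types},
                aggr="sum",
            )
            for _ in range(num_layers)
        ])

    def forward(self, x_dict, edge_index_dict):
        for conv in self.layers:
            x_dict = {t: h.relu() for t, h in conv(x_dict, edge_index_dict).items()}
        return x_dict  # one embedding per row, e.g. x_dict["customers"]
```

The churn label supervises the customer embeddings directly; no feature table is ever materialized.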
KumoRFM goes further. As a foundation model pre-trained on thousands of databases, it has already learned which types of relational patterns are predictive. It does not need to search the feature space for your database from scratch. It recognizes recency effects, frequency patterns, temporal dynamics, and multi-hop graph structures because it has seen them thousands of times before.
The result: the time per prediction task drops from 12.3 hours to under 1 second. Not because the tool is faster, but because the problem it solves is different. It is not "generate features faster." It is "learn from relational data without generating features."
What this means for your team
If your data science team is spending 80% of their time on feature engineering, you are not under-tooled. You are applying a linear process to an exponential problem. Better SQL editors and feature stores will give you an incremental improvement. Eliminating the feature engineering step gives you a structural one.
The combinatorial explosion is not going away. Your databases are getting more complex, and your business wants more prediction tasks. The only viable long-term strategy is to stop enumerating features and start using models that learn from relational structure directly.
That is not a sales pitch. It is the math.