Feature engineering is the art of converting raw data into numbers that machine learning models can learn from. It is called an "art" because, despite decades of practice, there is no systematic method for doing it well. A Stanford study measured what it costs: 12.3 hours and 878 lines of code per prediction task for experienced data scientists working on standard relational databases.
This guide is structured in three parts. First, the complete taxonomy of feature types and techniques. Second, the tools available to accelerate or automate the process. Third, the paradigm shift that makes the entire discipline optional for the most common enterprise ML use cases. For a focused overview of the bottleneck problem, see our companion article on feature engineering as the 80% bottleneck.
Part 1: Feature types and techniques
Numerical features
Numerical features start as numbers but often need transformation before models can use them effectively.
Scaling. Most ML algorithms (linear regression, SVMs, neural networks) are sensitive to feature scale. A feature ranging from 0-1,000,000 will dominate one ranging from 0-1. Standard approaches: min-max scaling (normalize to [0,1]), standardization (subtract mean, divide by standard deviation), and robust scaling (use median and IQR, resistant to outliers). Tree-based models (XGBoost, LightGBM) do not require scaling, which is one reason they dominate tabular ML.
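The three scaling approaches can be sketched in a few lines of NumPy (the income values here are made up for illustration):

```python
import numpy as np

# Hypothetical income values in dollars, with one large outlier.
income = np.array([20_000.0, 45_000.0, 60_000.0, 500_000.0])

# Min-max scaling: map the observed range onto [0, 1].
min_max = (income - income.min()) / (income.max() - income.min())

# Standardization: zero mean, unit variance (z-scores).
z_scores = (income - income.mean()) / income.std()

# Robust scaling: center on the median, divide by the IQR,
# so the $500K outlier has far less influence.
q1, q3 = np.percentile(income, [25, 75])
robust = (income - np.median(income)) / (q3 - q1)
```

In a real pipeline the scaler's statistics must be fit on the training set only and reused at inference time (scikit-learn's `MinMaxScaler`, `StandardScaler`, and `RobustScaler` handle this bookkeeping).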
Binning. Converting continuous values into discrete buckets. Age into age groups, income into income brackets, transaction amounts into small/medium/large. Binning sacrifices precision for noise reduction and can capture nonlinear thresholds (risk increases sharply above a specific credit utilization threshold). Equal-width bins, quantile bins, and domain-specific cutoffs are the common approaches.
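All three binning strategies are one-liners in pandas; the ages and utilization values below are illustrative:

```python
import pandas as pd

ages = pd.Series([22, 35, 47, 61, 78])

# Equal-width bins: three equal intervals over the observed range.
equal_width = pd.cut(ages, bins=3)

# Quantile bins: each bucket holds roughly the same number of rows.
quantile = pd.qcut(ages, q=3)

# Domain-specific cutoffs, e.g. credit-utilization risk bands.
util = pd.Series([0.05, 0.25, 0.55, 0.85])
risk_band = pd.cut(util, bins=[0, 0.30, 0.70, 1.0],
                   labels=["low", "medium", "high"])
```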
Log and power transforms. Many financial and count variables are heavily right-skewed: most transactions are small, few are large. Log transformation compresses the tail and makes the distribution more symmetric, which helps linear models. Box-Cox and Yeo-Johnson transforms generalize this by learning the optimal power transformation from the data.
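A quick sketch of the compression effect, including the `log1p` variant that sidesteps the zero-value problem:

```python
import numpy as np

# A $1-$1M range collapses to roughly 0-6 under log10.
amounts = np.array([1.0, 10.0, 100.0, 1_000_000.0])
log10_amounts = np.log10(amounts)

# log1p(x) = log(1 + x) is defined at zero, unlike a plain log.
with_zeros = np.array([0.0, 9.0, 99.0])
log1p_amounts = np.log1p(with_zeros)
```

For the learned power transforms mentioned above, scikit-learn's `PowerTransformer` implements both Box-Cox (positive data only) and Yeo-Johnson (handles zeros and negatives).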
Interaction features. The product or ratio of two features can capture relationships that neither feature captures alone. Revenue per employee, price per square foot, clicks per impression. Domain knowledge guides which interactions to create. For n features, there are n(n-1)/2 pairwise interactions, which quickly becomes unmanageable without automated selection.
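A minimal sketch of both flavors, with made-up company metrics: one domain-guided ratio, plus the exhaustive pairwise products whose count grows as n(n-1)/2:

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({
    "revenue": [1_200_000, 850_000],
    "employees": [40, 10],
    "sqft": [5_000, 2_000],
})

# A domain-guided ratio feature.
df["revenue_per_employee"] = df["revenue"] / df["employees"]

# Exhaustive pairwise products: n(n-1)/2 of them -- 3 features
# for these 3 columns, but 4,950 for 100 columns.
for a, b in combinations(["revenue", "employees", "sqft"], 2):
    df[f"{a}_x_{b}"] = df[a] * df[b]
```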
Numerical feature techniques
| Technique | When to Use | Example | Risk |
|---|---|---|---|
| Min-Max Scaling | Neural networks, SVMs | Income: $0-500K -> 0.0-1.0 | Sensitive to outliers |
| Standardization | Linear models, PCA | Age: mean=35, std=12 -> z-scores | Assumes normality |
| Log Transform | Right-skewed distributions | Transaction: $1-$1M -> 0-6 | Fails on zero values |
| Binning | Nonlinear thresholds | Credit util: 0-30%, 30-70%, 70%+ | Information loss |
| Interaction Features | Domain-specific ratios | Revenue/Employee, Price/SqFt | Combinatorial explosion |
As noted above, these techniques matter primarily for neural networks and linear models; tree-based models such as XGBoost and LightGBM are insensitive to feature scale.
Categorical features
Machine learning models require numerical input. Categorical variables (country, product category, day of week) must be encoded as numbers without introducing false ordinal relationships.
One-hot encoding. Create a binary column for each category value. A "color" column with values red, blue, green becomes three binary columns. Simple and lossless, but impractical for high-cardinality features: a "user_id" column with 10 million unique values would create 10 million binary columns.
Target encoding. Replace each category with the mean of the target variable for that category. If users from zipcode 94105 have a 25% churn rate, encode 94105 as 0.25. Powerful but dangerous: without regularization (smoothing toward the global mean, adding noise, or using out-of-fold encoding), it causes severe overfitting because the model is directly seeing a leaky version of the target.
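A minimal sketch of target encoding with smoothing toward the global mean, on a toy churn table (in practice you would also compute the encodings out-of-fold so no row sees its own target):

```python
import pandas as pd

df = pd.DataFrame({
    "zipcode": ["94105", "94105", "94105", "10001", "10001", "60601"],
    "churned": [1, 0, 1, 0, 0, 1],
})

global_mean = df["churned"].mean()
smoothing = 2.0  # pseudo-count; higher values shrink harder

stats = df.groupby("zipcode")["churned"].agg(["mean", "count"])
# Categories with few rows are pulled toward the global mean,
# which limits overfitting on rare zipcodes.
encoded = (stats["mean"] * stats["count"] + global_mean * smoothing) / (
    stats["count"] + smoothing
)
df["zip_te"] = df["zipcode"].map(encoded)
```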
Frequency encoding. Replace each category with its occurrence count or frequency. Works well when the frequency itself is predictive (popular products have different return patterns than niche products) but fails when frequency is orthogonal to the target.
Embeddings. Learn a dense vector representation for each category value through a neural network. This is how modern recommendation systems handle user and item IDs: each ID maps to a 64-256 dimensional vector that captures the entity's behavior in compressed form. Requires sufficient data per category to learn meaningful representations.
Categorical encoding methods
| Method | Max Cardinality | Pros | Cons |
|---|---|---|---|
| One-Hot | ~20 categories | Lossless, simple | Explodes with high cardinality |
| Target Encoding | Unlimited | Powerful signal | Overfitting risk without CV |
| Frequency Encoding | Unlimited | Simple, no leakage | Fails when frequency != signal |
| Hash Encoding | Unlimited | Fixed output size | Collisions lose information |
| Learned Embeddings | Unlimited | Best signal capture | Requires neural network |
For high-cardinality features (user IDs, SKUs), embeddings or target encoding with regularization are the standard approaches.
Temporal features
Time is one of the most information-dense dimensions in enterprise data, and one of the hardest to feature-engineer correctly.
Time-windowed aggregations. Count, sum, average, min, max of a metric over a time window. "Number of transactions in the last 7 days," "average order value in the last 90 days." The choice of window sizes is critical and usually determined through trial and error. A window too short captures noise; a window too long smooths out real changes.
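With a datetime index, pandas computes trailing-window aggregations directly; the transaction dates and amounts below are invented for illustration:

```python
import pandas as pd

txns = pd.DataFrame(
    {"amount": [50.0, 20.0, 75.0, 10.0, 200.0]},
    index=pd.to_datetime(
        ["2025-01-01", "2025-01-03", "2025-01-05",
         "2025-01-09", "2025-01-12"]
    ),
)

# Count and average over a trailing 7-day window,
# evaluated once per transaction.
txns["count_7d"] = txns["amount"].rolling("7D").count()
txns["avg_7d"] = txns["amount"].rolling("7D").mean()
```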
Lag features. The value of a metric at a previous time step. "Revenue last month," "inventory level 4 weeks ago." Lag features capture autocorrelation, which is the tendency of metrics to persist over time.
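Lag features are a `shift` in pandas; note that the earliest rows have no history and come out as NaN (toy revenue numbers for illustration):

```python
import pandas as pd

monthly = pd.DataFrame({
    "month": pd.period_range("2025-01", periods=4, freq="M"),
    "revenue": [100.0, 120.0, 90.0, 110.0],
})

# Revenue one and two months ago.
monthly["revenue_lag_1"] = monthly["revenue"].shift(1)
monthly["revenue_lag_2"] = monthly["revenue"].shift(2)
```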
Rolling statistics. Moving average, moving standard deviation, exponential moving average. These capture trends and volatility. A rising 30-day moving average of transaction amounts signals increasing engagement. Rising volatility signals instability.
Cyclical encoding. Days of the week, months, and hours are cyclical: Monday is close to Sunday, December is close to January. Encoding them as integers (1-7 or 1-12) introduces false distance between adjacent cycle endpoints. Sine/cosine encoding preserves cyclical proximity: encode day d as (sin(2πd/7), cos(2πd/7)).
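A short check that the sine/cosine pair really does preserve cyclical proximity, assuming days numbered 0 (Monday) through 6 (Sunday):

```python
import numpy as np

days = np.arange(7)  # 0 = Monday ... 6 = Sunday
sin_d = np.sin(2 * np.pi * days / 7)
cos_d = np.cos(2 * np.pi * days / 7)

def dist(i, j):
    """Euclidean distance between two days in the encoded 2-D space."""
    return np.hypot(sin_d[i] - sin_d[j], cos_d[i] - cos_d[j])

# Sunday-to-Monday is now exactly as close as any adjacent pair,
# whereas integer encoding would put them 6 apart.
```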
Time-since features. Duration since an event. "Days since last purchase," "hours since last login," "weeks since last support ticket." These capture recency, one of the strongest predictive signals in most enterprise contexts.
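Recency features are simple timestamp arithmetic; the reference date and purchase dates here are made up:

```python
import pandas as pd

now = pd.Timestamp("2025-03-15")
last_purchase = pd.to_datetime(
    pd.Series(["2025-03-10", "2025-01-02", "2024-11-30"])
)

# Days since each customer's last purchase.
days_since = (now - last_purchase).dt.days
```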
Text features
Enterprise data increasingly includes text: support tickets, product descriptions, reviews, call transcripts, notes.
TF-IDF. Term frequency-inverse document frequency converts text into a sparse vector where each dimension represents a word, weighted by how important it is to a specific document relative to the corpus. Simple, interpretable, and effective for many classification tasks.
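A deliberately simplified, dependency-free sketch of the core computation on toy support-ticket text (a production pipeline would use scikit-learn's `TfidfVectorizer`, which adds smoothing and vector normalization):

```python
import math
from collections import Counter

docs = [
    "refund request for damaged item",
    "refund processed refund complete",
    "login error on mobile app",
]
tokenized = [d.split() for d in docs]

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(t for doc in tokenized for t in set(doc))
n_docs = len(docs)

def tfidf(doc):
    # Term frequency weighted by inverse document frequency.
    tf = Counter(doc)
    return {t: (count / len(doc)) * math.log(n_docs / doc_freq[t])
            for t, count in tf.items()}

vectors = [tfidf(doc) for doc in tokenized]
```

The effect is visible immediately: "login" (which appears in one document) gets a higher weight than "refund" (which appears in two).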
Word embeddings. Word2Vec, GloVe, or FastText map words to dense vectors that capture semantic similarity. "King" and "queen" have similar vectors. Text features are created by averaging word embeddings across the document.
Transformer embeddings. BERT, RoBERTa, or sentence transformers produce contextual embeddings that capture meaning at the sentence or document level. The current gold standard for text feature quality, but computationally expensive.
Temporal feature examples
| Feature Type | SQL Example | Signal Captured | Information Lost |
|---|---|---|---|
| Window Aggregation | COUNT(*) OVER (7d) | Recent activity level | Sequence pattern |
| Lag Feature | revenue_last_month | Autocorrelation | Trend direction |
| Rolling Average | AVG(amount) OVER (30d) | Trend level | Volatility spikes |
| Time-Since | DATEDIFF(NOW(), last_order) | Recency signal | Frequency context |
| Cyclical Encoding | sin(2*pi*day/7) | Day-of-week effect | None (lossless) |
Each temporal technique captures one signal while destroying others. A foundation model preserves the full temporal sequence.
Relational features
Relational features are the most valuable and the most difficult to engineer. They capture patterns across multiple connected tables.
Cross-table aggregations. "Total spend by this customer across all product categories," "number of distinct merchants this card has transacted with," "average rating of products purchased by this user." Each requires a join, a groupby, and an aggregation function. For n tables with m join paths, the number of possible aggregations scales combinatorially.
customers
| customer_id | name | segment | signup_date |
|---|---|---|---|
| C-101 | Acme Corp | Enterprise | 2023-01-15 |
| C-102 | Beta LLC | Mid-Market | 2023-06-20 |
orders
| order_id | customer_id | product_id | amount | date |
|---|---|---|---|---|
| O-501 | C-101 | P-10 | $2,400 | 2025-01-10 |
| O-502 | C-101 | P-11 | $800 | 2025-02-15 |
| O-503 | C-102 | P-10 | $2,400 | 2025-01-22 |
| O-504 | C-102 | P-12 | $150 | 2025-03-01 |
products
| product_id | name | category | avg_review |
|---|---|---|---|
| P-10 | Platform License | Core | 4.5 |
| P-11 | Analytics Add-On | Expansion | 4.8 |
| P-12 | Basic Reports | Self-Serve | 2.9 |
flat_feature_table (what gets built from the 3 tables above)
| customer_id | order_count | total_spend | avg_order | product_count | churned? |
|---|---|---|---|---|---|
| C-101 | 2 | $3,200 | $1,600 | 2 | No |
| C-102 | 2 | $2,550 | $1,275 | 2 | Yes |
Both customers have 2 orders and 2 products. The flat table cannot show that C-101 bought the expansion add-on (4.8 rating) while C-102 bought the low-rated self-serve product (2.9 rating). The product quality signal -- a strong churn predictor -- is lost.
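The flat feature table above can be reproduced with a single pandas groupby over the orders table, which makes the information loss concrete: once the aggregation runs, which products each customer bought (and their review scores) is unrecoverable from the result.

```python
import pandas as pd

# The orders table from the example above.
orders = pd.DataFrame({
    "order_id": ["O-501", "O-502", "O-503", "O-504"],
    "customer_id": ["C-101", "C-101", "C-102", "C-102"],
    "product_id": ["P-10", "P-11", "P-10", "P-12"],
    "amount": [2400, 800, 2400, 150],
})

flat = orders.groupby("customer_id").agg(
    order_count=("order_id", "count"),
    total_spend=("amount", "sum"),
    avg_order=("amount", "mean"),
    product_count=("product_id", "nunique"),
)
# C-101 and C-102 are now indistinguishable on count and
# product_count; the product-quality signal is gone.
```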
Multi-hop features. "Average churn rate of customers who bought the same products as this customer." This requires joining through customer → orders → products → orders (of other customers) → customer churn status. A 4-table join that most data scientists never write because they never think to look for it.
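For intuition, here is what that multi-hop feature looks like in pandas on a toy three-customer dataset (a self-join through shared products, then an aggregation over the neighbors' churn labels); at enterprise scale this becomes an expensive many-to-many join:

```python
import pandas as pd

orders = pd.DataFrame({
    "customer_id": ["C-101", "C-101", "C-102", "C-102", "C-103"],
    "product_id": ["P-10", "P-11", "P-10", "P-12", "P-11"],
})
churn = pd.DataFrame({
    "customer_id": ["C-101", "C-102", "C-103"],
    "churned": [0, 1, 0],
})

# Hops 1-2: pairs of distinct customers connected by a shared product.
pairs = orders.merge(orders, on="product_id")
pairs = pairs[pairs["customer_id_x"] != pairs["customer_id_y"]]

# Hops 3-4: attach each neighbor's churn label, average per customer.
neighbor_churn = (
    pairs.merge(churn, left_on="customer_id_y", right_on="customer_id")
    .groupby("customer_id_x")["churned"]
    .mean()
)
```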
Graph features. Node degree (how many connections), centrality metrics (how important is this node in the network), community membership, and PageRank-style influence scores. These require computing graph metrics on the relational structure, which is outside the typical SQL workflow.
PQL Query
PREDICT COUNT(orders.*, 0, 30) > 0 FOR EACH customers.customer_id
Instead of engineering relational features manually, a foundation model discovers all cross-table patterns automatically. This PQL query replaces hundreds of lines of SQL joins and aggregations.
Output
| customer_id | prediction | confidence | top_cross_table_signal |
|---|---|---|---|
| C-4821 | 0.87 | high | Similar customers churning (3-hop) |
| C-1093 | 0.42 | medium | Product return rate rising (2-hop) |
| C-7756 | 0.15 | high | Support CSAT above avg (2-hop) |
| C-3302 | 0.94 | high | Payment method removed (1-hop) |
Part 2: Tools for feature engineering
SQL and Python
The baseline. Most feature engineering happens in SQL (for cross-table joins and aggregations) and Python/Pandas (for transformations and encoding). This is the most flexible approach but the most labor-intensive. The 12.3 hours and 878 lines of code from the Stanford study represent this approach.
Featuretools
The leading automated feature engineering library for relational data. Featuretools implements "deep feature synthesis": given a set of tables and their relationships, it automatically generates features by applying aggregation primitives (sum, count, mean, max, min) across join paths, and transform primitives (year, month, weekday) on individual columns. It can stack these operations to create multi-hop features.
The strength: it dramatically reduces manual effort and can discover features that humans overlook. The weakness: feature explosion. On a 5-table database, Featuretools can generate 10,000+ features, most of which are noise. Feature selection becomes the new bottleneck.
Feature stores
Tecton, Feast, and Hopsworks solve the operational side of feature management: storing feature values, serving them at low latency for real-time inference, ensuring consistency between training and serving, and tracking feature lineage and versioning. Enterprise feature stores cost $100K-500K/year.
Feature stores are essential for production ML systems. But they are infrastructure for managing features, not creating them. A feature store without good features is an expensive, empty database.
Domain-specific tools
tsfresh: Automated extraction of 794 time-series features with built-in significance testing. Effective for sensor data and IoT applications.
Feature-engine: A scikit-learn-compatible library with transformers for encoding, discretization, outlier handling, and feature selection. Clean API for building reproducible feature engineering pipelines.
Category Encoders: 15+ encoding methods for categorical variables, from simple (one-hot, ordinal) to advanced (target encoding, leave-one-out, CatBoost encoding).
Manual feature engineering
- 12.3 hours and 878 lines of code per task
- Explores tiny fraction of feature space
- Multi-hop features rarely attempted
- Temporal sequences destroyed by aggregation
- Each new task starts from scratch
Foundation model (no feature engineering)
- Connect database, specify target, get predictions
- Full relational structure explored automatically
- Multi-hop patterns captured by graph attention
- Temporal sequences preserved and learned from
- Same model handles any prediction task
Part 3: The paradigm shift
Every technique and tool described above shares an assumption: raw data must be converted into a flat numerical table before a model can learn from it. This assumption was true for every ML algorithm from logistic regression through gradient-boosted trees. It is no longer true.
Relational Deep Learning, published at ICML 2024, demonstrated that relational databases can be represented as temporal heterogeneous graphs (rows are nodes, foreign keys are edges, timestamps order events) and that graph neural networks trained on this structure outperform manually engineered features.
The RelBench benchmark quantifies the gap. Across 7 databases, 30 tasks, and 103 million rows: LightGBM with manually engineered features achieves 62.44 AUROC. A supervised GNN on the relational graph achieves 75.83 AUROC. KumoRFM zero-shot achieves 76.71 AUROC. KumoRFM fine-tuned achieves 81.14 AUROC.
The model that reads raw relational data outperforms the model that uses carefully engineered features by 14+ AUROC points. And it does so with zero feature engineering, zero per-task training (in the zero-shot case), and zero SQL.
RelBench: feature engineering vs. learned features
| Approach | AUROC (Avg) | Features Explored | Time per Task | LOC per Task |
|---|---|---|---|---|
| Manual FE + LightGBM | 62.44 | 50-200 (human-selected) | 12.3 hours | 878 |
| Featuretools + LightGBM | ~64-66 | 1,000-10,000 (auto) | 2-4 hours | 50-100 |
| Supervised GNN | 75.83 | All (graph-learned) | Hours-days | ~200 |
| KumoRFM zero-shot | 76.71 | All (pre-trained) | Seconds | 1 |
| KumoRFM fine-tuned | 81.14 | All (adapted) | 2-8 hours | 1 |
RelBench benchmark: 7 databases, 30 tasks, 103M+ rows. The model that skips feature engineering outperforms the model built on manually engineered features.
Why the model wins
Three structural advantages explain the gap.
Coverage. A human data scientist explores dozens to hundreds of features. A graph transformer attends to every relationship in the subgraph. It evaluates multi-hop paths, temporal sequences, and cross-table interactions that no human would enumerate.
Preservation. Feature engineering destroys information through aggregation. "Five orders in 30 days" loses the sequence, the spacing, the trend. The graph transformer sees the raw sequence of events with timestamps, preserving the temporal dynamics.
Transfer. A pre-trained foundation model has seen patterns across thousands of databases. It knows that recency, frequency, monetary value, graph centrality, and temporal acceleration are predictive universally. When it sees a new database, it applies these learned priors. A human data scientist on a new database starts from scratch.
When you still need feature engineering
Foundation models do not eliminate feature engineering for every ML task. Computer vision models still need image preprocessing (resizing, normalization, augmentation). NLP models still need tokenization and text cleaning. Reinforcement learning environments still need state representation engineering.
For structured, relational, and tabular data, which accounts for an estimated 80% of enterprise ML use cases, the evidence is clear: learning directly from the data structure outperforms manually engineering features from it. If your data lives in a relational database and your goal is prediction, feature engineering is now optional.
Practical recommendations
If you are starting a new prediction project on relational data: try a foundation model first. It will give you a baseline in minutes. If the accuracy meets your needs, you are done. If it does not, the foundation model's results tell you how much room for improvement exists and whether custom feature engineering is likely to close the gap.
If you have existing ML pipelines with extensive feature engineering: benchmark them against a foundation model on the same tasks. In many cases, the foundation model matches or exceeds the custom pipeline's accuracy while eliminating the maintenance burden. Migrating saves $300K-500K per year per use case in maintenance costs alone.
If you are teaching or learning feature engineering: understand the techniques. They remain valuable for building intuition about data and for the domains where they are still necessary. But also understand that for relational data, the field is moving toward models that learn features automatically, and the practitioner who can leverage both approaches will be more effective than one who can only do manual engineering.
The complete guide to feature engineering ends with a surprising conclusion: for the majority of enterprise ML, the best feature engineering strategy is no feature engineering at all.
PQL Query
PREDICT AVG(orders.amount, 0, 90) FOR EACH products.product_id
Predicting 90-day average order value per product. The model automatically discovers which customer segments, seasonal patterns, and cross-product signals drive the prediction.
Output
| product_id | predicted_avg_value | current_avg | key_driver |
|---|---|---|---|
| SKU-8847 | $142.30 | $128.50 | Holiday season approaching |
| SKU-2291 | $34.80 | $67.20 | Competitor price drop detected |
| SKU-5510 | $89.00 | $85.00 | Repeat purchasers increasing |
| SKU-1123 | $210.50 | $195.00 | Bundle attachment rate rising |