Feature engineering is the art of converting raw data into numbers that machine learning models can learn from. It is called an "art" because, despite decades of practice, there is no systematic method for doing it well. A Stanford study measured what it costs: 12.3 hours and 878 lines of code per prediction task for experienced data scientists working on standard relational databases.
This guide is structured in three parts. First, the complete taxonomy of feature types and techniques. Second, the tools available to accelerate or automate the process. Third, the paradigm shift that makes the entire discipline optional for the most common enterprise ML use cases. For a focused overview of the bottleneck problem, see our companion article on feature engineering as the 80% bottleneck.
Part 1: Feature types and techniques
Numerical features
Numerical features start as numbers but often need transformation before models can use them effectively.
Scaling. Most ML algorithms (linear regression, SVMs, neural networks) are sensitive to feature scale. A feature ranging from 0-1,000,000 will dominate one ranging from 0-1. Standard approaches: min-max scaling (normalize to [0,1]), standardization (subtract mean, divide by standard deviation), and robust scaling (use median and IQR, resistant to outliers). Tree-based models (XGBoost, LightGBM) do not require scaling, which is one reason they dominate tabular ML.
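The three scaling approaches can be sketched in a few lines of NumPy (the income values here are made up for illustration):

```python
import numpy as np

# Hypothetical income values in dollars, with one large outlier.
income = np.array([20_000.0, 45_000.0, 60_000.0, 500_000.0])

# Min-max scaling: map the observed range onto [0, 1].
min_max = (income - income.min()) / (income.max() - income.min())

# Standardization: zero mean, unit variance (z-scores).
z_scores = (income - income.mean()) / income.std()

# Robust scaling: center on the median, divide by the IQR,
# so the $500K outlier has far less influence.
q1, q3 = np.percentile(income, [25, 75])
robust = (income - np.median(income)) / (q3 - q1)
```

In a real pipeline the scaler's statistics must be fit on the training set only and reused at inference time (scikit-learn's `MinMaxScaler`, `StandardScaler`, and `RobustScaler` handle this bookkeeping).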
Binning. Converting continuous values into discrete buckets. Age into age groups, income into income brackets, transaction amounts into small/medium/large. Binning sacrifices precision for noise reduction and can capture nonlinear thresholds (risk increases sharply above a specific credit utilization threshold). Equal-width bins, quantile bins, and domain-specific cutoffs are the common approaches.
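All three binning strategies are one-liners in pandas; the ages and utilization values below are illustrative:

```python
import pandas as pd

ages = pd.Series([22, 35, 47, 61, 78])

# Equal-width bins: three equal intervals over the observed range.
equal_width = pd.cut(ages, bins=3)

# Quantile bins: each bucket holds roughly the same number of rows.
quantile = pd.qcut(ages, q=3)

# Domain-specific cutoffs, e.g. credit-utilization risk bands.
util = pd.Series([0.05, 0.25, 0.55, 0.85])
risk_band = pd.cut(util, bins=[0, 0.30, 0.70, 1.0],
                   labels=["low", "medium", "high"])
```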
Log and power transforms. Many financial and count variables are heavily right-skewed: most transactions are small, few are large. Log transformation compresses the tail and makes the distribution more symmetric, which helps linear models. Box-Cox and Yeo-Johnson transforms generalize this by learning the optimal power transformation from the data.
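A quick sketch of the compression effect, including the `log1p` variant that sidesteps the zero-value problem:

```python
import numpy as np

# A $1-$1M range collapses to roughly 0-6 under log10.
amounts = np.array([1.0, 10.0, 100.0, 1_000_000.0])
log10_amounts = np.log10(amounts)

# log1p(x) = log(1 + x) is defined at zero, unlike a plain log.
with_zeros = np.array([0.0, 9.0, 99.0])
log1p_amounts = np.log1p(with_zeros)
```

For the learned power transforms mentioned above, scikit-learn's `PowerTransformer` implements both Box-Cox (positive data only) and Yeo-Johnson (handles zeros and negatives).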
Interaction features. The product or ratio of two features can capture relationships that neither feature captures alone. Revenue per employee, price per square foot, clicks per impression. Domain knowledge guides which interactions to create. For n features, there are n(n-1)/2 pairwise interactions, which quickly becomes unmanageable without automated selection.
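A minimal sketch of both flavors, with made-up company metrics: one domain-guided ratio, plus the exhaustive pairwise products whose count grows as n(n-1)/2:

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({
    "revenue": [1_200_000, 850_000],
    "employees": [40, 10],
    "sqft": [5_000, 2_000],
})

# A domain-guided ratio feature.
df["revenue_per_employee"] = df["revenue"] / df["employees"]

# Exhaustive pairwise products: n(n-1)/2 of them -- 3 features
# for these 3 columns, but 4,950 for 100 columns.
for a, b in combinations(["revenue", "employees", "sqft"], 2):
    df[f"{a}_x_{b}"] = df[a] * df[b]
```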
Numerical feature techniques
| Technique | When to Use | Example | Risk |
|---|---|---|---|
| Min-Max Scaling | Neural networks, SVMs | Income: $0-500K -> 0.0-1.0 | Sensitive to outliers |
| Standardization | Linear models, PCA | Age: mean=35, std=12 -> z-scores | Assumes normality |
| Log Transform | Right-skewed distributions | Transaction: $1-$1M -> 0-6 | Fails on zero values |
| Binning | Nonlinear thresholds | Credit util: 0-30%, 30-70%, 70%+ | Information loss |
| Interaction Features | Domain-specific ratios | Revenue/Employee, Price/SqFt | Combinatorial explosion |
As noted above, these techniques matter primarily for neural networks and linear models; tree-based models such as XGBoost and LightGBM are insensitive to feature scale.
Categorical features
Machine learning models require numerical input. Categorical variables (country, product category, day of week) must be encoded as numbers without introducing false ordinal relationships.
One-hot encoding. Create a binary column for each category value. A "color" column with values red, blue, green becomes three binary columns. Simple and lossless, but impractical for high-cardinality features: a "user_id" column with 10 million unique values would create 10 million binary columns.
Target encoding. Replace each category with the mean of the target variable for that category. If users from zipcode 94105 have a 25% churn rate, encode 94105 as 0.25. Powerful but dangerous: without regularization (smoothing toward the global mean, adding noise, or using out-of-fold encoding), it causes severe overfitting because the model is directly seeing a leaky version of the target.
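A minimal sketch of target encoding with smoothing toward the global mean, on a toy churn table (in practice you would also compute the encodings out-of-fold so no row sees its own target):

```python
import pandas as pd

df = pd.DataFrame({
    "zipcode": ["94105", "94105", "94105", "10001", "10001", "60601"],
    "churned": [1, 0, 1, 0, 0, 1],
})

global_mean = df["churned"].mean()
smoothing = 2.0  # pseudo-count; higher values shrink harder

stats = df.groupby("zipcode")["churned"].agg(["mean", "count"])
# Categories with few rows are pulled toward the global mean,
# which limits overfitting on rare zipcodes.
encoded = (stats["mean"] * stats["count"] + global_mean * smoothing) / (
    stats["count"] + smoothing
)
df["zip_te"] = df["zipcode"].map(encoded)
```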
Frequency encoding. Replace each category with its occurrence count or frequency. Works well when the frequency itself is predictive (popular products have different return patterns than niche products) but fails when frequency is orthogonal to the target.
Embeddings. Learn a dense vector representation for each category value through a neural network. This is how modern recommendation systems handle user and item IDs: each ID maps to a 64-256 dimensional vector that captures the entity's behavior in compressed form. Requires sufficient data per category to learn meaningful representations.
Categorical encoding methods
| Method | Max Cardinality | Pros | Cons |
|---|---|---|---|
| One-Hot | ~20 categories | Lossless, simple | Explodes with high cardinality |
| Target Encoding | Unlimited | Powerful signal | Overfitting risk without CV |
| Frequency Encoding | Unlimited | Simple, no leakage | Fails when frequency != signal |
| Hash Encoding | Unlimited | Fixed output size | Collisions lose information |
| Learned Embeddings | Unlimited | Best signal capture | Requires neural network |
For high-cardinality features (user IDs, SKUs), embeddings or target encoding with regularization are the standard approaches.
Temporal features
Time is one of the most information-dense dimensions in enterprise data, and one of the hardest to feature-engineer correctly.
Time-windowed aggregations. Count, sum, average, min, max of a metric over a time window. "Number of transactions in the last 7 days," "average order value in the last 90 days." The choice of window sizes is critical and usually determined through trial and error. A window too short captures noise; a window too long smooths out real changes.
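With a datetime index, pandas computes trailing-window aggregations directly; the transaction dates and amounts below are invented for illustration:

```python
import pandas as pd

txns = pd.DataFrame(
    {"amount": [50.0, 20.0, 75.0, 10.0, 200.0]},
    index=pd.to_datetime(
        ["2025-01-01", "2025-01-03", "2025-01-05",
         "2025-01-09", "2025-01-12"]
    ),
)

# Count and average over a trailing 7-day window,
# evaluated once per transaction.
txns["count_7d"] = txns["amount"].rolling("7D").count()
txns["avg_7d"] = txns["amount"].rolling("7D").mean()
```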
Lag features. The value of a metric at a previous time step. "Revenue last month," "inventory level 4 weeks ago." Lag features capture autocorrelation, which is the tendency of metrics to persist over time.
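Lag features are a `shift` in pandas; note that the earliest rows have no history and come out as NaN (toy revenue numbers for illustration):

```python
import pandas as pd

monthly = pd.DataFrame({
    "month": pd.period_range("2025-01", periods=4, freq="M"),
    "revenue": [100.0, 120.0, 90.0, 110.0],
})

# Revenue one and two months ago.
monthly["revenue_lag_1"] = monthly["revenue"].shift(1)
monthly["revenue_lag_2"] = monthly["revenue"].shift(2)
```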
Rolling statistics. Moving average, moving standard deviation, exponential moving average. These capture trends and volatility. A rising 30-day moving average of transaction amounts signals increasing engagement. Rising volatility signals instability.
Cyclical encoding. Days of the week, months, and hours are cyclical: Monday is close to Sunday, December is close to January. Encoding them as integers (1-7 or 1-12) introduces false distance between adjacent cycle endpoints. Sine/cosine encoding preserves cyclical proximity: encode day d as (sin(2πd/7), cos(2πd/7)).
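A short check that the sine/cosine pair really does preserve cyclical proximity, assuming days numbered 0 (Monday) through 6 (Sunday):

```python
import numpy as np

days = np.arange(7)  # 0 = Monday ... 6 = Sunday
sin_d = np.sin(2 * np.pi * days / 7)
cos_d = np.cos(2 * np.pi * days / 7)

def dist(i, j):
    """Euclidean distance between two days in the encoded 2-D space."""
    return np.hypot(sin_d[i] - sin_d[j], cos_d[i] - cos_d[j])

# Sunday-to-Monday is now exactly as close as any adjacent pair,
# whereas integer encoding would put them 6 apart.
```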
Time-since features. Duration since an event. "Days since last purchase," "hours since last login," "weeks since last support ticket." These capture recency, one of the strongest predictive signals in most enterprise contexts.
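Recency features are simple timestamp arithmetic; the reference date and purchase dates here are made up:

```python
import pandas as pd

now = pd.Timestamp("2025-03-15")
last_purchase = pd.to_datetime(
    pd.Series(["2025-03-10", "2025-01-02", "2024-11-30"])
)

# Days since each customer's last purchase.
days_since = (now - last_purchase).dt.days
```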
Text features
Enterprise data increasingly includes text: support tickets, product descriptions, reviews, call transcripts, notes.
TF-IDF. Term frequency-inverse document frequency converts text into a sparse vector where each dimension represents a word, weighted by how important it is to a specific document relative to the corpus. Simple, interpretable, and effective for many classification tasks.
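A deliberately simplified, dependency-free sketch of the core computation on toy support-ticket text (a production pipeline would use scikit-learn's `TfidfVectorizer`, which adds smoothing and vector normalization):

```python
import math
from collections import Counter

docs = [
    "refund request for damaged item",
    "refund processed refund complete",
    "login error on mobile app",
]
tokenized = [d.split() for d in docs]

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(t for doc in tokenized for t in set(doc))
n_docs = len(docs)

def tfidf(doc):
    # Term frequency weighted by inverse document frequency.
    tf = Counter(doc)
    return {t: (count / len(doc)) * math.log(n_docs / doc_freq[t])
            for t, count in tf.items()}

vectors = [tfidf(doc) for doc in tokenized]
```

The effect is visible immediately: "login" (which appears in one document) gets a higher weight than "refund" (which appears in two).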
Word embeddings. Word2Vec, GloVe, or FastText map words to dense vectors that capture semantic similarity. "King" and "queen" have similar vectors. Text features are created by averaging word embeddings across the document.
Transformer embeddings. BERT, RoBERTa, or sentence transformers produce contextual embeddings that capture meaning at the sentence or document level. The current gold standard for text feature quality, but computationally expensive.
Temporal feature examples
| Feature Type | SQL Example | Signal Captured | Information Lost |
|---|---|---|---|
| Window Aggregation | COUNT(*) OVER (7d) | Recent activity level | Sequence pattern |
| Lag Feature | revenue_last_month | Autocorrelation | Trend direction |
| Rolling Average | AVG(amount) OVER (30d) | Trend level | Volatility spikes |
| Time-Since | DATEDIFF(NOW(), last_order) | Recency signal | Frequency context |
| Cyclical Encoding | sin(2*pi*day/7) | Day-of-week effect | None (lossless) |
Each temporal technique captures one signal while destroying others. A foundation model preserves the full temporal sequence.
Relational features
Relational features are the most valuable and the most difficult to engineer. They capture patterns across multiple connected tables.
Cross-table aggregations. "Total spend by this customer across all product categories," "number of distinct merchants this card has transacted with," "average rating of products purchased by this user." Each requires a join, a groupby, and an aggregation function. For n tables with m join paths, the number of possible aggregations scales combinatorially.
customers
| customer_id | name | segment | signup_date |
|---|---|---|---|
| C-101 | Acme Corp | Enterprise | 2023-01-15 |
| C-102 | Beta LLC | Mid-Market | 2023-06-20 |
orders
| order_id | customer_id | product_id | amount | date |
|---|---|---|---|---|
| O-501 | C-101 | P-10 | $2,400 | 2025-01-10 |
| O-502 | C-101 | P-11 | $800 | 2025-02-15 |
| O-503 | C-102 | P-10 | $2,400 | 2025-01-22 |
| O-504 | C-102 | P-12 | $150 | 2025-03-01 |
products
| product_id | name | category | avg_review |
|---|---|---|---|
| P-10 | Platform License | Core | 4.5 |
| P-11 | Analytics Add-On | Expansion | 4.8 |
| P-12 | Basic Reports | Self-Serve | 2.9 |
flat_feature_table (what gets built from the 3 tables above)
| customer_id | order_count | total_spend | avg_order | product_count | churned? |
|---|---|---|---|---|---|
| C-101 | 2 | $3,200 | $1,600 | 2 | No |
| C-102 | 2 | $2,550 | $1,275 | 2 | Yes |
Both customers have 2 orders and 2 products. The flat table cannot show that C-101 bought the expansion add-on (4.8 rating) while C-102 bought the low-rated self-serve product (2.9 rating). The product quality signal -- a strong churn predictor -- is lost.
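The flat feature table above can be reproduced with a single pandas groupby over the orders table, which makes the information loss concrete: once the aggregation runs, which products each customer bought (and their review scores) is unrecoverable from the result.

```python
import pandas as pd

# The orders table from the example above.
orders = pd.DataFrame({
    "order_id": ["O-501", "O-502", "O-503", "O-504"],
    "customer_id": ["C-101", "C-101", "C-102", "C-102"],
    "product_id": ["P-10", "P-11", "P-10", "P-12"],
    "amount": [2400, 800, 2400, 150],
})

flat = orders.groupby("customer_id").agg(
    order_count=("order_id", "count"),
    total_spend=("amount", "sum"),
    avg_order=("amount", "mean"),
    product_count=("product_id", "nunique"),
)
# C-101 and C-102 are now indistinguishable on count and
# product_count; the product-quality signal is gone.
```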
Multi-hop features. "Average churn rate of customers who bought the same products as this customer." This requires joining through customer → orders → products → orders (of other customers) → customer churn status. A 4-table join that most data scientists never write because they never think to look for it.
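For intuition, here is what that multi-hop feature looks like in pandas on a toy three-customer dataset (a self-join through shared products, then an aggregation over the neighbors' churn labels); at enterprise scale this becomes an expensive many-to-many join:

```python
import pandas as pd

orders = pd.DataFrame({
    "customer_id": ["C-101", "C-101", "C-102", "C-102", "C-103"],
    "product_id": ["P-10", "P-11", "P-10", "P-12", "P-11"],
})
churn = pd.DataFrame({
    "customer_id": ["C-101", "C-102", "C-103"],
    "churned": [0, 1, 0],
})

# Hops 1-2: pairs of distinct customers connected by a shared product.
pairs = orders.merge(orders, on="product_id")
pairs = pairs[pairs["customer_id_x"] != pairs["customer_id_y"]]

# Hops 3-4: attach each neighbor's churn label, average per customer.
neighbor_churn = (
    pairs.merge(churn, left_on="customer_id_y", right_on="customer_id")
    .groupby("customer_id_x")["churned"]
    .mean()
)
```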
Graph features. Node degree (how many connections), centrality metrics (how important is this node in the network), community membership, and PageRank-style influence scores. These require computing graph metrics on the relational structure, which is outside the typical SQL workflow.
PQL Query
PREDICT COUNT(orders.*, 0, 30) > 0 FOR EACH customers.customer_id
Instead of engineering relational features manually, a foundation model discovers all cross-table patterns automatically. This PQL query replaces hundreds of lines of SQL joins and aggregations.
Output
| customer_id | prediction | confidence | top_cross_table_signal |
|---|---|---|---|
| C-4821 | 0.87 | high | Similar customers churning (3-hop) |
| C-1093 | 0.42 | medium | Product return rate rising (2-hop) |
| C-7756 | 0.15 | high | Support CSAT above avg (2-hop) |
| C-3302 | 0.94 | high | Payment method removed (1-hop) |
Part 2: Tools for feature engineering
SQL and Python
The baseline. Most feature engineering happens in SQL (for cross-table joins and aggregations) and Python/Pandas (for transformations and encoding). This is the most flexible approach but the most labor-intensive. The 12.3 hours and 878 lines of code from the Stanford study represent this approach.
Featuretools
The leading automated feature engineering library for relational data. Featuretools implements "deep feature synthesis": given a set of tables and their relationships, it automatically generates features by applying aggregation primitives (sum, count, mean, max, min) across join paths, and transform primitives (year, month, weekday) on individual columns. It can stack these operations to create multi-hop features.
The strength: it dramatically reduces manual effort and can discover features that humans overlook. The weakness: feature explosion. On a 5-table database, Featuretools can generate 10,000+ features, most of which are noise. Feature selection becomes the new bottleneck.
Feature stores
Tecton, Feast, and Hopsworks solve the operational side of feature management: storing feature values, serving them at low latency for real-time inference, ensuring consistency between training and serving, and tracking feature lineage and versioning. Enterprise feature stores cost $100K-500K/year.
Feature stores are essential for production ML systems. But they are infrastructure for managing features, not creating them. A feature store without good features is an expensive, empty database.
Domain-specific tools
tsfresh: Automated extraction of 794 time-series features with built-in significance testing. Effective for sensor data and IoT applications.
Feature-engine: A scikit-learn-compatible library with transformers for encoding, discretization, outlier handling, and feature selection. Clean API for building reproducible feature engineering pipelines.
Category Encoders: 15+ encoding methods for categorical variables, from simple (one-hot, ordinal) to advanced (target encoding, leave-one-out, CatBoost encoding).
Manual feature engineering
- 12.3 hours and 878 lines of code per task
- Explores tiny fraction of feature space
- Multi-hop features rarely attempted
- Temporal sequences destroyed by aggregation
- Each new task starts from scratch
Foundation model (no feature engineering)
- Connect database, specify target, get predictions
- Full relational structure explored automatically
- Multi-hop patterns captured by graph attention
- Temporal sequences preserved and learned from
- Same model handles any prediction task
Part 3: The paradigm shift
Every technique and tool described above shares an assumption: raw data must be converted into a flat numerical table before a model can learn from it. This assumption was true for every ML algorithm from logistic regression through gradient-boosted trees. It is no longer true.
Relational Deep Learning, published at ICML 2024, demonstrated that relational databases can be represented as temporal heterogeneous graphs (rows are nodes, foreign keys are edges, timestamps order events) and that graph neural networks trained on this structure outperform manually engineered features.
The RelBench benchmark quantifies the gap. Across 7 databases, 30 tasks, and 103 million rows: LightGBM with manually engineered features achieves 62.44 AUROC. A supervised GNN on the relational graph achieves 75.83 AUROC. KumoRFM zero-shot achieves 76.71 AUROC. KumoRFM fine-tuned achieves 81.14 AUROC.
The model that reads raw relational data outperforms the model that uses carefully engineered features by 14+ AUROC points. And it does so with zero feature engineering, zero per-task training (in the zero-shot case), and zero SQL.
RelBench: feature engineering vs. learned features
| Approach | AUROC (Avg) | Features Explored | Time per Task | LOC per Task |
|---|---|---|---|---|
| Manual FE + LightGBM | 62.44 | 50-200 (human-selected) | 12.3 hours | 878 |
| Featuretools + LightGBM | ~64-66 | 1,000-10,000 (auto) | 2-4 hours | 50-100 |
| Supervised GNN | 75.83 | All (graph-learned) | Hours-days | ~200 |
| KumoRFM zero-shot | 76.71 | All (pre-trained) | Seconds | 1 |
| KumoRFM fine-tuned | 81.14 | All (adapted) | 2-8 hours | 1 |
RelBench benchmark: 7 databases, 30 tasks, 103M+ rows. The model that skips feature engineering outperforms the model built on manually engineered features.
Why the model wins
Three structural advantages explain the gap.
Coverage. A human data scientist explores dozens to hundreds of features. A graph transformer attends to every relationship in the subgraph. It evaluates multi-hop paths, temporal sequences, and cross-table interactions that no human would enumerate.
Preservation. Feature engineering destroys information through aggregation. "Five orders in 30 days" loses the sequence, the spacing, the trend. The graph transformer sees the raw sequence of events with timestamps, preserving the temporal dynamics.
Transfer. A pre-trained foundation model has seen patterns across thousands of databases. It knows that recency, frequency, monetary value, graph centrality, and temporal acceleration are predictive universally. When it sees a new database, it applies these learned priors. A human data scientist on a new database starts from scratch.
When you still need feature engineering
Foundation models do not eliminate feature engineering for every ML task. Computer vision models still need image preprocessing (resizing, normalization, augmentation). NLP models still need tokenization and text cleaning. Reinforcement learning environments still need state representation engineering.
For structured, relational, and tabular data, which accounts for an estimated 80% of enterprise ML use cases, the evidence is clear: learning directly from the data structure outperforms manually engineering features from it. If your data lives in a relational database and your goal is prediction, feature engineering is now optional.
Practical recommendations
If you are starting a new prediction project on relational data: try a foundation model first. It will give you a baseline in minutes. If the accuracy meets your needs, you are done. If it does not, the foundation model's results tell you how much room for improvement exists and whether custom feature engineering is likely to close the gap.
If you have existing ML pipelines with extensive feature engineering: benchmark them against a foundation model on the same tasks. In many cases, the foundation model matches or exceeds the custom pipeline's accuracy while eliminating the maintenance burden. Migrating saves $300K-500K per year per use case in maintenance costs alone.
If you are teaching or learning feature engineering: understand the techniques. They remain valuable for building intuition about data and for the domains where they are still necessary. But also understand that for relational data, the field is moving toward models that learn features automatically, and the practitioner who can leverage both approaches will be more effective than one who can only do manual engineering.
The complete guide to feature engineering ends with a surprising conclusion: for the majority of enterprise ML, the best feature engineering strategy is no feature engineering at all.
PQL Query
PREDICT AVG(orders.amount, 0, 90) FOR EACH products.product_id
Predicting 90-day average order value per product. The model automatically discovers which customer segments, seasonal patterns, and cross-product signals drive the prediction.
Output
| product_id | predicted_avg_value | current_avg | key_driver |
|---|---|---|---|
| SKU-8847 | $142.30 | $128.50 | Holiday season approaching |
| SKU-2291 | $34.80 | $67.20 | Competitor price drop detected |
| SKU-5510 | $89.00 | $85.00 | Repeat purchasers increasing |
| SKU-1123 | $210.50 | $195.00 | Bundle attachment rate rising |