
The Complete Guide to Feature Engineering (And Why You Might Not Need It)

Feature engineering has been the defining activity of applied machine learning for two decades. This guide covers every technique, every tool, and every trap. It also covers why the best approach to feature engineering may be to skip it entirely.

TL;DR

  1. Feature engineering spans five types: numerical (scaling, binning, transforms), categorical (one-hot, target encoding, embeddings), temporal (windows, lags, cyclical), text (TF-IDF, transformer embeddings), and relational (cross-table, multi-hop, graph). Relational features are the most valuable and the most time-consuming.
  2. A Stanford study measured the cost: 12.3 hours and 878 lines of code per prediction task. Humans explore fewer than 5% of the possible feature space across multi-table databases.
  3. Tools at every level (SQL/Python, Featuretools, feature stores) help but do not solve the core bottleneck. Featuretools generates 10,000+ features from a 5-table database, creating a feature explosion problem that shifts the bottleneck to selection.
  4. On RelBench (7 databases, 30 tasks, 103M+ rows), KumoRFM zero-shot (76.71 AUROC) outperforms manual feature engineering + LightGBM (62.44 AUROC) by learning directly from raw relational data. Fine-tuned reaches 81.14.
  5. For 80% of enterprise ML use cases on relational data, the best feature engineering strategy is no feature engineering at all. The paradigm shift: from engineering features to learning representations.

Feature engineering is the art of converting raw data into numbers that machine learning models can learn from. It is called an "art" because, despite decades of practice, there is no systematic method for doing it well. A Stanford study measured what it costs: 12.3 hours and 878 lines of code per prediction task for experienced data scientists working on standard relational databases.

This guide is structured in three parts. First, the complete taxonomy of feature types and techniques. Second, the tools available to accelerate or automate the process. Third, the paradigm shift that makes the entire discipline optional for the most common enterprise ML use cases. For a focused overview of the bottleneck problem, see our companion article on feature engineering as the 80% bottleneck.

Part 1: Feature types and techniques

Numerical features

Numerical features start as numbers but often need transformation before models can use them effectively.

Scaling. Most ML algorithms (linear regression, SVMs, neural networks) are sensitive to feature scale. A feature ranging from 0-1,000,000 will dominate one ranging from 0-1. Standard approaches: min-max scaling (normalize to [0,1]), standardization (subtract mean, divide by standard deviation), and robust scaling (use median and IQR, resistant to outliers). Tree-based models (XGBoost, LightGBM) do not require scaling, which is one reason they dominate tabular ML.

Binning. Converting continuous values into discrete buckets. Age into age groups, income into income brackets, transaction amounts into small/medium/large. Binning sacrifices precision for noise reduction and can capture nonlinear thresholds (risk increases sharply above a specific credit utilization threshold). Equal-width bins, quantile bins, and domain-specific cutoffs are the common approaches.

Log and power transforms. Many financial and count variables are heavily right-skewed: most transactions are small, few are large. Log transformation compresses the tail and makes the distribution more symmetric, which helps linear models. Box-Cox and Yeo-Johnson transforms generalize this by learning the optimal power transformation from the data.

Interaction features. The product or ratio of two features can capture relationships that neither feature captures alone. Revenue per employee, price per square foot, clicks per impression. Domain knowledge guides which interactions to create. For n features, there are n(n-1)/2 pairwise interactions, which quickly becomes unmanageable without automated selection.
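
As a rough sketch, the four numerical transforms above fit in a few lines of pandas/NumPy. The transaction amounts here are invented toy data:

```python
import numpy as np
import pandas as pd

amounts = pd.Series([12.0, 45.0, 80.0, 3200.0, 150000.0])  # right-skewed toy data

# Min-max scaling to [0, 1] -- note how the $150K outlier dominates
min_max = (amounts - amounts.min()) / (amounts.max() - amounts.min())

# Standardization: zero mean, unit variance
z_scores = (amounts - amounts.mean()) / amounts.std()

# Log transform compresses the heavy right tail (log1p also handles zeros)
log_amounts = np.log1p(amounts)

# Quantile binning into small/medium/large buckets
buckets = pd.qcut(amounts, q=3, labels=["small", "medium", "large"])
```

Tree-based models would consume `amounts` directly; these transforms matter when the downstream model is linear or a neural network.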

Numerical feature techniques

| Technique | When to Use | Example | Risk |
| --- | --- | --- | --- |
| Min-Max Scaling | Neural networks, SVMs | Income: $0-500K -> 0.0-1.0 | Sensitive to outliers |
| Standardization | Linear models, PCA | Age: mean=35, std=12 -> z-scores | Assumes normality |
| Log Transform | Right-skewed distributions | Transaction: $1-$1M -> 0-6 | Fails on zero values |
| Binning | Nonlinear thresholds | Credit util: 0-30%, 30-70%, 70%+ | Information loss |
| Interaction Features | Domain-specific ratios | Revenue/Employee, Price/SqFt | Combinatorial explosion |

Tree-based models (XGBoost, LightGBM) do not require scaling. These techniques matter primarily for neural networks and linear models.

Categorical features

Machine learning models require numerical input. Categorical variables (country, product category, day of week) must be encoded as numbers without introducing false ordinal relationships.

One-hot encoding. Create a binary column for each category value. A "color" column with values red, blue, green becomes three binary columns. Simple and lossless, but impractical for high-cardinality features: a "user_id" column with 10 million unique values would create 10 million binary columns.
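
A minimal sketch with pandas (the toy "color" column mirrors the example above):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "green", "red"]})

# One binary indicator column per category value; lossless, but the column
# count equals the cardinality, so this only works at low cardinality
encoded = pd.get_dummies(df, columns=["color"])
```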

Target encoding. Replace each category with the mean of the target variable for that category. If users from zipcode 94105 have a 25% churn rate, encode 94105 as 0.25. Powerful but dangerous: without regularization (smoothing toward the global mean, adding noise, or using out-of-fold encoding), it causes severe overfitting because the model is directly seeing a leaky version of the target.
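
One common regularization, smoothing toward the global mean, can be sketched in pandas. The zipcodes and churn labels are toy data, and in practice you would combine this with out-of-fold encoding:

```python
import pandas as pd

df = pd.DataFrame({
    "zipcode": ["94105"] * 4 + ["10001"] * 2,
    "churned": [1, 0, 0, 0, 1, 1],
})

global_mean = df["churned"].mean()
stats = df.groupby("zipcode")["churned"].agg(["mean", "count"])

# Shrink each category mean toward the global mean; m controls how much
# evidence a category needs before its own mean dominates the encoding
m = 10
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
df["zip_encoded"] = df["zipcode"].map(smoothed)
```

The rare zipcode 10001 (2 rows, 100% churn) is encoded well below 1.0, which is exactly the point: small samples should not get extreme encodings.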

Frequency encoding. Replace each category with its occurrence count or frequency. Works well when the frequency itself is predictive (popular products have different return patterns than niche products) but fails when frequency is orthogonal to the target.

Embeddings. Learn a dense vector representation for each category value through a neural network. This is how modern recommendation systems handle user and item IDs: each ID maps to a 64-256 dimensional vector that captures the entity's behavior in compressed form. Requires sufficient data per category to learn meaningful representations.

Categorical encoding methods

| Method | Max Cardinality | Pros | Cons |
| --- | --- | --- | --- |
| One-Hot | ~20 categories | Lossless, simple | Explodes with high cardinality |
| Target Encoding | Unlimited | Powerful signal | Overfitting risk without CV |
| Frequency Encoding | Unlimited | Simple, no leakage | Fails when frequency != signal |
| Hash Encoding | Unlimited | Fixed output size | Collisions lose information |
| Learned Embeddings | Unlimited | Best signal capture | Requires neural network |

For high-cardinality features (user IDs, SKUs), embeddings or target encoding with regularization are the standard approaches.
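
Hash encoding, listed in the table above, fits in a few lines of standard-library Python. The bucket count and user ID here are arbitrary choices for illustration:

```python
import hashlib

def hash_encode(value: str, n_buckets: int = 1024) -> int:
    """Map an arbitrary category value to a fixed-size bucket index.

    A stable hash (unlike Python's built-in hash, which is seeded per
    process) keeps the encoding reproducible across runs. Distinct values
    can collide, which is the information loss noted in the table.
    """
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

bucket = hash_encode("user_8472913")
```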

Temporal features

Time is one of the most information-dense dimensions in enterprise data, and one of the hardest to feature-engineer correctly.

Time-windowed aggregations. Count, sum, average, min, max of a metric over a time window. "Number of transactions in the last 7 days," "average order value in the last 90 days." The choice of window sizes is critical and usually determined through trial and error. A window too short captures noise; a window too long smooths out real changes.

Lag features. The value of a metric at a previous time step. "Revenue last month," "inventory level 4 weeks ago." Lag features capture autocorrelation, which is the tendency of metrics to persist over time.

Rolling statistics. Moving average, moving standard deviation, exponential moving average. These capture trends and volatility. A rising 30-day moving average of transaction amounts signals increasing engagement. Rising volatility signals instability.

Cyclical encoding. Days of the week, months, and hours are cyclical: Monday is close to Sunday, December is close to January. Encoding them as integers (1-7 or 1-12) introduces false distance between adjacent cycle endpoints. Sine/cosine encoding preserves cyclical proximity: encode day d as (sin(2πd/7), cos(2πd/7)).

Time-since features. Duration since an event. "Days since last purchase," "hours since last login," "weeks since last support ticket." These capture recency, one of the strongest predictive signals in most enterprise contexts.
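
The five temporal techniques above can be sketched against a toy order history with pandas (the dates and amounts are invented):

```python
import numpy as np
import pandas as pd

orders = pd.DataFrame({
    "date": pd.to_datetime(
        ["2025-01-02", "2025-01-05", "2025-01-20", "2025-02-01", "2025-02-03"]
    ),
    "amount": [40.0, 55.0, 30.0, 80.0, 65.0],
}).set_index("date")

# Time-windowed aggregation: order count over a trailing 7-day window
orders["orders_7d"] = orders["amount"].rolling("7D").count()

# Rolling statistic: 30-day moving average of order value
orders["avg_amount_30d"] = orders["amount"].rolling("30D").mean()

# Lag feature: the previous order's amount (autocorrelation signal)
orders["prev_amount"] = orders["amount"].shift(1)

# Time-since feature: days since the previous order (recency)
orders["days_since_prev"] = orders.index.to_series().diff().dt.days

# Cyclical encoding: day-of-week mapped onto the unit circle
dow = orders.index.dayofweek
orders["dow_sin"] = np.sin(2 * np.pi * dow / 7)
orders["dow_cos"] = np.cos(2 * np.pi * dow / 7)
```

Note how each column is a lossy summary: `orders_7d` discards the ordering of events inside the window, and `avg_amount_30d` discards their volatility.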

Text features

Enterprise data increasingly includes text: support tickets, product descriptions, reviews, call transcripts, notes.

TF-IDF. Term frequency-inverse document frequency converts text into a sparse vector where each dimension represents a word, weighted by how important it is to a specific document relative to the corpus. Simple, interpretable, and effective for many classification tasks.
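
The mechanics fit in a few lines of plain Python; in practice you would use a library implementation such as scikit-learn's `TfidfVectorizer`. The support-ticket snippets below are toy data:

```python
import math
from collections import Counter

docs = [
    "refund request for damaged item",
    "refund not received yet",
    "login page not loading",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(docs)

# Document frequency: in how many documents each term appears
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Term frequency weighted by inverse document frequency."""
    tf = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }

weights = tfidf(tokenized[0])
# "refund" appears in 2 of 3 docs, so it is down-weighted relative to
# "damaged", which appears in only 1 of 3
```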

Word embeddings. Word2Vec, GloVe, or FastText map words to dense vectors that capture semantic similarity. "King" and "queen" have similar vectors. Text features are created by averaging word embeddings across the document.

Transformer embeddings. BERT, RoBERTa, or sentence transformers produce contextual embeddings that capture meaning at the sentence or document level. The current gold standard for text feature quality, but computationally expensive.

Temporal feature examples

| Feature Type | SQL Example | Signal Captured | Information Lost |
| --- | --- | --- | --- |
| Window Aggregation | COUNT(*) OVER (7d) | Recent activity level | Sequence pattern |
| Lag Feature | revenue_last_month | Autocorrelation | Trend direction |
| Rolling Average | AVG(amount) OVER (30d) | Trend level | Volatility spikes |
| Time-Since | DATEDIFF(NOW(), last_order) | Recency signal | Frequency context |
| Cyclical Encoding | sin(2*pi*day/7) | Day-of-week effect | None (lossless) |

Each temporal technique captures one signal while destroying others. A foundation model preserves the full temporal sequence.

Relational features

Relational features are the most valuable and the most difficult to engineer. They capture patterns across multiple connected tables.

Cross-table aggregations. "Total spend by this customer across all product categories," "number of distinct merchants this card has transacted with," "average rating of products purchased by this user." Each requires a join, a groupby, and an aggregation function. For n tables with m join paths, the number of possible aggregations scales combinatorially.

customers

| customer_id | name | segment | signup_date |
| --- | --- | --- | --- |
| C-101 | Acme Corp | Enterprise | 2023-01-15 |
| C-102 | Beta LLC | Mid-Market | 2023-06-20 |

orders

| order_id | customer_id | product_id | amount | date |
| --- | --- | --- | --- | --- |
| O-501 | C-101 | P-10 | $2,400 | 2025-01-10 |
| O-502 | C-101 | P-11 | $800 | 2025-02-15 |
| O-503 | C-102 | P-10 | $2,400 | 2025-01-22 |
| O-504 | C-102 | P-12 | $150 | 2025-03-01 |

products

| product_id | name | category | avg_review |
| --- | --- | --- | --- |
| P-10 | Platform License | Core | 4.5 |
| P-11 | Analytics Add-On | Expansion | 4.8 |
| P-12 | Basic Reports | Self-Serve | 2.9 |

flat_feature_table (what gets built from the 3 tables above)

| customer_id | order_count | total_spend | avg_order | product_count | churned? |
| --- | --- | --- | --- | --- | --- |
| C-101 | 2 | $3,200 | $1,600 | 2 | No |
| C-102 | 2 | $2,550 | $1,275 | 2 | Yes |

Both customers have 2 orders and 2 products. The flat table cannot show that C-101 bought the expansion add-on (4.8 rating) while C-102 bought the low-rated self-serve product (2.9 rating). The product quality signal -- a strong churn predictor -- is lost.
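
A pandas sketch of the same toy tables makes the gap concrete: the standard flat features are identical for both customers, while a single cross-table join recovers the product-quality signal:

```python
import pandas as pd

orders = pd.DataFrame({
    "customer_id": ["C-101", "C-101", "C-102", "C-102"],
    "product_id": ["P-10", "P-11", "P-10", "P-12"],
    "amount": [2400, 800, 2400, 150],
})
products = pd.DataFrame({
    "product_id": ["P-10", "P-11", "P-12"],
    "avg_review": [4.5, 4.8, 2.9],
})

# Standard flat features: single-table aggregates over orders
flat = orders.groupby("customer_id").agg(
    order_count=("amount", "count"),
    total_spend=("amount", "sum"),
    avg_order=("amount", "mean"),
)

# The cross-table feature the flat table misses: average review score of
# the products each customer actually bought
joined = orders.merge(products, on="product_id")
flat["avg_product_review"] = joined.groupby("customer_id")["avg_review"].mean()
```

C-101 ends up at 4.65 and C-102 at 3.7: the churn-predictive difference the flat table could not see.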

Multi-hop features. "Average churn rate of customers who bought the same products as this customer." This requires joining through customer → orders → products → orders (of other customers) → customer churn status. A 4-table join that most data scientists never write because they never think to look for it.

Graph features. Node degree (how many connections), centrality metrics (how important is this node in the network), community membership, and PageRank-style influence scores. These require computing graph metrics on the relational structure, which is outside the typical SQL workflow.

PQL Query

PREDICT COUNT(orders.*, 0, 30) > 0
FOR EACH customers.customer_id

Instead of engineering relational features manually, a foundation model discovers all cross-table patterns automatically. This PQL query replaces hundreds of lines of SQL joins and aggregations.

Output

| customer_id | prediction | confidence | top_cross_table_signal |
| --- | --- | --- | --- |
| C-4821 | 0.87 | high | Similar customers churning (3-hop) |
| C-1093 | 0.42 | medium | Product return rate rising (2-hop) |
| C-7756 | 0.15 | high | Support CSAT above avg (2-hop) |
| C-3302 | 0.94 | high | Payment method removed (1-hop) |

Part 2: Tools for feature engineering

SQL and Python

The baseline. Most feature engineering happens in SQL (for cross-table joins and aggregations) and Python/Pandas (for transformations and encoding). This is the most flexible approach but the most labor-intensive. The 12.3 hours and 878 lines of code from the Stanford study represent this approach.

Featuretools

The leading automated feature engineering library for relational data. Featuretools implements "deep feature synthesis": given a set of tables and their relationships, it automatically generates features by applying aggregation primitives (sum, count, mean, max, min) across join paths, and transform primitives (year, month, weekday) on individual columns. It can stack these operations to create multi-hop features.

The strength: it dramatically reduces manual effort and can discover features that humans overlook. The weakness: feature explosion. On a 5-table database, Featuretools can generate 10,000+ features, most of which are noise. Feature selection becomes the new bottleneck.

Feature stores

Tecton, Feast, and Hopsworks solve the operational side of feature management: storing feature values, serving them at low latency for real-time inference, ensuring consistency between training and serving, and tracking feature lineage and versioning. Enterprise feature stores cost $100K-500K/year.

Feature stores are essential for production ML systems. But they are infrastructure for managing features, not creating them. A feature store without good features is an expensive, empty database.

Domain-specific tools

tsfresh: Automated extraction of 794 time-series features with built-in significance testing. Effective for sensor data and IoT applications.

Feature-engine: A scikit-learn compatible library with transformers for encoding, discretization, outlier handling, and feature selection. Clean API for building reproducible feature engineering pipelines.

Category Encoders: 15+ encoding methods for categorical variables, from simple (one-hot, ordinal) to advanced (target encoding, leave-one-out, CatBoost encoding).

Manual feature engineering

  • 12.3 hours and 878 lines of code per task
  • Explores tiny fraction of feature space
  • Multi-hop features rarely attempted
  • Temporal sequences destroyed by aggregation
  • Each new task starts from scratch

Foundation model (no feature engineering)

  • Connect database, specify target, get predictions
  • Full relational structure explored automatically
  • Multi-hop patterns captured by graph attention
  • Temporal sequences preserved and learned from
  • Same model handles any prediction task

Part 3: The paradigm shift

Every technique and tool described above shares an assumption: raw data must be converted into a flat numerical table before a model can learn from it. This assumption was true for every ML algorithm from logistic regression through gradient-boosted trees. It is no longer true.

Relational Deep Learning, published at ICML 2024, demonstrated that relational databases can be represented as temporal heterogeneous graphs (rows are nodes, foreign keys are edges, timestamps order events) and that graph neural networks trained on this structure outperform manually engineered features.

The RelBench benchmark quantifies the gap. Across 7 databases, 30 tasks, and 103 million rows: LightGBM with manually engineered features achieves 62.44 AUROC. A supervised GNN on the relational graph achieves 75.83 AUROC. KumoRFM zero-shot achieves 76.71 AUROC. KumoRFM fine-tuned achieves 81.14 AUROC.

The model that reads raw relational data outperforms the model that uses carefully engineered features by 14+ AUROC points. And it does so with zero feature engineering, zero per-task training (in the zero-shot case), and zero SQL.

RelBench: engineered features vs. learned representations

| Approach | AUROC (Avg) | Features Explored | Time per Task | LOC per Task |
| --- | --- | --- | --- | --- |
| Manual FE + LightGBM | 62.44 | 50-200 (human-selected) | 12.3 hours | 878 |
| Featuretools + LightGBM | ~64-66 | 1,000-10,000 (auto) | 2-4 hours | 50-100 |
| Supervised GNN | 75.83 | All (graph-learned) | Hours-days | ~200 |
| KumoRFM zero-shot | 76.71 | All (pre-trained) | Seconds | 1 |
| KumoRFM fine-tuned | 81.14 | All (adapted) | 2-8 hours | 1 |

RelBench benchmark: 7 databases, 30 tasks, 103M+ rows. The model that skips feature engineering outperforms the model built on manually engineered features.

Why the model wins

Three structural advantages explain the gap.

Coverage. A human data scientist explores dozens to hundreds of features. A graph transformer attends to every relationship in the subgraph. It evaluates multi-hop paths, temporal sequences, and cross-table interactions that no human would enumerate.

Preservation. Feature engineering destroys information through aggregation. "Five orders in 30 days" loses the sequence, the spacing, the trend. The graph transformer sees the raw sequence of events with timestamps, preserving the temporal dynamics.

Transfer. A pre-trained foundation model has seen patterns across thousands of databases. It knows that recency, frequency, monetary value, graph centrality, and temporal acceleration are predictive universally. When it sees a new database, it applies these learned priors. A human data scientist on a new database starts from scratch.

When you still need feature engineering

Foundation models do not eliminate feature engineering for every ML task. Computer vision models still need image preprocessing (resizing, normalization, augmentation). NLP models still need tokenization and text cleaning. Reinforcement learning environments still need state representation engineering.

For structured, relational, and tabular data, which accounts for an estimated 80% of enterprise ML use cases, the evidence is clear: learning directly from the data structure outperforms manually engineering features from it. If your data lives in a relational database and your goal is prediction, feature engineering is now optional.

Practical recommendations

If you are starting a new prediction project on relational data: try a foundation model first. It will give you a baseline in minutes. If the accuracy meets your needs, you are done. If it does not, the foundation model's results tell you how much room for improvement exists and whether custom feature engineering is likely to close the gap.

If you have existing ML pipelines with extensive feature engineering: benchmark them against a foundation model on the same tasks. In many cases, the foundation model matches or exceeds the custom pipeline's accuracy while eliminating the maintenance burden. Migrating saves $300K-500K per year per use case in maintenance costs alone.

If you are teaching or learning feature engineering: understand the techniques. They remain valuable for building intuition about data and for the domains where they are still necessary. But also understand that for relational data, the field is moving toward models that learn features automatically, and the practitioner who can leverage both approaches will be more effective than one who can only do manual engineering.

The complete guide to feature engineering ends with a surprising conclusion: for the majority of enterprise ML, the best feature engineering strategy is no feature engineering at all.

PQL Query

PREDICT AVG(orders.amount, 0, 90)
FOR EACH products.product_id

Predicting 90-day average order value per product. The model automatically discovers which customer segments, seasonal patterns, and cross-product signals drive the prediction.

Output

| product_id | predicted_avg_value | current_avg | key_driver |
| --- | --- | --- | --- |
| SKU-8847 | $142.30 | $128.50 | Holiday season approaching |
| SKU-2291 | $34.80 | $67.20 | Competitor price drop detected |
| SKU-5510 | $89.00 | $85.00 | Repeat purchasers increasing |
| SKU-1123 | $210.50 | $195.00 | Bundle attachment rate rising |

Frequently asked questions

What are the main types of feature engineering?

The main types are: numerical features (scaling, binning, log transforms, polynomial features, interaction features), categorical features (one-hot encoding, target encoding, frequency encoding, embedding), temporal features (time-windowed aggregations, lag features, rolling statistics, cyclical encoding for day/month), text features (TF-IDF, word embeddings, n-grams), and relational features (cross-table aggregations, graph-based features, multi-hop path features). Relational features are the most time-consuming and the most predictive in enterprise settings.

What is the difference between a feature store and feature engineering?

Feature engineering is the process of creating features from raw data. A feature store is infrastructure for storing, serving, and managing features after they are created. Feature stores (Tecton, Feast, Hopsworks) solve the operational problem of serving features at low latency with training-serving consistency. They do not solve the creation problem. Someone still has to decide what features to build. A feature store without good features is an empty warehouse.

What tools are available for automated feature engineering?

The main tools are: Featuretools (Python library for automated feature generation on relational data, generates features through deep feature synthesis), tsfresh (automated time-series feature extraction), Feature-engine (scikit-learn compatible transformers), and AutoFeat (automatic feature engineering and selection). These tools generate features programmatically but face a feature explosion problem: they create thousands of features, most of which are noise, requiring another round of feature selection.

How do you handle high-cardinality categorical features?

High-cardinality categoricals (user IDs, product IDs, zip codes) cannot use one-hot encoding because it creates too many columns. Common approaches: target encoding (replace each category with its mean target value, with regularization to prevent overfitting), frequency encoding (replace with occurrence count), hash encoding (map to fixed-size space using hash function), and learned embeddings (train a neural network to learn dense vector representations). For relational data, foundation models handle this automatically through their graph representation.

Is feature engineering still necessary with foundation models?

For prediction tasks on relational/tabular data, no. Relational foundation models like KumoRFM learn directly from raw database tables without any feature engineering. They represent the database as a temporal graph and use graph transformers to automatically discover predictive patterns. On the RelBench benchmark, this zero-shot approach outperforms manual feature engineering by experienced data scientists. Feature engineering remains necessary for specialized domains (computer vision, NLP) where foundation models need domain-specific input preprocessing.

See it in action

KumoRFM delivers predictions on relational data in seconds. No feature engineering, no ML pipelines. Try it free.