In 2019, DataRobot raised $206 million at a $1.7 billion valuation on the promise of "democratizing AI." The pitch was compelling: upload your data, and the platform automatically selects the best model, tunes hyperparameters, and delivers predictions. No PhD required. Google, Microsoft, and Amazon all launched competing AutoML products.
AutoML delivered on its core promise. It genuinely automates model selection and tuning. A data scientist who would spend two days comparing XGBoost configurations can get equivalent results in minutes. That is real value.
But there is a step that every AutoML demo quietly skips: the "upload your data" part. The platform needs a single flat CSV or DataFrame. One row per entity, one column per feature. If your data lives in a relational database with 15 interconnected tables, someone has to flatten it first. That flattening process is feature engineering, and it routinely consumes around 80% of total project time.
AutoML automates the last 20%. The hard 80% is untouched.
What AutoML actually does well
To be fair to AutoML, the problems it solves are real. Before AutoML, model selection was a manual, expertise-heavy process.
Model selection
Given a flat feature table and a target variable, AutoML tests multiple algorithms: logistic regression, random forest, XGBoost, LightGBM, CatBoost, various neural network architectures. It evaluates each on a holdout set and ranks them by performance. A 2020 study by Erickson et al. found that AutoGluon (Amazon's AutoML framework) matched or beat the median Kaggle competitor on 39 of 50 benchmark datasets. For teams without deep ML expertise, this is transformative.
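The selection loop itself is simple to picture. Here is a minimal, purely illustrative sketch: fit each candidate on a training split, score each on a holdout split, and keep the winner. The toy threshold and majority-class "models" below stand in for real learners like XGBoost or logistic regression.

```python
# Minimal sketch of AutoML-style model selection: train several candidate
# models, score each on a holdout split, rank by holdout accuracy.
# The "models" here are toy rules, purely illustrative.

def train_threshold(rows):
    # pick the feature threshold that best separates the training labels
    best = (0.0, None)
    for t in [r[0] for r in rows]:
        acc = sum((r[0] >= t) == r[1] for r in rows) / len(rows)
        if acc > best[0]:
            best = (acc, t)
    return lambda x: x >= best[1]

def train_majority(rows):
    # always predict the most common training label
    majority = sum(r[1] for r in rows) > len(rows) / 2
    return lambda x: majority

def select_best(candidates, train, holdout):
    scored = []
    for name, trainer in candidates:
        model = trainer(train)
        acc = sum(model(x) == y for x, y in holdout) / len(holdout)
        scored.append((acc, name, model))
    return max(scored)  # highest holdout accuracy wins

train = [(0.2, False), (0.4, False), (0.6, True), (0.9, True)]
holdout = [(0.3, False), (0.7, True)]
acc, name, model = select_best(
    [("threshold", train_threshold), ("majority", train_majority)],
    train, holdout)
```

Real platforms run this same loop over dozens of algorithms and cross-validated splits, but the structure (train, score on held-out data, rank) is unchanged.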
Hyperparameter optimization (HPO)
Each algorithm has dozens of settings: learning rate, maximum tree depth, regularization strength, number of estimators. The optimal combination is dataset-specific. AutoML uses techniques like Bayesian optimization (SMAC, Optuna), bandit-based methods (Hyperband), or evolutionary search to explore the hyperparameter space efficiently. This reliably squeezes 1-3% additional accuracy from a given model.
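The search loop behind HPO can be sketched in a few lines. This is plain random search over a hypothetical validation-loss surface, purely illustrative; production systems replace the sampling strategy with Bayesian optimization or Hyperband, but the shape of the loop is the same.

```python
import random

# Sketch of the search loop inside HPO. The objective below is a made-up
# stand-in for "validation loss at this hyperparameter configuration".

def validation_loss(learning_rate, max_depth):
    # hypothetical response surface: best near lr=0.1, depth=6
    return (learning_rate - 0.1) ** 2 + 0.01 * (max_depth - 6) ** 2

def random_search(n_trials, seed=0):
    rng = random.Random(seed)
    best_loss, best_cfg = float("inf"), None
    for _ in range(n_trials):
        cfg = {
            "learning_rate": 10 ** rng.uniform(-3, 0),  # log-uniform sample
            "max_depth": rng.randint(2, 12),
        }
        loss = validation_loss(**cfg)
        if loss < best_loss:
            best_loss, best_cfg = loss, cfg
    return best_loss, best_cfg

loss, cfg = random_search(200)
```

Smarter samplers reach a good region in far fewer trials than random search, which is where the 1-3% accuracy gain comes from on a fixed compute budget.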
Neural Architecture Search (NAS)
For deep learning models, AutoML can search over network architectures: number of layers, hidden dimensions, activation functions, skip connections. Google's NASNet, discovered by AutoML, achieved state-of-the-art performance on ImageNet in 2018. NAS is computationally expensive but has produced architectures that humans did not design.
Model ensembling
Combining predictions from multiple models (stacking, blending, weighted averaging) almost always improves accuracy. AutoML automates the ensemble construction, selecting which models to combine and how to weight them. AutoGluon's default stacking ensemble is one of its strongest features.
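Weighted averaging, the simplest of these techniques, can be sketched directly. The model names, holdout scores, and predicted probabilities below are made up for illustration.

```python
# Sketch of weighted-average ensembling: combine per-model predicted
# probabilities, with weights proportional to each model's holdout score.
# All numbers are illustrative, not from any real run.

holdout_scores = {"xgboost": 0.83, "random_forest": 0.79, "logreg": 0.74}
predictions = {  # P(convert) for one lead, per model
    "xgboost": 0.71,
    "random_forest": 0.64,
    "logreg": 0.58,
}

total = sum(holdout_scores.values())
weights = {m: s / total for m, s in holdout_scores.items()}
ensemble_p = sum(weights[m] * predictions[m] for m in predictions)
```

Stacking goes a step further by training a second-level model on the base models' predictions instead of fixing the weights by hand, which is what AutoGluon automates.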
The 80% AutoML cannot touch
To see the limitation concretely, consider a B2B SaaS company using AutoML for lead scoring. The data lives across three CRM tables.
leads
| lead_id | company | industry | employees | source |
|---|---|---|---|---|
| LD-401 | Meridian Corp | Financial Services | 2,400 | Webinar |
| LD-402 | BrightPath Inc | Healthcare | 180 | Organic Search |
| LD-403 | Vertex Dynamics | Manufacturing | 8,500 | Referral |
| LD-404 | CloudScale Labs | Technology | 45 | PPC Ad |
activities
| activity_id | lead_id | type | date | duration |
|---|---|---|---|---|
| ACT-01 | LD-401 | Demo request | 2025-10-05 | --- |
| ACT-02 | LD-401 | Pricing page visit | 2025-10-06 | 4 min |
| ACT-03 | LD-401 | Case study download | 2025-10-07 | --- |
| ACT-04 | LD-402 | Blog visit | 2025-09-20 | 1 min |
| ACT-05 | LD-403 | Demo completed | 2025-10-12 | 38 min |
| ACT-06 | LD-403 | Security review request | 2025-10-14 | --- |
| ACT-07 | LD-404 | Free trial signup | 2025-11-01 | --- |
Notice: LD-401 shows a rapid progression (demo, pricing, case study in 3 days), while LD-403 is at the enterprise stage (demo plus security review). These sequences matter, but AutoML never sees them.
conversions
| lead_id | stage | deal_value | closed_date |
|---|---|---|---|
| LD-401 | Negotiation | $86,000 | --- |
| LD-403 | Security Review | $420,000 | --- |
| LD-404 | Free Trial | $0 | --- |
Highlighted: LD-403 is a $420K deal in security review. AutoML sees 'stage = Security Review' as a flat feature. It cannot see that the demo-to-security-review progression in 2 days signals urgency.
The machine learning pipeline has five stages. AutoML addresses stages 3 and 4. Stages 1 and 2 are where the time goes.
Stage 1: Problem definition (5%). What are we predicting? What business decision does this inform? What is the target variable? This requires domain expertise, not automation.
Stage 2: Data preparation and feature engineering (60-80%). This is the gap. Your prediction data lives across multiple relational tables. The AutoML platform needs a single flat table. Bridging that gap requires: mapping the relational schema, writing SQL joins, selecting aggregation functions and time windows, engineering derived features, handling missing values, encoding categoricals, and iterating based on model feedback. A Stanford study measured this at 12.3 hours and 878 lines of code per prediction task.
Stage 3: Model selection and training (5-10%). AutoML handles this well.
Stage 4: Hyperparameter tuning (5-10%). AutoML handles this well.
Stage 5: Deployment and monitoring (10-15%). Some AutoML platforms offer deployment features, but production ML infrastructure varies widely across organizations.
The flat table assumption
Every AutoML platform, without exception, requires input as a flat table. DataRobot's documentation: "Upload a dataset (CSV, Parquet, or connect to a data source)." H2O's documentation: "Import your data as a single H2OFrame." Google AutoML Tables: "Your dataset must be a single BigQuery table or CSV."
This is not a design oversight. It is an architectural constraint. AutoML was built on top of traditional ML frameworks (scikit-learn, XGBoost, TensorFlow) that all assume flat tabular input. Changing this would require a fundamentally different architecture, one that operates on relational structure natively.
What gets lost in flattening
When a data scientist flattens the three CRM tables above into a single feature table for AutoML, this is what the platform actually receives.
flat_feature_table (what AutoML sees)
| lead_id | industry | employees | source | activity_count | days_since_last | avg_duration |
|---|---|---|---|---|---|---|
| LD-401 | Financial Services | 2,400 | Webinar | 3 | 4 | 4 min |
| LD-402 | Healthcare | 180 | Organic Search | 1 | 52 | 1 min |
| LD-403 | Manufacturing | 8,500 | Referral | 2 | 27 | 38 min |
| LD-404 | Technology | 45 | PPC Ad | 1 | 0 | 0 min |
AutoML sees activity_count = 3 for LD-401 and activity_count = 2 for LD-403. It cannot see that LD-401's 3 activities happened on 3 consecutive days (demo, pricing, case study), a rapid buying sequence, or that LD-403's demo lasted 38 minutes and was followed by a security review, an enterprise buying signal. The sequence and the activity types are gone.
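The flattening itself is a short aggregation loop. Here is a sketch over the activities table above, showing exactly where the sequence information disappears:

```python
from datetime import date

# Sketch of the flattening step AutoML requires: aggregate the activities
# table down to one row per lead. The activity types and the ordering of
# dates are dropped; only counts and a recency date survive.

activities = [
    ("LD-401", "Demo request", date(2025, 10, 5)),
    ("LD-401", "Pricing page visit", date(2025, 10, 6)),
    ("LD-401", "Case study download", date(2025, 10, 7)),
    ("LD-402", "Blog visit", date(2025, 9, 20)),
    ("LD-403", "Demo completed", date(2025, 10, 12)),
    ("LD-403", "Security review request", date(2025, 10, 14)),
    ("LD-404", "Free trial signup", date(2025, 11, 1)),
]

flat = {}
for lead_id, activity_type, when in activities:
    row = flat.setdefault(lead_id, {"activity_count": 0, "last_date": when})
    row["activity_count"] += 1
    row["last_date"] = max(row["last_date"], when)

# After this loop, two leads with the same activity_count are
# indistinguishable, regardless of what the activities were or when.
```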
Three categories of signal are destroyed in this process. Let's look at each one with the actual CRM data above.
Signal destroyed: multi-hop relationships
LD-403 (Vertex Dynamics) was referred by an existing customer. That referring customer previously closed a $280K deal and renewed twice. Their referral history has a 70% close rate. This signal exists in the CRM: lead → referral source → past deals → renewal history. Three hops.
AutoML sees source = Referral. One flat string. It cannot distinguish a referral from a customer who churned from a referral from your best account. The multi-hop context (who referred them, what is that referrer's track record) is invisible.
what AutoML sees vs. what exists in the relational data
| Signal | AutoML flat table | Relational graph (3 hops) |
|---|---|---|
| Referral quality | source = 'Referral' | Referred by Acct #2847, who closed $280K, renewed 2x, and has 70% referral close rate |
| Buying stage | activity_count = 2 | Demo (38 min) followed by Security Review request in 2 days. Enterprise buying pattern. |
| Industry pattern | industry = 'Manufacturing' | 3 of last 5 Manufacturing deals >$200K closed in Q4. Seasonal enterprise budget cycle. |
Same lead, three different views. AutoML operates on the left column. The right column is what exists in the database but never makes it into the flat table.
Signal destroyed: temporal sequences
LD-401 (Meridian Corp) has activity_count = 3. But those 3 activities happened on October 5, 6, and 7: demo request, pricing page, case study download. Three days. That is a buyer in active evaluation, accelerating toward a decision.
Compare that to a hypothetical lead with 3 activities spread over 3 months: a blog visit in August, a webinar in September, a whitepaper in October. Same activity_count = 3. Completely different intent. The first lead is hot. The second is lukewarm. AutoML sees identical numbers for both.
two leads with activity_count = 3 (identical in flat table)
| Lead | Activity 1 | Activity 2 | Activity 3 | Span | Intent |
|---|---|---|---|---|---|
| LD-401 | Demo request (Oct 5) | Pricing page (Oct 6) | Case study (Oct 7) | 3 days | Hot: accelerating buyer |
| LD-999 | Blog visit (Aug 12) | Webinar (Sep 18) | Whitepaper (Oct 22) | 71 days | Lukewarm: passive research |
AutoML sees activity_count = 3 for both. The 3-day sprint versus the 71-day drift is the difference between an $86K deal closing this quarter and a lead that needs 6 more months of nurturing. The count destroys the sequence.
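The point is easy to verify mechanically. A sketch using the dates from the table above:

```python
from datetime import date

# Both leads collapse to activity_count = 3 in the flat table. The span
# between first and last activity, the actual intent signal, is what
# differs, and it never reaches the model.

activity_dates = {
    "LD-401": [date(2025, 10, 5), date(2025, 10, 6), date(2025, 10, 7)],
    "LD-999": [date(2025, 8, 12), date(2025, 9, 18), date(2025, 10, 22)],
}

flat_view = {lead: len(d) for lead, d in activity_dates.items()}
span_days = {lead: (max(d) - min(d)).days for lead, d in activity_dates.items()}

# flat_view is identical for both leads; span_days differs by a factor of ~35
```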
Signal destroyed: graph topology
LD-401's company (Meridian Corp, Financial Services, 2,400 employees) is similar in size, industry, and buying pattern to 4 other accounts in the CRM. Of those 4 similar accounts, 3 closed deals averaging $95K in the last two quarters. That is a graph signal: Meridian is embedded in a cluster of converting accounts.
LD-404's company (CloudScale Labs, Technology, 45 employees) is also similar to 4 other accounts. But those 4 similar accounts all churned or never converted. CloudScale is embedded in a cluster of non-converting accounts.
account neighborhood (graph topology)
| Lead | Similar accounts | Their outcomes | Cluster signal |
|---|---|---|---|
| LD-401 (Meridian) | 4 FinServ accounts, 1K-5K employees | 3 of 4 closed ($78K, $95K, $112K) | High-conversion cluster |
| LD-404 (CloudScale) | 4 Tech startups, 20-80 employees | 0 of 4 converted (all churned from trial) | Non-converting cluster |
AutoML scores both leads based on their individual attributes. It cannot see that Meridian sits in a neighborhood of buyers while CloudScale sits in a neighborhood of churners. This peer signal often outweighs individual features.
AutoML reduces these network neighborhoods to flat attributes: industry = Financial Services, employees = 2400. The shape of the surrounding accounts, their conversion history, their similarity patterns: all of it is invisible after flattening. Graph-based models see it natively because they operate on the connection structure.
AutoML approach
- Requires a pre-built flat feature table
- Automates model selection and HPO (20% of work)
- Feature engineering is still manual (80% of work)
- Cannot see multi-hop or temporal signals lost in flattening
- 62.44 AUROC (LightGBM + manual features on RelBench)
Foundation model approach
- Reads relational database directly, no flat table needed
- Eliminates feature engineering entirely (the 80%)
- Discovers multi-hop, temporal, and graph patterns automatically
- Same model handles any prediction task on the database
- 76.71 AUROC (KumoRFM zero-shot on RelBench)
The vendors know this
AutoML vendors are not unaware of the feature engineering gap. Their response has been to add "automated feature engineering" modules, but these are limited to single-table transformations: binning, polynomial features, log transforms, interaction terms. They operate on the flat table that already exists. They do not generate features from the relational structure upstream.
Featuretools, an open-source library from Alteryx, goes further. It defines "Deep Feature Synthesis," which automatically generates features from multi-table relational data by applying aggregation primitives across join paths. This is a better approach, but it has its own problems: it produces thousands of features (most of them noise), requires a separate feature selection step, and is limited to predefined aggregation functions. It cannot learn new types of patterns the way a neural network can.
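The idea behind Deep Feature Synthesis can be sketched without the library: walk a join path and apply every aggregation primitive to every numeric column reachable along it. This toy version uses only the timed activities from the example above and three illustrative primitives; a real run multiplies primitives across every column and join path, which is how the feature count explodes.

```python
# Sketch of the Deep Feature Synthesis idea: for each lead, follow the
# join path leads -> activities and apply a fixed set of aggregation
# primitives to a numeric column. Primitive names are illustrative.

# duration in minutes, only for activities that had one in the source table
activities = [
    ("LD-401", 4), ("LD-402", 1), ("LD-403", 38),
]

primitives = {
    "count": len,
    "max": max,
    "mean": lambda xs: sum(xs) / len(xs),
}

features = {}
for lead_id in {"LD-401", "LD-402", "LD-403", "LD-404"}:
    values = [v for lid, v in activities if lid == lead_id]
    for name, fn in primitives.items():
        key = f"activities.duration.{name}"
        # leads with no timed activities get None, a missing value the
        # downstream model must then handle
        features.setdefault(lead_id, {})[key] = fn(values) if values else None
```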
The fundamental issue is that feature engineering from relational data is not an optimization problem (where search is sufficient). It is a representation learning problem (where you need a model that can discover patterns in the relational structure). AutoML treats it as the former. Foundation models treat it as the latter.
PQL Query
PREDICT conversions.stage = 'Closed Won' FOR EACH leads.lead_id
One PQL query replaces the entire lead scoring pipeline. The model reads leads, activities, and conversions as a graph and discovers that activity velocity, sequence patterns, and industry-stage combinations predict conversion.
Output
| lead_id | conversion_probability | top_signal |
|---|---|---|
| LD-401 | 0.74 | Rapid demo-pricing-casestudy sequence |
| LD-402 | 0.12 | Single low-engagement blog visit |
| LD-403 | 0.81 | Enterprise buying signals (demo + security) |
| LD-404 | 0.08 | Small company, PPC source, trial-only |
The accuracy gap that AutoML cannot close
Here is the quantitative argument. On the RelBench benchmark (7 databases, 30 tasks, 103 million rows), the best possible flat-table approach (a skilled human spending 12.3 hours per task to engineer features, then training LightGBM) achieves 62.44 AUROC on classification tasks.
AutoML can improve the model selection and tuning portion of this workflow. Realistically, that adds 1-3 AUROC points. You might reach 64-65 AUROC with perfect model selection on the same flat features.
KumoRFM, which reads the relational structure directly, achieves 76.71 AUROC zero-shot. The gap between 65 and 76.71 is not a model selection gap. It is an information gap. The foundation model sees patterns in the relational structure that flat-table models cannot access regardless of how well they are tuned.
No amount of AutoML optimization can compensate for data that was destroyed before the model saw it.
When AutoML is the right choice
AutoML is genuinely valuable in specific scenarios. If your data already exists as a clean flat table (a Kaggle dataset, a pre-aggregated data warehouse table, sensor readings), AutoML will find a strong model faster than manual experimentation. If your team lacks deep ML expertise and needs to train a model on existing features, AutoML platforms like AutoGluon or H2O reduce the barrier.
AutoML is also useful for rapid prototyping. Upload a quick feature table, see if the signal exists, then decide whether to invest in more sophisticated approaches. It is a good first pass.
Where AutoML falls short is the typical enterprise scenario: data spread across multiple relational tables, questions that require cross-table patterns, temporal signals that aggregation destroys, and a pipeline that must be rebuilt for every new prediction task.
The real automation gap
The ML industry spent a decade automating the easy part and calling it "automated machine learning." Model selection is well-solved. Hyperparameter tuning is well-solved. Ensembling is well-solved. These advances are real and valuable.
But the 80% of work that is feature engineering from relational data remained manual because it requires a fundamentally different approach. You cannot search your way to good features across 15 tables with millions of rows. You need a model that learns representations from the relational structure directly.
That model exists now. Relational foundation models like KumoRFM skip the step that AutoML cannot automate. They read multi-table databases natively, discover predictive patterns through graph transformers, and deliver predictions without any feature engineering or model training. The entire pipeline that AutoML partially automates is replaced by a single inference call.
The question for ML teams is not "which AutoML platform should we use?" It is "why are we still building flat feature tables at all?"