In 2019, DataRobot raised $206 million at a $1.7 billion valuation on the promise of "democratizing AI." The pitch was compelling: upload your data, and the platform automatically selects the best model, tunes hyperparameters, and delivers predictions. No PhD required. Google, Microsoft, and Amazon all launched competing AutoML products.
AutoML delivered on its core promise. It genuinely automates model selection and tuning. A data scientist who would spend two days comparing XGBoost configurations can get equivalent results in minutes. That is real value.
But there is a step that every AutoML demo quietly skips: the "upload your data" part. The platform needs a single flat CSV or DataFrame. One row per entity, one column per feature. If your data lives in a relational database with 15 interconnected tables, someone has to flatten it first. That flattening process is feature engineering, and it routinely consumes around 80% of total project time.
AutoML automates the last 20%. The hard 80% is untouched.
What AutoML actually does well
To be fair to AutoML, the problems it solves are real. Before AutoML, model selection was a manual, expertise-heavy process.
Model selection
Given a flat feature table and a target variable, AutoML tests multiple algorithms: logistic regression, random forest, XGBoost, LightGBM, CatBoost, various neural network architectures. It evaluates each on a holdout set and ranks them by performance. A 2020 study by Erickson et al. found that AutoGluon (Amazon's AutoML framework) matched or beat the median Kaggle competitor on 39 of 50 benchmark datasets. For teams without deep ML expertise, this is transformative.
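The selection loop itself is simple to picture. Here is a minimal, purely illustrative sketch: fit each candidate on a training split, score each on a holdout split, and keep the winner. The toy threshold and majority-class "models" below stand in for real learners like XGBoost or logistic regression.

```python
# Minimal sketch of AutoML-style model selection: train several candidate
# models, score each on a holdout split, rank by holdout accuracy.
# The "models" here are toy rules, purely illustrative.

def train_threshold(rows):
    # pick the feature threshold that best separates the training labels
    best = (0.0, None)
    for t in [r[0] for r in rows]:
        acc = sum((r[0] >= t) == r[1] for r in rows) / len(rows)
        if acc > best[0]:
            best = (acc, t)
    return lambda x: x >= best[1]

def train_majority(rows):
    # always predict the most common training label
    majority = sum(r[1] for r in rows) > len(rows) / 2
    return lambda x: majority

def select_best(candidates, train, holdout):
    scored = []
    for name, trainer in candidates:
        model = trainer(train)
        acc = sum(model(x) == y for x, y in holdout) / len(holdout)
        scored.append((acc, name, model))
    return max(scored)  # highest holdout accuracy wins

train = [(0.2, False), (0.4, False), (0.6, True), (0.9, True)]
holdout = [(0.3, False), (0.7, True)]
acc, name, model = select_best(
    [("threshold", train_threshold), ("majority", train_majority)],
    train, holdout)
```

Real platforms run this same loop over dozens of algorithms and cross-validated splits, but the structure (train, score on held-out data, rank) is unchanged.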
Hyperparameter optimization (HPO)
Each algorithm has dozens of settings: learning rate, maximum tree depth, regularization strength, number of estimators. The optimal combination is dataset-specific. AutoML uses techniques like Bayesian optimization (SMAC, Optuna), bandit-based methods (Hyperband), or evolutionary search to explore the hyperparameter space efficiently. This reliably squeezes 1-3% additional accuracy from a given model.
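The search loop behind HPO can be sketched in a few lines. This is plain random search over a hypothetical validation-loss surface, purely illustrative; production systems replace the sampling strategy with Bayesian optimization or Hyperband, but the shape of the loop is the same.

```python
import random

# Sketch of the search loop inside HPO. The objective below is a made-up
# stand-in for "validation loss at this hyperparameter configuration".

def validation_loss(learning_rate, max_depth):
    # hypothetical response surface: best near lr=0.1, depth=6
    return (learning_rate - 0.1) ** 2 + 0.01 * (max_depth - 6) ** 2

def random_search(n_trials, seed=0):
    rng = random.Random(seed)
    best_loss, best_cfg = float("inf"), None
    for _ in range(n_trials):
        cfg = {
            "learning_rate": 10 ** rng.uniform(-3, 0),  # log-uniform sample
            "max_depth": rng.randint(2, 12),
        }
        loss = validation_loss(**cfg)
        if loss < best_loss:
            best_loss, best_cfg = loss, cfg
    return best_loss, best_cfg

loss, cfg = random_search(200)
```

Smarter samplers reach a good region in far fewer trials than random search, which is where the 1-3% accuracy gain comes from on a fixed compute budget.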
Neural Architecture Search (NAS)
For deep learning models, AutoML can search over network architectures: number of layers, hidden dimensions, activation functions, skip connections. Google's NASNet, discovered by AutoML, achieved state-of-the-art performance on ImageNet in 2018. NAS is computationally expensive but has produced architectures that humans did not design.
Model ensembling
Combining predictions from multiple models (stacking, blending, weighted averaging) almost always improves accuracy. AutoML automates the ensemble construction, selecting which models to combine and how to weight them. AutoGluon's default stacking ensemble is one of its strongest features.
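Weighted averaging, the simplest of these techniques, can be sketched directly. The model names, holdout scores, and predicted probabilities below are made up for illustration.

```python
# Sketch of weighted-average ensembling: combine per-model predicted
# probabilities, with weights proportional to each model's holdout score.
# All numbers are illustrative, not from any real run.

holdout_scores = {"xgboost": 0.83, "random_forest": 0.79, "logreg": 0.74}
predictions = {  # P(convert) for one lead, per model
    "xgboost": 0.71,
    "random_forest": 0.64,
    "logreg": 0.58,
}

total = sum(holdout_scores.values())
weights = {m: s / total for m, s in holdout_scores.items()}
ensemble_p = sum(weights[m] * predictions[m] for m in predictions)
```

Stacking goes a step further by training a second-level model on the base models' predictions instead of fixing the weights by hand, which is what AutoGluon automates.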
The 80% AutoML cannot touch
To see the limitation concretely, consider a B2B SaaS company using AutoML for lead scoring. The data lives across three CRM tables.
leads
| lead_id | company | industry | employees | source |
|---|---|---|---|---|
| LD-401 | Meridian Corp | Financial Services | 2,400 | Webinar |
| LD-402 | BrightPath Inc | Healthcare | 180 | Organic Search |
| LD-403 | Vertex Dynamics | Manufacturing | 8,500 | Referral |
| LD-404 | CloudScale Labs | Technology | 45 | PPC Ad |
activities
| activity_id | lead_id | type | date | duration |
|---|---|---|---|---|
| ACT-01 | LD-401 | Demo request | 2025-10-05 | --- |
| ACT-02 | LD-401 | Pricing page visit | 2025-10-06 | 4 min |
| ACT-03 | LD-401 | Case study download | 2025-10-07 | --- |
| ACT-04 | LD-402 | Blog visit | 2025-09-20 | 1 min |
| ACT-05 | LD-403 | Demo completed | 2025-10-12 | 38 min |
| ACT-06 | LD-403 | Security review request | 2025-10-14 | --- |
| ACT-07 | LD-404 | Free trial signup | 2025-11-01 | --- |
Notice: LD-401 shows a rapid progression (demo, pricing, case study in 3 days), while LD-403 is at the enterprise stage (demo plus security review). These sequences matter, but AutoML never sees them.
conversions
| lead_id | stage | deal_value | closed_date |
|---|---|---|---|
| LD-401 | Negotiation | $86,000 | --- |
| LD-403 | Security Review | $420,000 | --- |
| LD-404 | Free Trial | $0 | --- |
Highlighted: LD-403 is a $420K deal in security review. AutoML sees 'stage = Security Review' as a flat feature. It cannot see that the demo-to-security-review progression in 2 days signals urgency.
The machine learning pipeline has five stages. AutoML addresses stages 3 and 4. Stages 1 and 2 are where the time goes.
Stage 1: Problem definition (5%). What are we predicting? What business decision does this inform? What is the target variable? This requires domain expertise, not automation.
Stage 2: Data preparation and feature engineering (60-80%). This is the gap. Your prediction data lives across multiple relational tables. The AutoML platform needs a single flat table. Bridging that gap requires: mapping the relational schema, writing SQL joins, selecting aggregation functions and time windows, engineering derived features, handling missing values, encoding categoricals, and iterating based on model feedback. A Stanford study measured this at 12.3 hours and 878 lines of code per prediction task.
Stage 3: Model selection and training (5-10%). AutoML handles this well.
Stage 4: Hyperparameter tuning (5-10%). AutoML handles this well.
Stage 5: Deployment and monitoring (10-15%). Some AutoML platforms offer deployment features, but production ML infrastructure varies widely across organizations.
The flat table assumption
Every AutoML platform, without exception, requires input as a flat table. DataRobot's documentation: "Upload a dataset (CSV, Parquet, or connect to a data source)." H2O's documentation: "Import your data as a single H2OFrame." Google AutoML Tables: "Your dataset must be a single BigQuery table or CSV."
This is not a design oversight. It is an architectural constraint. AutoML was built on top of traditional ML frameworks (scikit-learn, XGBoost, TensorFlow) that all assume flat tabular input. Changing this would require a fundamentally different architecture, one that operates on relational structure natively.
What gets lost in flattening
When a data scientist flattens the three CRM tables above into a single feature table for AutoML, this is what the platform actually receives.
flat_feature_table (what AutoML sees)
| lead_id | industry | employees | source | activity_count | days_since_last | avg_duration |
|---|---|---|---|---|---|---|
| LD-401 | Financial Services | 2,400 | Webinar | 3 | 4 | 4 min |
| LD-402 | Healthcare | 180 | Organic Search | 1 | 52 | 1 min |
| LD-403 | Manufacturing | 8,500 | Referral | 2 | 27 | 38 min |
| LD-404 | Technology | 45 | PPC Ad | 1 | 0 | 0 min |
AutoML sees activity_count = 3 for LD-401 and activity_count = 2 for LD-403. It cannot see that LD-401's 3 activities happened on 3 consecutive days (demo, pricing, case study), a rapid buying sequence, or that LD-403's demo lasted 38 minutes and was followed by a security review, an enterprise buying signal. The sequence and the activity types are gone.
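The flattening itself is a short aggregation loop. Here is a sketch over the activities table above, showing exactly where the sequence information disappears:

```python
from datetime import date

# Sketch of the flattening step AutoML requires: aggregate the activities
# table down to one row per lead. The activity types and the ordering of
# dates are dropped; only counts and a recency date survive.

activities = [
    ("LD-401", "Demo request", date(2025, 10, 5)),
    ("LD-401", "Pricing page visit", date(2025, 10, 6)),
    ("LD-401", "Case study download", date(2025, 10, 7)),
    ("LD-402", "Blog visit", date(2025, 9, 20)),
    ("LD-403", "Demo completed", date(2025, 10, 12)),
    ("LD-403", "Security review request", date(2025, 10, 14)),
    ("LD-404", "Free trial signup", date(2025, 11, 1)),
]

flat = {}
for lead_id, activity_type, when in activities:
    row = flat.setdefault(lead_id, {"activity_count": 0, "last_date": when})
    row["activity_count"] += 1
    row["last_date"] = max(row["last_date"], when)

# After this loop, two leads with the same activity_count are
# indistinguishable, regardless of what the activities were or when.
```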
Three categories of signal are destroyed in this process. Let's look at each one with the actual CRM data above.
Signal destroyed: multi-hop relationships
LD-403 (Vertex Dynamics) was referred by an existing customer. That referring customer previously closed a $280K deal and renewed twice. Their referral history has a 70% close rate. This signal exists in the CRM: lead → referral source → past deals → renewal history. Three hops.
AutoML sees source = Referral. One flat string. It cannot distinguish a referral from a customer who churned from a referral from your best account. The multi-hop context (who referred them, what is that referrer's track record) is invisible.
what AutoML sees vs. what exists in the relational data
| Signal | AutoML flat table | Relational graph (3 hops) |
|---|---|---|
| Referral quality | source = 'Referral' | Referred by Acct #2847, who closed $280K, renewed 2x, and has 70% referral close rate |
| Buying stage | activity_count = 2 | Demo (38 min) followed by Security Review request in 2 days. Enterprise buying pattern. |
| Industry pattern | industry = 'Manufacturing' | 3 of last 5 Manufacturing deals >$200K closed in Q4. Seasonal enterprise budget cycle. |
Same lead, three different views. AutoML operates on the left column. The right column is what exists in the database but never makes it into the flat table.
Signal destroyed: temporal sequences
LD-401 (Meridian Corp) has activity_count = 3. But those 3 activities happened on October 5, 6, and 7: demo request, pricing page, case study download. Three days. That is a buyer in active evaluation, accelerating toward a decision.
Compare that to a hypothetical lead with 3 activities spread over 3 months: a blog visit in August, a webinar in September, a whitepaper in October. Same activity_count = 3. Completely different intent. The first lead is hot. The second is lukewarm. AutoML sees identical numbers for both.
two leads with activity_count = 3 (identical in flat table)
| Lead | Activity 1 | Activity 2 | Activity 3 | Span | Intent |
|---|---|---|---|---|---|
| LD-401 | Demo request (Oct 5) | Pricing page (Oct 6) | Case study (Oct 7) | 3 days | Hot: accelerating buyer |
| LD-999 | Blog visit (Aug 12) | Webinar (Sep 18) | Whitepaper (Oct 22) | 71 days | Lukewarm: passive research |
AutoML sees activity_count = 3 for both. The 3-day sprint versus the 71-day drift is the difference between an $86K deal closing this quarter and a lead that needs 6 more months of nurturing. The count destroys the sequence.
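The point is easy to verify mechanically. A sketch using the dates from the table above:

```python
from datetime import date

# Both leads collapse to activity_count = 3 in the flat table. The span
# between first and last activity, the actual intent signal, is what
# differs, and it never reaches the model.

activity_dates = {
    "LD-401": [date(2025, 10, 5), date(2025, 10, 6), date(2025, 10, 7)],
    "LD-999": [date(2025, 8, 12), date(2025, 9, 18), date(2025, 10, 22)],
}

flat_view = {lead: len(d) for lead, d in activity_dates.items()}
span_days = {lead: (max(d) - min(d)).days for lead, d in activity_dates.items()}

# flat_view is identical for both leads; span_days differs by a factor of ~35
```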
Signal destroyed: graph topology
LD-401's company (Meridian Corp, Financial Services, 2,400 employees) is similar in size, industry, and buying pattern to 4 other accounts in the CRM. Of those 4 similar accounts, 3 closed deals averaging $95K in the last two quarters. That is a graph signal: Meridian is embedded in a cluster of converting accounts.
LD-404's company (CloudScale Labs, Technology, 45 employees) is also similar to 4 other accounts. But those 4 similar accounts all churned or never converted. CloudScale is embedded in a cluster of non-converting accounts.
account neighborhood (graph topology)
| Lead | Similar accounts | Their outcomes | Cluster signal |
|---|---|---|---|
| LD-401 (Meridian) | 4 FinServ accounts, 1K-5K employees | 3 of 4 closed ($78K, $95K, $112K) | High-conversion cluster |
| LD-404 (CloudScale) | 4 Tech startups, 20-80 employees | 0 of 4 converted (all churned from trial) | Non-converting cluster |
AutoML scores both leads based on their individual attributes. It cannot see that Meridian sits in a neighborhood of buyers while CloudScale sits in a neighborhood of churners. This peer signal often outweighs individual features.
AutoML reduces these network neighborhoods to flat attributes: industry = Financial Services, employees = 2400. The shape of the surrounding accounts, their conversion history, their similarity patterns: all of it is invisible after flattening. Graph-based models see it natively because they operate on the connection structure.
AutoML approach
- Requires a pre-built flat feature table
- Automates model selection and HPO (20% of work)
- Feature engineering is still manual (80% of work)
- Cannot see multi-hop or temporal signals lost in flattening
- 62.44 AUROC (LightGBM + manual features on RelBench)
Foundation model approach
- Reads relational database directly, no flat table needed
- Eliminates feature engineering entirely (the 80%)
- Discovers multi-hop, temporal, and graph patterns automatically
- Same model handles any prediction task on the database
- 76.71 AUROC (KumoRFM zero-shot on RelBench)
The vendors know this
AutoML vendors are not unaware of the feature engineering gap. Their response has been to add "automated feature engineering" modules, but these are limited to single-table transformations: binning, polynomial features, log transforms, interaction terms. They operate on the flat table that already exists. They do not generate features from the relational structure upstream.
Featuretools, an open-source library from Alteryx, goes further. It defines "Deep Feature Synthesis," which automatically generates features from multi-table relational data by applying aggregation primitives across join paths. This is a better approach, but it has its own problems: it produces thousands of features (most of them noise), requires a separate feature selection step, and is limited to predefined aggregation functions. It cannot learn new types of patterns the way a neural network can.
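The idea behind Deep Feature Synthesis can be sketched without the library: walk a join path and apply every aggregation primitive to every numeric column reachable along it. This toy version uses only the timed activities from the example above and three illustrative primitives; a real run multiplies primitives across every column and join path, which is how the feature count explodes.

```python
# Sketch of the Deep Feature Synthesis idea: for each lead, follow the
# join path leads -> activities and apply a fixed set of aggregation
# primitives to a numeric column. Primitive names are illustrative.

# duration in minutes, only for activities that had one in the source table
activities = [
    ("LD-401", 4), ("LD-402", 1), ("LD-403", 38),
]

primitives = {
    "count": len,
    "max": max,
    "mean": lambda xs: sum(xs) / len(xs),
}

features = {}
for lead_id in {"LD-401", "LD-402", "LD-403", "LD-404"}:
    values = [v for lid, v in activities if lid == lead_id]
    for name, fn in primitives.items():
        key = f"activities.duration.{name}"
        # leads with no timed activities get None, a missing value the
        # downstream model must then handle
        features.setdefault(lead_id, {})[key] = fn(values) if values else None
```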
The fundamental issue is that feature engineering from relational data is not an optimization problem (where search is sufficient). It is a representation learning problem (where you need a model that can discover patterns in the relational structure). AutoML treats it as the former. Foundation models treat it as the latter.
PQL Query
PREDICT conversions.stage = 'Closed Won' FOR EACH leads.lead_id
One PQL query replaces the entire lead scoring pipeline. The model reads leads, activities, and conversions as a graph and discovers that activity velocity, sequence patterns, and industry-stage combinations predict conversion.
Output
| lead_id | conversion_probability | top_signal |
|---|---|---|
| LD-401 | 0.74 | Rapid demo-pricing-casestudy sequence |
| LD-402 | 0.12 | Single low-engagement blog visit |
| LD-403 | 0.81 | Enterprise buying signals (demo + security) |
| LD-404 | 0.08 | Small company, PPC source, trial-only |
The accuracy gap that AutoML cannot close
Here is the quantitative argument. On the RelBench benchmark (7 databases, 30 tasks, 103 million rows), the best possible flat-table approach (a skilled human spending 12.3 hours per task to engineer features, then training LightGBM) achieves 62.44 AUROC on classification tasks.
AutoML can improve the model selection and tuning portion of this workflow. Realistically, that adds 1-3 AUROC points. You might reach 64-65 AUROC with perfect model selection on the same flat features.
KumoRFM, which reads the relational structure directly, achieves 76.71 AUROC zero-shot. The gap between 65 and 76.71 is not a model selection gap. It is an information gap. The foundation model sees patterns in the relational structure that flat-table models cannot access regardless of how well they are tuned.
No amount of AutoML optimization can compensate for data that was destroyed before the model saw it.
When AutoML is the right choice
AutoML is genuinely valuable in specific scenarios. If your data already exists as a clean flat table (a Kaggle dataset, a pre-aggregated data warehouse table, sensor readings), AutoML will find a strong model faster than manual experimentation. If your team lacks deep ML expertise and needs to train a model on existing features, AutoML platforms like AutoGluon or H2O reduce the barrier.
AutoML is also useful for rapid prototyping. Upload a quick feature table, see if the signal exists, then decide whether to invest in more sophisticated approaches. It is a good first pass.
Where AutoML falls short is the typical enterprise scenario: data spread across multiple relational tables, questions that require cross-table patterns, temporal signals that aggregation destroys, and a pipeline that must be rebuilt for every new prediction task.
The real automation gap
The ML industry spent a decade automating the easy part and calling it "automated machine learning." Model selection is well-solved. Hyperparameter tuning is well-solved. Ensembling is well-solved. These advances are real and valuable.
But the 80% of work that is feature engineering from relational data remained manual because it requires a fundamentally different approach. You cannot search your way to good features across 15 tables with millions of rows. You need a model that learns representations from the relational structure directly.
That model exists now. Relational foundation models like KumoRFM skip the step that AutoML cannot automate. They read multi-table databases natively, discover predictive patterns through graph transformers, and deliver predictions without any feature engineering or model training. The entire pipeline that AutoML partially automates is replaced by a single inference call.
The question for ML teams is not "which AutoML platform should we use?" It is "why are we still building flat feature tables at all?"