The term "data science agent" entered mainstream vocabulary in 2025 when Databricks, Google, and a wave of startups all shipped AI systems that could write data science code, build models, or deliver predictions with minimal human intervention. The AI agents market hit $7.1 billion in 2025 and is projected to reach $54.8 billion by 2032 at a 33.9% CAGR. Forty percent of Global 2000 companies are expected to use AI agents by 2026.
But "data science agent" means very different things depending on who is selling it. Some agents write Python code in a notebook. Some build drag-and-drop models from a CSV. Some read your relational database and deliver predictions directly. These are not incremental differences. They represent fundamentally different philosophies about what should be automated.
The headline result: SAP SALT benchmark
The SAP SALT benchmark is an enterprise-grade evaluation in which business analysts and data scientists attempt prediction tasks on SAP enterprise data. It measures how accurately different approaches predict real business outcomes on production-quality databases with multiple related tables.
sap_salt_enterprise_benchmark
| approach | accuracy | what_it_means |
|---|---|---|
| LLM + AutoML | 63% | Language model generates features, AutoML selects model |
| PhD Data Scientist + XGBoost | 75% | Expert spends weeks hand-crafting features, tunes XGBoost |
| KumoRFM (zero-shot) | 91% | No feature engineering, no training, reads relational tables directly |
SAP SALT benchmark: KumoRFM outperforms expert data scientists by 16 percentage points and LLM+AutoML by 28 percentage points on real enterprise prediction tasks.
KumoRFM scores 91% where PhD-level data scientists, after weeks of feature engineering and hand-tuning XGBoost, score 75%. The 16-percentage-point gap is the value of reading relational data natively instead of flattening it into a single table.
Three types of data science agents
The data science agent landscape breaks into three distinct categories, each automating a different part of the workflow:
- Code-generating agents that write Python, SQL, and notebook code for you (Databricks Genie, Sphinx, Google DS-STAR). These target data scientists and automate the typing, not the thinking.
- No-code prediction platforms that let business users build models without writing code (Julius AI, Akkio, Obviously AI). These target analysts and automate simple, single-table predictions.
- Foundation model agents that understand relational data structure and deliver predictions directly (Kumo). These target the prediction itself and automate the entire pipeline from raw tables to output.
data_science_agent_landscape_2026
| agent | approach | data_types | autonomous | multi_table | production_ready |
|---|---|---|---|---|---|
| Kumo AI | Foundation model + agent | Relational (multi-table) | Full pipeline | Yes (native) | Yes |
| Databricks Genie | Code generation | Flat tables/notebooks | Assisted | Manual joins | Yes |
| Sphinx | Jupyter copilot | Flat tables/notebooks | Assisted | Manual joins | Early |
| Google DS-STAR | Multi-agent framework | Flat tables | Autonomous | Manual joins | Research |
| Julius AI | Chat-to-analysis | CSV/connections | Assisted | No | Yes (analytics only) |
| Akkio | No-code drag-drop | CSV upload | Assisted | No | Yes (simple models) |
Highlighted: Kumo is the only agent that natively understands multi-table relational data. All other agents operate on flat tables, requiring manual joins to combine information across tables.
Code-generating agents
Code-generating agents are the most visible category in 2026. They sit inside notebooks and IDEs, watch what you are doing, and write Python or SQL code to help. The pitch: a senior data scientist's productivity, available to everyone.
Databricks Genie Code
Databricks Genie Code is integrated into Databricks notebooks. It claims twice the success rate of leading coding agents on data science tasks and a 60-80% reduction in processing time. It generates Python and SQL, executes it in the notebook, and iterates based on results. For teams already on Databricks, it is the most frictionless entry point.
Sphinx
Sphinx raised $9.5 million in seed funding from Lightspeed and Bessemer Venture Partners. It is Jupyter-native, meaning it works inside the existing notebook workflow that most data scientists already use. It generates code cells, explains its reasoning, and can iterate on errors.
Google DS-STAR
Google's DS-STAR is a multi-agent framework where specialized agents plan, code, and verify data science tasks. It represents the most autonomous approach in the code-generating category, with agents that can decompose complex tasks into subtasks and verify their own outputs. It remains a research project for now.
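The plan-code-verify loop that multi-agent frameworks like DS-STAR use can be sketched in a few lines. Everything below is illustrative, not Google's implementation: the hard-coded plan and code snippets stand in for what would be LLM calls in a real system.

```python
# Toy sketch of a plan -> code -> verify loop (illustrative only).
# In a real multi-agent framework, plan() and code() would be LLM calls.
def plan(task):
    # Planner agent: decompose the task into ordered subtasks.
    return ["load data", "compute mean"]

def code(step):
    # Coder agent: emit a code snippet for one subtask.
    return {
        "load data": "data = [1, 2, 3]",
        "compute mean": "result = sum(data) / len(data)",
    }[step]

def verify(env):
    # Verifier agent: check that the workspace contains the expected output.
    return "result" in env

env = {}
for step in plan("average the column"):
    exec(code(step), env)  # each subtask's code runs in a shared workspace
assert verify(env)
print(env["result"])  # → 2.0
```

The design point is the separation of roles: the planner never executes code, and the verifier never writes it, which is what lets these frameworks decompose complex tasks and check their own outputs.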
The limitation of code-generating agents
Code-generating agents make data scientists faster at writing code. But the bottleneck in enterprise data science is not typing speed. It is knowing which tables to join, which features to compute, and which temporal patterns matter. A code-generating agent can write a LEFT JOIN in seconds, but it cannot tell you whether that join captures the right signal for your prediction task.
These agents still operate on flat tables. They generate code that processes one table at a time, and they rely on the human to specify the multi-table logic. The feature engineering bottleneck (12.3 hours, 878 lines of code per task on RelBench) becomes faster to type but no less intellectually demanding.
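A minimal sketch of the flat-table step such an agent automates, using an invented SQLite schema (the table names, columns, and 90-day window are all assumptions for illustration): the JOIN and aggregation are seconds of typing, but deciding which aggregates carry signal remains the human's job.

```python
import sqlite3

# Illustrative two-table schema; not any real enterprise database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER, signup_day INTEGER);
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL, day INTEGER);
INSERT INTO customers VALUES (1, 0), (2, 5);
INSERT INTO orders VALUES (10, 1, 30.0, 50), (11, 1, 20.0, 80), (12, 2, 90.0, 60);
""")

# Two hand-picked features: order count and spend over an assumed 90-day window.
# An agent writes this instantly -- but choosing WHICH features to compute,
# and repeating this for hundreds of candidates, is the actual bottleneck.
rows = conn.execute("""
SELECT c.customer_id,
       COUNT(o.order_id)         AS orders_90d,
       COALESCE(SUM(o.amount), 0) AS spend_90d
FROM customers c
LEFT JOIN orders o
  ON o.customer_id = c.customer_id AND o.day < 90
GROUP BY c.customer_id
ORDER BY c.customer_id
""").fetchall()
print(rows)  # → [(1, 2, 50.0), (2, 1, 90.0)]
```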
No-code prediction platforms
No-code platforms take the opposite approach: instead of making data scientists faster, they try to eliminate the need for data scientists entirely by giving business users point-and-click model building.
Julius AI
Julius AI offers a chat interface where users describe what they want to predict, upload a CSV or connect to a data source, and get a model back. Pricing ranges from free to $70/month, and it holds SOC 2 Type II certification. It works well for exploratory analytics and simple predictions on a single dataset.
Akkio
Akkio starts at $49/month and provides a no-code drag-and-drop interface for building predictive models. Users upload a CSV, select a target column, and Akkio trains and deploys a model. It is designed for marketing teams, small businesses, and agencies that need quick predictions without a data science team.
The limitation of no-code platforms
No-code platforms work on a single flat table. They cannot read a relational database with customers, orders, products, and support tickets linked by foreign keys. They cannot discover that a customer's churn risk depends on the return rate of products they purchased, because they never see the products table.
For enterprise use cases where the predictive signal lives in the relationships between tables, not within a single table, no-code platforms miss the most important patterns. They are useful for quick analytics on flat exports, but they cannot replace a production ML pipeline.
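A toy pandas example of a cross-table signal that a single-CSV tool can never compute (all table contents are invented for illustration): the churn feature lives in the products table, two joins away from the customer.

```python
import pandas as pd

# Three linked tables. A no-code platform that only sees the customers
# table (or a flat CSV export of it) never observes return_rate at all.
customers = pd.DataFrame({"customer_id": [1, 2], "churned": [1, 0]})
orders = pd.DataFrame({"order_id": [10, 11, 12],
                       "customer_id": [1, 1, 2],
                       "product_id": ["A", "B", "B"]})
products = pd.DataFrame({"product_id": ["A", "B"], "return_rate": [0.40, 0.05]})

# The cross-table feature: average return rate of products each customer bought.
feature = (orders.merge(products, on="product_id")
                 .groupby("customer_id")["return_rate"]
                 .mean()
                 .rename("avg_return_rate")
                 .reset_index())
flat = customers.merge(feature, on="customer_id")
print(flat)  # churned customer 1 bought the high-return product
```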
Foundation model agents
Foundation model agents represent a fundamentally different approach. Instead of generating code or building simple models, they understand relational data structure natively and deliver predictions directly.
Kumo
Kumo's agent is built on KumoRFM, a foundation model pre-trained on thousands of diverse relational databases. It represents your database as a temporal heterogeneous graph, where each row becomes a node, each foreign key becomes an edge, and timestamps are preserved as temporal attributes. A graph transformer processes this structure, learning which cross-table patterns are predictive.
The result: you describe what you want to predict in a single query, and the model reads your raw relational tables, discovers the relevant features, and returns predictions. No code, no flat tables, no manual feature engineering.
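The rows-as-nodes, foreign-keys-as-edges idea can be sketched directly. This is a simplified illustration of the representation, not Kumo's implementation; in the real model, timestamps and column values are carried as node attributes.

```python
# Toy relational data: two tables linked by one foreign key (invented values).
tables = {
    "customers": [{"customer_id": 1}, {"customer_id": 2}],
    "orders": [
        {"order_id": 10, "customer_id": 1, "ts": "2026-01-03"},
        {"order_id": 11, "customer_id": 2, "ts": "2026-01-07"},
    ],
}
foreign_keys = [("orders", "customer_id", "customers")]  # (child, fk column, parent)

# Each row becomes a node, identified by (table, row index).
nodes = [(t, i) for t, rows in tables.items() for i in range(len(rows))]

# Each foreign-key value becomes an edge from child row to parent row.
edges = []
for child, fk, parent in foreign_keys:
    parent_index = {row[fk]: i for i, row in enumerate(tables[parent])}
    for i, row in enumerate(tables[child]):
        edges.append(((child, i), (parent, parent_index[row[fk]])))

print(len(nodes), len(edges))  # → 4 2
```

A graph transformer then operates on this node/edge structure directly, which is why no one has to decide in advance which joins to materialize.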
Code-generating agent workflow
- Agent writes SQL to join tables into flat table
- Agent writes Python to compute features
- Agent writes code to train and tune models
- Human reviews, iterates, and debugs each step
- Output: a model trained on a manually-defined feature table
Foundation model agent workflow
- User describes prediction task in one query
- Agent reads raw relational tables directly
- Model discovers cross-table features automatically
- Predictions returned in seconds, no iteration needed
- Output: predictions from the full relational data structure
PQL Query
PREDICT churn_90d FOR EACH customers.customer_id WHERE customers.segment = 'enterprise'
One PQL (Predictive Query Language) query replaces the entire code-generating agent workflow. No notebook code, no flat table construction, no model selection. The foundation model reads raw relational tables (customers, orders, support_tickets, product_usage) and delivers predictions in 1 second.
Output
| customer_id | churn_prob | top_signal | time_to_predict |
|---|---|---|---|
| C-4401 | 0.87 | Support tickets up 3x, product usage down 40% | 1 sec |
| C-4402 | 0.12 | Expanding seats, high feature adoption | 1 sec |
| C-4403 | 0.64 | Contract renewal in 30d, declining engagement | 1 sec |
| C-4404 | 0.03 | Multi-department usage, recent expansion | 1 sec |
AUROC (Area Under the Receiver Operating Characteristic curve) measures how well a model distinguishes between positive and negative outcomes. An AUROC of 50 corresponds to random guessing; 100 is perfect prediction. Moving from 65 to 77 AUROC means the model correctly ranks a true positive above a true negative 77% of the time instead of 65%.
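That pairwise-ranking definition translates directly into code. This toy function (scaled 0-100 to match the tables here, with invented scores) counts, over every positive/negative pair, how often the positive is scored higher, with ties counting half:

```python
def auroc(labels, scores):
    """AUROC as the probability a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return 100 * wins / (len(pos) * len(neg))  # 0-100 scale, as in the tables

# Toy example: one positive is ranked above all negatives, the other above 2 of 3.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2]
print(auroc(labels, scores))  # → 83.33... (5 of 6 pairs ranked correctly)
```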
time_and_accuracy_comparison
| approach | time_to_prediction | AUROC | multi_table_support | human_hours_per_task |
|---|---|---|---|---|
| Code-generating agent + flat table | Hours to days | ~62-66 | Manual joins (agent-written) | 4-8 |
| No-code platform + CSV | Minutes | ~55-62 | None | 0.5-1 |
| KumoRFM zero-shot | 1 second | 76.71 | Native (automatic) | 0.001 |
| KumoRFM fine-tuned | Minutes (tuning) + 1 second | 81.14 | Native (automatic) | 0.1 |
Highlighted: KumoRFM delivers higher accuracy in less time because it reads relational data directly. The 10+ AUROC point gap over code-generating agents reflects the value of understanding data structure, not just writing code faster.
What to look for in a data science agent
Not all data science agents solve the same problem. When evaluating agents for enterprise use, these are the criteria that separate tools that speed up the workflow from tools that transform it:
- Multi-table relational data support. Does the agent read multiple related tables natively, or does it require someone to flatten the data first? If your predictive signal lives in relationships between tables (and in enterprise data, it almost always does), this is the most important criterion.
- Feature discovery vs. code generation. Does the agent discover which features matter, or does it just write code to compute features you specify? The first automates the thinking. The second automates the typing.
- Time to first prediction. Can you go from a new prediction task to a usable model in seconds, hours, or weeks? Agents that require notebook iteration cycles still have latency measured in hours or days.
- Production readiness. Can the agent's output run in a production pipeline with monitoring, retraining, and drift detection? Research prototypes and analytics-only tools often require a separate production engineering effort.
- Accuracy on relational benchmarks. Look at performance on multi-table benchmarks like RelBench, not single-table Kaggle datasets. Single-table benchmarks do not test the feature discovery that matters most in enterprise data.
- Autonomy level. How much human intervention does each prediction task require? Fully autonomous agents deliver predictions from a query. Assisted agents require iterative review and debugging of generated code.