Clinical Trial Enrollment
“Which sites will meet enrollment targets?”
Book a demo and get a free trial of the full platform: research agent, fine-tune capabilities, and forward-deployed engineer support.
A real-world example
Which sites will meet enrollment targets?
80% of clinical trials fail to meet enrollment timelines. Each day of delay costs a sponsor $600K-$8M in lost patent life. A Phase III trial with 150 sites where 40% underperform wastes $50M in site management costs alone. Site selection today relies on investigator surveys and historical spreadsheets, missing the network dynamics between investigators, referring physicians, and patient populations.
Quick answer
AI predicts clinical trial site enrollment performance by connecting study details, site characteristics, investigator networks, patient catchment data, and competing trial activity into a relational graph. 80% of clinical trials fail to meet enrollment timelines, costing sponsors $600K-$8M per day of delay in lost patent life. Graph-based models identify underperforming sites 60 days earlier than traditional tracking, saving $50M in reallocation costs and accelerating enrollment by 4 months.
Approaches compared
4 ways to solve this problem
1. Investigator Surveys and Historical Data
Site selection based on investigator self-reported patient access, prior trial experience, and institutional track records. The traditional approach for Phase I-III planning. Heavily reliant on the investigator's own estimate of how many patients they can enroll.
Best for
Early feasibility assessment when you need a rough estimate of which regions and institutions to target.
Watch out for
Investigators consistently overestimate their enrollment capacity by 30-50%. Their estimates do not account for competing trials, seasonal patient availability, or changes in their referral network since the last trial.
2. Historical Site Performance Databases
Vendor databases (Citeline, Medidata) that track historical enrollment rates by site, investigator, and therapeutic area. Provides empirical performance data rather than self-reported estimates.
Best for
Ranking sites by demonstrated enrollment speed for similar trials. Data-driven alternative to investigator surveys.
Watch out for
Historical performance does not predict future performance when conditions change. A site that enrolled 40 patients in 90 days for the last oncology trial may now have 3 competing trials active, a new institutional review process, or a principal investigator who moved to a different institution.
3. Statistical Enrollment Models
Poisson regression or survival models trained on historical enrollment data to predict site-level enrollment rates. Accounts for therapeutic area, site size, and geographic factors.
Best for
Portfolio-level enrollment forecasting where you need aggregate predictions across 100+ sites for supply planning and milestone tracking.
Watch out for
Cannot capture network dynamics: investigator-referrer relationships, competing trial cannibalization, or the effect of the medical monitor's relationship with the PI on site engagement. These factors drive the 3-5x variation between top and bottom quartile sites.
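As a concrete illustration of the statistical approach, here is a minimal single-site sketch using a homogeneous Poisson process (all numbers hypothetical; production models add covariates for therapeutic area, site size, and geography):

```python
import math

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), summed directly."""
    return sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k + 1))

def forecast_site(enrolled_so_far, days_active, horizon_days, target):
    """Estimate a site's enrollment rate and its chance of hitting target.

    Rate is the MLE of a homogeneous Poisson process: enrollments / elapsed days.
    """
    lam_per_day = enrolled_so_far / days_active
    expected = lam_per_day * horizon_days
    # P(meet target) = P(X >= target) = 1 - P(X <= target - 1)
    p_meet = 1.0 - poisson_cdf(target - 1, expected)
    return expected, p_meet

# Illustrative numbers, not from a real trial: 9 patients in the first 30 days
expected, p_meet = forecast_site(enrolled_so_far=9, days_active=30,
                                 horizon_days=90, target=40)
print(f"expected 90-day enrollment: {expected:.0f}, P(meet target): {p_meet:.3f}")
```

The sketch shows the core limitation called out above: the rate is fit to the site's own history, so a new competing trial or a shrinking referral network changes nothing in the forecast until enrollment has already slipped.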
4. Graph Neural Networks (Kumo's Approach)
Connects studies, sites, investigators, referral networks, patient catchment areas, and competing trials into a relational graph. Predicts per-site enrollment by learning from the full network of factors that drive patient access.
Best for
Identifying the network factors that separate high-enrolling sites from low-enrolling ones: investigator-referrer relationships, competing trial saturation, and patient catchment dynamics. Predicts performance for new sites with new investigators by transferring learning from similar network structures.
Watch out for
Requires access to investigator network data, competing trial registrations, and patient catchment demographics. ClinicalTrials.gov data is public, but detailed referral network data may require third-party sources.
Key metric: 80% of clinical trials miss enrollment timelines. Graph-based models identify underperforming sites 60 days earlier, saving $50M in reallocation costs and recovering 4 months of enrollment time.
Why relational data changes the answer
Flat enrollment models see each site as an independent row: investigator publications, prior trial count, institution tier, therapeutic area. They can predict that a site with an experienced investigator at a major academic center will enroll well on average. But they cannot see that this specific investigator's referral network has shrunk (their top 3 referring PCPs retired or moved), that 3 competing oncology trials opened at the same institution last quarter, that the screen failure rate in the first 30 days is running at 62% (suggesting enrollment criteria mismatch with the local patient population), and that the PI has only 2 connected PCPs versus 8 at the top-performing site. These network signals explain why Site SITE02 at Mayo Clinic is tracking to 30% of target while Site SITE01 at Mass General is at 85%.
Relational learning maps the trial ecosystem. The model walks from site to investigator to their referral network (how many PCPs refer patients, and are those PCPs still active), to competing trials at the same institution (are they recruiting from the same patient pool), to patient catchment demographics (does the local population match the enrollment criteria). It learns that sites where the PI has co-published with the medical monitor, maintains active referral relationships with 5+ PCPs in high-prevalence zip codes, and has no more than 1 competing trial in the same therapeutic area enroll 2.3x faster. These relational patterns are invisible to any model that scores sites as independent rows.
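The network signals described above can be sketched as simple per-site features. The site IDs, PCP names, and counts below are hypothetical, chosen to mirror the SITE01 vs SITE02 contrast in the text:

```python
# Sketch of the network features a flat, row-per-site model cannot see.
# All data hypothetical: PI -> referring PCPs, with an active/inactive flag.
referrals = {
    "SITE01": [("PCP_A", True), ("PCP_B", True), ("PCP_C", True),
               ("PCP_D", True), ("PCP_E", True), ("PCP_F", True),
               ("PCP_G", True), ("PCP_H", True)],
    "SITE02": [("PCP_X", True), ("PCP_Y", True), ("PCP_Z", False)],
}
# Competing trials in the same therapeutic area at the same institution
competing_trials = {"SITE01": 1, "SITE02": 3}

def network_features(site_id):
    """Count active referring PCPs and competing trials for one site."""
    active_pcps = sum(1 for _, active in referrals[site_id] if active)
    return {"active_pcps": active_pcps,
            "competing_trials": competing_trials[site_id]}

for site in ("SITE01", "SITE02"):
    print(site, network_features(site))
```

Both sites might look identical on investigator publications and institution tier; the relational features are what separate the 85%-of-target site from the 30%-of-target site.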
Predicting trial enrollment from site profiles is like predicting a restaurant's opening-month revenue based on the chef's resume and the neighborhood demographics. You miss that three competing restaurants just opened on the same block, the chef's sous chef (who ran the kitchen day-to-day) just left, and the local food blogger who drove traffic to their last restaurant does not cover this neighborhood. Enrollment depends on the network around the site, not just the site itself.
How KumoRFM solves this
Graph-learned clinical intelligence across your entire patient network
Kumo builds a graph connecting studies, sites, investigators, and patient catchment areas. It learns that sites where the principal investigator has co-published with the medical monitor and has referring relationships with 5+ PCPs in high-prevalence ZIP codes enroll 2.3x faster. The model captures investigator network effects, competing trial cannibalization, and seasonal patient availability patterns.
From data to predictions
See the full pipeline in action
Connect your tables, write a PQL query, and get predictions with built-in explainability — all in minutes, not months.
Your data
The relational tables Kumo learns from
STUDIES
| study_id | therapeutic_area | phase | target_enrollment |
|---|---|---|---|
| STU001 | Oncology | Phase III | 1200 |
| STU002 | Cardiology | Phase II | 450 |
SITES
| site_id | study_id | institution | region | activated_date |
|---|---|---|---|---|
| SITE01 | STU001 | Mass General | Northeast | 2025-01-15 |
| SITE02 | STU001 | Mayo Clinic | Midwest | 2025-01-20 |
| SITE03 | STU002 | Cleveland Clinic | Midwest | 2025-02-01 |
INVESTIGATORS
| investigator_id | site_id | name | publications | prior_trials |
|---|---|---|---|---|
| INV01 | SITE01 | Dr. Chen | 47 | 12 |
| INV02 | SITE02 | Dr. Patel | 23 | 6 |
| INV03 | SITE03 | Dr. Lopez | 31 | 9 |
PATIENTS
| patient_id | site_id | screened_date | enrolled | screen_fail_reason |
|---|---|---|---|---|
| PT01 | SITE01 | 2025-02-10 | Y | |
| PT02 | SITE01 | 2025-02-15 | N | Exclusion criteria |
| PT03 | SITE02 | 2025-02-20 | Y | |
Write your PQL query
Describe what to predict in 2–3 lines — Kumo handles the rest
PREDICT COUNT(PATIENTS.* WHERE PATIENTS.ENROLLED = 'Y', 0, 90, days) FOR EACH SITES.SITE_ID
Prediction output
Every entity gets a score, updated continuously
| SITE_ID | STUDY_ID | PREDICTED_ENROLLED_90D | TARGET_PCT |
|---|---|---|---|
| SITE01 | STU001 | 34 | 85% |
| SITE02 | STU001 | 12 | 30% |
| SITE03 | STU002 | 28 | 93% |
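TARGET_PCT is simply the predicted 90-day enrollment divided by the site's 90-day target. The per-site targets below are hypothetical, back-solved to be consistent with the table:

```python
# TARGET_PCT = predicted 90-day enrollment / 90-day site target.
# Targets are hypothetical values consistent with the output table above.
targets_90d = {"SITE01": 40, "SITE02": 40, "SITE03": 30}
predicted = {"SITE01": 34, "SITE02": 12, "SITE03": 28}

target_pct = {site: round(100 * predicted[site] / targets_90d[site])
              for site in predicted}
print(target_pct)  # SITE02, far below target, is the reallocation candidate
```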
Understand why
Every prediction includes feature attributions — no black boxes
Site SITE02 (Mayo Clinic, STU001)
Predicted: 12 patients in 90 days (30% of target)
Top contributing features
Investigator prior trial enrollment rate
0.4x avg
29% attribution
Competing trials at same institution
3 active
24% attribution
Screen failure rate (first 30d)
62%
20% attribution
Referral network size (connected PCPs)
2 PCPs
15% attribution
Patient catchment prevalence
Low
12% attribution
Feature attributions are computed automatically for every prediction. No separate tooling required. Learn more about Kumo explainability
PQL Documentation
Learn the Predictive Query Language — SQL-like syntax for defining any prediction task in 2–3 lines.
Python SDK
Integrate Kumo predictions into your pipelines. Train, evaluate, and deploy models programmatically.
Explainability Docs
Understand feature attributions, model evaluation metrics, and how to build trust with stakeholders.
Frequently asked questions
Common questions about clinical trial enrollment
Why do 80% of clinical trials fail to meet enrollment timelines?
Enrollment failures stem from three connected factors: overestimated site capacity (investigators predict 30-50% more patients than they deliver), competing trial cannibalization (multiple trials recruiting from the same patient pool), and referral network gaps (the PI lacks active relationships with referring physicians who see eligible patients). Traditional site selection misses these network dynamics because it evaluates sites in isolation.
How much does clinical trial delay cost pharmaceutical companies?
Each day of Phase III delay costs $600K-$8M in lost patent life, depending on the drug's projected revenue. A trial with 150 sites where 40% underperform wastes $50M in site management costs. Identifying underperforming sites 60 days earlier through predictive models allows sponsors to reallocate resources and activate backup sites, recovering 3-4 months of enrollment time.
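The delay math can be checked directly from the per-day figures above (a back-of-envelope sketch, not a pricing model; the actual per-day cost depends on the drug's projected revenue):

```python
# Value of recovering enrollment days, using the $600K-$8M/day range from the text.
def delay_cost(days_recovered, cost_per_day_low=600_000, cost_per_day_high=8_000_000):
    """Return the (low, high) dollar value of recovered patent life."""
    return days_recovered * cost_per_day_low, days_recovered * cost_per_day_high

# Roughly 3 months of enrollment time recovered
low, high = delay_cost(days_recovered=90)
print(f"${low/1e6:.0f}M - ${high/1e6:.0f}M in recovered patent life")
```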
How can AI improve clinical trial site selection?
AI improves site selection by evaluating the network around each site, not just the site itself. Graph-based models assess investigator referral networks, competing trial activity, patient catchment demographics, and screen failure patterns to predict which sites will meet enrollment targets. Sites with strong referral networks and low competing-trial density enroll 2.3x faster than isolated sites with experienced but disconnected investigators.
What data sources improve clinical trial enrollment predictions?
The most predictive data sources are investigator referral networks (which PCPs refer patients to the PI), competing trial registrations (ClinicalTrials.gov), patient catchment demographics (disease prevalence by zip code), and early enrollment signals (screen failure rates in the first 30 days). Historical enrollment databases (Citeline, Medidata) provide a baseline, but network data is what separates accurate from average predictions.
Bottom line: A Phase III trial sponsor identifying underperforming sites 60 days earlier saves $50M in reallocation costs and accelerates enrollment by 4 months. Kumo captures investigator networks and competing trial dynamics that spreadsheet-based site selection misses.
Related use cases
Explore more healthcare use cases
Topics covered
One Platform. One Model. Infinite Predictions.
KumoRFM
Relational Foundation Model
Turn structured relational data into predictions in seconds. KumoRFM delivers zero-shot predictions that rival months of traditional data science. No training, feature engineering, or infrastructure required. Just connect your data and start predicting.
For critical use cases, fine-tune KumoRFM on your data using the Kumo platform and Research Agent for 30%+ higher accuracy than traditional models.