
Clinical Trial Enrollment

Which sites will meet enrollment targets?



A real-world example

Which sites will meet enrollment targets?

80% of clinical trials fail to meet enrollment timelines. Each day of delay costs a sponsor $600K-$8M in lost patent life. A Phase III trial with 150 sites where 40% underperform wastes $50M in site management costs alone. Site selection today relies on investigator surveys and historical spreadsheets, missing the network dynamics between investigators, referring physicians, and patient populations.

Quick answer

AI predicts clinical trial site enrollment performance by connecting study details, site characteristics, investigator networks, patient catchment data, and competing trial activity into a relational graph. 80% of clinical trials fail to meet enrollment timelines, costing sponsors $600K-$8M per day of delay in lost patent life. Graph-based models identify underperforming sites 60 days earlier than traditional tracking, saving $50M in reallocation costs and accelerating enrollment by 4 months.

Approaches compared

4 ways to solve this problem

1. Investigator Surveys and Historical Data

Site selection based on investigator self-reported patient access, prior trial experience, and institutional track records. The traditional approach for Phase I-III planning. Heavily reliant on the investigator's own estimate of how many patients they can enroll.

Best for

Early feasibility assessment when you need a rough estimate of which regions and institutions to target.

Watch out for

Investigators consistently overestimate their enrollment capacity by 30-50%. Their estimates do not account for competing trials, seasonal patient availability, or changes in their referral network since the last trial.

2. Historical Site Performance Databases

Vendor databases (Citeline, Medidata) that track historical enrollment rates by site, investigator, and therapeutic area. Provides empirical performance data rather than self-reported estimates.

Best for

Ranking sites by demonstrated enrollment speed for similar trials. Data-driven alternative to investigator surveys.

Watch out for

Historical performance does not predict future performance when conditions change. A site that enrolled 40 patients in 90 days for the last oncology trial may now have 3 competing trials active, a new institutional review process, or a principal investigator who moved to a different institution.

3. Statistical Enrollment Models

Poisson regression or survival models trained on historical enrollment data to predict site-level enrollment rates. Accounts for therapeutic area, site size, and geographic factors.

Best for

Portfolio-level enrollment forecasting where you need aggregate predictions across 100+ sites for supply planning and milestone tracking.

Watch out for

Cannot capture network dynamics: investigator-referrer relationships, competing trial cannibalization, or the effect of the medical monitor's relationship with the PI on site engagement. These factors drive the 3-5x variation between top and bottom quartile sites.
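For contrast with the graph approach that follows, the Poisson baseline can be sketched in a few lines: the maximum-likelihood rate for a Poisson count within a stratum (here, therapeutic area) is simply the stratum mean, which is exactly why such models predict "average" performance and miss site-specific network effects. All numbers below are illustrative.

```python
# Minimal Poisson-rate baseline for site enrollment forecasting.
# The MLE of a Poisson rate per stratum is the stratum mean count.
from collections import defaultdict

# (therapeutic_area, patients_enrolled_in_90d) for past sites -- made-up history
history = [
    ("Oncology", 30), ("Oncology", 42), ("Oncology", 24),
    ("Cardiology", 18), ("Cardiology", 26),
]

totals, counts = defaultdict(int), defaultdict(int)
for area, enrolled in history:
    totals[area] += enrolled
    counts[area] += 1

# lambda_hat per area: expected 90-day enrollment for any new site in that area
rates = {area: totals[area] / counts[area] for area in totals}

def predict_90d(area):
    """Expected 90-day enrollment for a new site in this therapeutic area."""
    return rates[area]

print(predict_90d("Oncology"))    # 32.0 -- the same number for every oncology site
print(predict_90d("Cardiology"))  # 22.0
```

Every oncology site gets the same 32-patient forecast; nothing in the model can explain a 3-5x spread between sites in the same stratum.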

4. Graph Neural Networks (Kumo's Approach)

Connects studies, sites, investigators, referral networks, patient catchment areas, and competing trials into a relational graph. Predicts per-site enrollment by learning from the full network of factors that drive patient access.

Best for

Identifying the network factors that separate high-enrolling sites from low-enrolling ones: investigator-referrer relationships, competing trial saturation, and patient catchment dynamics. Predicts performance for new sites with new investigators by transferring learning from similar network structures.

Watch out for

Requires access to investigator network data, competing trial registrations, and patient catchment demographics. ClinicalTrials.gov data is public, but detailed referral network data may require third-party sources.

Key metric: 80% of clinical trials miss enrollment timelines. Graph-based models identify underperforming sites 60 days earlier, saving $50M in reallocation costs and recovering 4 months of enrollment time.

Why relational data changes the answer

Flat enrollment models see each site as an independent row: investigator publications, prior trial count, institution tier, therapeutic area. They can predict that a site with an experienced investigator at a major academic center will enroll well on average. What they cannot see: this specific investigator's referral network has shrunk (their top 3 referring PCPs retired or moved); 3 competing oncology trials opened at the same institution last quarter; the screen failure rate in the first 30 days is running at 62%, suggesting the enrollment criteria mismatch the local patient population; and the PI has only 2 connected PCPs versus 8 at the top-performing site. These network signals explain why SITE02 at Mayo Clinic is tracking to 30% of target while SITE01 at Mass General is at 85%.

Relational learning maps the trial ecosystem. The model walks from site to investigator to their referral network (how many PCPs refer patients, and are those PCPs still active), to competing trials at the same institution (are they recruiting from the same patient pool), to patient catchment demographics (does the local population match the enrollment criteria). It learns that sites where the PI has co-published with the medical monitor, maintains active referral relationships with 5+ PCPs in high-prevalence zip codes, and has no more than 1 competing trial in the same therapeutic area enroll 2.3x faster. These relational patterns are invisible to any model that scores sites as independent rows.
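The graph walk described above can be made concrete with a toy sketch. The tables, referral edges, and competing-trial counts below are invented for illustration; a production graph would be built from your warehouse and third-party network data.

```python
# Toy relational walk: site -> PI -> referral network, site -> institution -> rivals.
sites = {"SITE01": {"institution": "Mass General"},
         "SITE02": {"institution": "Mayo Clinic"}}
investigators = {"INV01": {"site_id": "SITE01"},
                 "INV02": {"site_id": "SITE02"}}
# PI -> referring primary-care physicians (hypothetical referral edges)
referrals = {"INV01": ["PCP1", "PCP2", "PCP3", "PCP4", "PCP5", "PCP6"],
             "INV02": ["PCP7", "PCP8"]}
# institution -> competing trials in the same therapeutic area (hypothetical)
competing = {"Mass General": 1, "Mayo Clinic": 3}

def network_features(site_id):
    """Walk two hops out from a site and summarize what the flat model misses."""
    pi = next(i for i, row in investigators.items() if row["site_id"] == site_id)
    inst = sites[site_id]["institution"]
    return {"connected_pcps": len(referrals[pi]),
            "competing_trials": competing[inst]}

print(network_features("SITE02"))  # {'connected_pcps': 2, 'competing_trials': 3}
```

Two identical-looking rows in a flat site table produce very different feature vectors once the walk reaches their referral networks and institutional context.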

Predicting trial enrollment from site profiles is like predicting a restaurant's opening-month revenue based on the chef's resume and the neighborhood demographics. You miss that three competing restaurants just opened on the same block, the chef's sous chef (who ran the kitchen day-to-day) just left, and the local food blogger who drove traffic to their last restaurant does not cover this neighborhood. Enrollment depends on the network around the site, not just the site itself.

How KumoRFM solves this

Graph-learned clinical intelligence across your entire patient network

Kumo builds a graph connecting studies, sites, investigators, and patient catchment areas. It learns that sites where the principal investigator has co-published with the medical monitor and has referring relationships with 5+ PCPs in high-prevalence ZIP codes enroll 2.3x faster. The model captures investigator network effects, competing trial cannibalization, and seasonal patient availability patterns.

From data to predictions

See the full pipeline in action

Connect your tables, write a PQL query, and get predictions with built-in explainability — all in minutes, not months.

Step 1: Your data

The relational tables Kumo learns from

STUDIES

study_id | therapeutic_area | phase | target_enrollment
STU001 | Oncology | Phase III | 1200
STU002 | Cardiology | Phase II | 450

SITES

site_id | study_id | institution | region | activated_date
SITE01 | STU001 | Mass General | Northeast | 2025-01-15
SITE02 | STU001 | Mayo Clinic | Midwest | 2025-01-20
SITE03 | STU002 | Cleveland Clinic | Midwest | 2025-02-01

INVESTIGATORS

investigator_id | site_id | name | publications | prior_trials
INV01 | SITE01 | Dr. Chen | 47 | 12
INV02 | SITE02 | Dr. Patel | 23 | 6
INV03 | SITE03 | Dr. Lopez | 31 | 9

PATIENTS

patient_id | site_id | screened_date | enrolled | screen_fail_reason
PT01 | SITE01 | 2025-02-10 | Y |
PT02 | SITE01 | 2025-02-15 | N | Exclusion criteria
PT03 | SITE02 | 2025-02-20 | Y |
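A minimal sketch of one early signal these tables carry: the screen-failure rate per site, computed over the toy PATIENTS rows above. In practice this would come from an EDC/CTMS feed; the point is that the label-bearing signal already lives in the raw relational rows.

```python
# Early-warning signal: screen-failure rate per site from PATIENTS-style rows.
# Rows mirror the toy table above -- (patient_id, site_id, screened_date, enrolled, fail_reason).
patients = [
    ("PT01", "SITE01", "2025-02-10", "Y", None),
    ("PT02", "SITE01", "2025-02-15", "N", "Exclusion criteria"),
    ("PT03", "SITE02", "2025-02-20", "Y", None),
]

def screen_fail_rate(site_id):
    """Fraction of screened patients at this site who failed screening."""
    rows = [p for p in patients if p[1] == site_id]
    fails = [p for p in rows if p[3] == "N"]
    return len(fails) / len(rows) if rows else 0.0

print(screen_fail_rate("SITE01"))  # 0.5
```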
Step 2: Write your PQL query

Describe what to predict in 2–3 lines — Kumo handles the rest

PQL
PREDICT COUNT(PATIENTS.*, 0, 90, days)
FOR EACH SITES.SITE_ID
WHERE PATIENTS.ENROLLED = 'Y'
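For intuition, here is what that PQL target computes, restated as plain Python over the toy PATIENTS rows: per-site counts of enrolled patients in a 90-day window from an anchor date. The window logic is an illustrative assumption; Kumo learns to predict this count ahead of time rather than tallying it after the fact.

```python
# Plain-Python restatement of the PQL target: COUNT of enrollments per site
# over a 90-day window starting at an anchor date. Toy data, for illustration.
from datetime import date, timedelta

patients = [
    ("PT01", "SITE01", date(2025, 2, 10), "Y"),
    ("PT02", "SITE01", date(2025, 2, 15), "N"),
    ("PT03", "SITE02", date(2025, 2, 20), "Y"),
]

def enrolled_in_window(site_id, anchor, days=90):
    """Count patients enrolled at a site within [anchor, anchor + days)."""
    end = anchor + timedelta(days=days)
    return sum(1 for _, sid, d, enrolled in patients
               if sid == site_id and enrolled == "Y" and anchor <= d < end)

print(enrolled_in_window("SITE01", date(2025, 2, 1)))  # 1
```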
Step 3: Prediction output

Every entity gets a score, updated continuously

SITE_ID | STUDY_ID | PREDICTED_ENROLLED_90D | TARGET_PCT
SITE01 | STU001 | 34 | 85%
SITE02 | STU001 | 12 | 30%
SITE03 | STU002 | 28 | 93%
Step 4: Understand why

Every prediction includes feature attributions — no black boxes

Site SITE02 -- Mayo Clinic, STU001

Predicted: 12 patients in 90 days (30% of target)

Top contributing features

Feature | Value | Attribution
Investigator prior trial enrollment rate | 0.4x avg | 29%
Competing trials at same institution | 3 active | 24%
Screen failure rate (first 30d) | 62% | 20%
Referral network size (connected PCPs) | 2 PCPs | 15%
Patient catchment prevalence | Low | 12%

Feature attributions are computed automatically for every prediction. No separate tooling required. Learn more about Kumo explainability

Frequently asked questions

Common questions about clinical trial enrollment

Why do 80% of clinical trials fail to meet enrollment timelines?

Enrollment failures stem from three connected factors: overestimated site capacity (investigators predict 30-50% more patients than they deliver), competing trial cannibalization (multiple trials recruiting from the same patient pool), and referral network gaps (the PI lacks active relationships with referring physicians who see eligible patients). Traditional site selection misses these network dynamics because it evaluates sites in isolation.

How much does clinical trial delay cost pharmaceutical companies?

Each day of Phase III delay costs $600K-$8M in lost patent life, depending on the drug's projected revenue. A trial with 150 sites where 40% underperform wastes $50M in site management costs. Identifying underperforming sites 60 days earlier through predictive models allows sponsors to reallocate resources and activate backup sites, recovering 3-4 months of enrollment time.
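A hedged back-of-envelope on those figures: multiplying the quoted per-day cost range by an assumed 90-day recovery (the low end of "3-4 months") brackets the $50M savings claim. Only the per-day range and recovery window come from the text; the rest is arithmetic.

```python
# Back-of-envelope delay cost, using the per-day range quoted above.
cost_per_day_low, cost_per_day_high = 600_000, 8_000_000  # lost patent life per day
days_recovered = 90  # assumed: low end of the "3-4 months" recovery window

low = days_recovered * cost_per_day_low
high = days_recovered * cost_per_day_high
print(f"${low / 1e6:.0f}M - ${high / 1e6:.0f}M recovered")  # $54M - $720M recovered
```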

How can AI improve clinical trial site selection?

AI improves site selection by evaluating the network around each site, not just the site itself. Graph-based models assess investigator referral networks, competing trial activity, patient catchment demographics, and screen failure patterns to predict which sites will meet enrollment targets. Sites with strong referral networks and low competing-trial density enroll 2.3x faster than isolated sites with experienced but disconnected investigators.

What data sources improve clinical trial enrollment predictions?

The most predictive data sources are investigator referral networks (which PCPs refer patients to the PI), competing trial registrations (ClinicalTrials.gov), patient catchment demographics (disease prevalence by zip code), and early enrollment signals (screen failure rates in the first 30 days). Historical enrollment databases (Citeline, Medidata) provide a baseline, but network data is what separates accurate from average predictions.

Bottom line: A Phase III trial sponsor identifying underperforming sites 60 days earlier saves $50M in reallocation costs and accelerates enrollment by 4 months. Kumo captures investigator networks and competing trial dynamics that spreadsheet-based site selection misses.

Topics covered

clinical trial enrollment prediction · site selection AI · trial recruitment optimization · investigator performance model · patient enrollment forecasting · graph neural network clinical trials · KumoRFM clinical trials · pharma trial optimization · enrollment rate prediction

One Platform. One Model. Infinite Predictions.

KumoRFM

Relational Foundation Model

Turn structured relational data into predictions in seconds. KumoRFM delivers zero-shot predictions that rival months of traditional data science. No training, feature engineering, or infrastructure required. Just connect your data and start predicting.

For critical use cases, fine-tune KumoRFM on your data using the Kumo platform and Research Agent for 30%+ higher accuracy than traditional models.

Book a demo and get a free trial of the full platform: research agent, fine-tuning capabilities, and forward-deployed engineer support.