Clinical trial cost modelling with NLP and AI

Modelling risk and cost in clinical trials with NLP

Fast Data Science’s Clinical Trial Risk Tool

Clinical trials are a vital part of bringing new drugs to market, but planning and running them can be a complex and expensive process. A key part of this planning is accurately estimating the cost and risk of a trial. Traditionally, this has involved a team of experts manually sifting through lengthy clinical trial protocols, often hundreds of pages long.

Fast Data Science is continuing to develop the Clinical Trial Risk Tool, which uses natural language processing (NLP) to analyse clinical trial protocols and automatically extract crucial cost and risk factors.

Clinical trial protocols

A clinical trial protocol is drafted before the trial begins and governs how the trial is run. It is a long document, often around 200 pages in PDF format. A protocol describes the objectives and design of the trial, provides the rationale and background for the study, and must adhere to the principles of good clinical practice.

Estimating the cost or the risk of a clinical trial is a difficult problem, because it is a business assessment that needs a team of experts who must read the protocol thoroughly. This is usually done manually inside pharma companies, funding organisations, or contract research organisations (CROs). Protocols are typically written in dense, highly specialised language, but they are fundamentally natural language rather than a structured or tabular format. This makes the problem well suited to AI and NLP within healthcare.

What is NLP and how does it work?

Natural language processing (NLP) is the sub-field of AI concerned with enabling computers to understand and process human language. The Clinical Trial Risk Tool uses NLP to “read” clinical trial protocols and extract key information such as the type of treatment, the condition being studied (pathology), and the number of participants required. This extracted data is then used to estimate potential risks and costs associated with the trial.
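To make the idea of extracting fields from protocol text more concrete, here is a minimal Python sketch. It is not the Clinical Trial Risk Tool's actual method: the regular expressions, the keyword lists and the function names extract_sample_size and detect_condition are all assumptions made up for illustration, and a real protocol phrases these things in far more ways than a few patterns can cover.

```python
import re

# Hypothetical patterns for illustration only; real protocols state the
# sample size in many more ways than these three.
SAMPLE_SIZE_PATTERNS = [
    re.compile(r"sample size of (\d[\d,]*)", re.IGNORECASE),
    re.compile(
        r"(\d[\d,]*)\s+(?:participants|subjects|patients)\s+will be\s+"
        r"(?:enrolled|recruited|randomi[sz]ed)",
        re.IGNORECASE,
    ),
    re.compile(
        r"enrol(?:l)?\s+(\d[\d,]*)\s+(?:participants|subjects|patients)",
        re.IGNORECASE,
    ),
]

# Hypothetical keyword lists, one per condition label.
CONDITION_KEYWORDS = {
    "HIV": ["hiv", "antiretroviral"],
    "TB": ["tuberculosis", "mycobacterium"],
}


def extract_sample_size(text):
    """Return the first sample size found in the text, or None."""
    for pattern in SAMPLE_SIZE_PATTERNS:
        match = pattern.search(text)
        if match:
            return int(match.group(1).replace(",", ""))
    return None


def detect_condition(text):
    """Return the condition whose keywords occur most often, or None."""
    lowered = text.lower()
    counts = {}
    for label, keywords in CONDITION_KEYWORDS.items():
        # Whole-word matching, so that "hiv" does not match inside "archive".
        counts[label] = sum(
            len(re.findall(r"\b" + re.escape(keyword) + r"\b", lowered))
            for keyword in keywords
        )
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


example = (
    "This phase 3 study of tuberculosis treatment will enrol "
    "1,200 participants across four sites."
)
print(extract_sample_size(example), detect_condition(example))
```

Rules like these are easy to inspect, which is useful when an extracted value feeds into a cost or risk estimate, but they are brittle, and a production system would typically combine them with trained models.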

Estimate the risk or cost of a clinical trial

Try the Clinical Trial Risk Tool

You can try the free HIV/TB version of the tool online at https://clinicaltrialrisk.org/tool, and you can contact us to discuss our ongoing work on cost/risk modelling for other pathologies.


Benefits of the Clinical Trial Risk Tool

Enter AI in pharma:

Faster and More Efficient Planning: By automating the analysis of lengthy protocols, the Clinical Trial Risk Tool saves companies and organisations significant time and resources during the planning stages of a trial.
Improved Cost Estimation: Extracting key factors from the protocol allows for a more accurate prediction of trial costs, leading to better budgeting and resource allocation.
Reduced Risk: Identifying potential risks early in the planning process allows for mitigation strategies to be developed, reducing the chance of costly delays or failures.

We originally developed the Clinical Trial Risk Tool for tuberculosis and HIV trial cost estimation. It has since been extended to cover other disease indications, including COVID, cystic fibrosis, enteric and diarrheal diseases, influenza, malaria, motor neurone disease, multiple sclerosis, neglected tropical diseases, oncology, and polio.

Open source and collaborative development

The initial version of the Clinical Trial Risk Tool, focused on HIV and TB trials in low- and middle-income countries, is completely open source. This allows for collaboration and further development by the wider scientific community.

Fast Data Science is actively seeking partners in the pharmaceutical industry, such as funders, pharma companies, MedTech, research organisations, and CROs, to expand the tool’s capabilities to cover a wider range of pathologies, trial phases, and intervention types. A major challenge is access to confidential industry data on cost of trials, as many protocols are not publicly available on repositories such as ClinicalTrials.gov or EudraCT, and cost data is usually not publicised.

You can download and run the source code at https://github.com/fastdatascience/clinical_trial_risk.

The future of clinical trial cost modelling

Fast Data Science is constantly improving the Clinical Trial Risk Tool, including the development of regression models to predict the dollar cost of running a trial based on historical data. This combination of NLP and machine learning holds great promise for streamlining clinical trial planning, reducing costs, and ultimately accelerating the development of new treatments.

A regression line for the cost of running a clinical trial
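As a rough illustration of what fitting a regression line to historical trial costs involves, the Python sketch below fits an ordinary least-squares line with a single predictor. It is only a sketch under stated assumptions: the choice of participant count as the predictor, the function names fit_line and predict_cost, and the numbers themselves are invented placeholders, not real trial data and not the model used in the tool.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_cost(slope, intercept, participants):
    """Predicted cost in dollars for a trial with the given enrolment."""
    return intercept + slope * participants


# Invented placeholder values for illustration only, NOT real cost data.
participants = [100, 200, 300, 400]
cost_usd = [1_500_000, 2_500_000, 3_500_000, 4_500_000]

slope, intercept = fit_line(participants, cost_usd)
print(predict_cost(slope, intercept, 250))
```

In practice a cost model would use several predictors extracted from the protocol, such as phase, number of sites and duration, and would be validated against held-out historical trials.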

Download the pitch deck for the Clinical Trial Risk Tool

What about generative AI and GPT?

Documents in clinical research are often highly confidential. We have not used generative models such as GPT-4 or Google Gemini in this project: they are often not fast enough for our needs, they would struggle with documents as long as clinical trial protocols, and they do not perform as well on highly domain-specific tasks.

Furthermore, we are aware that research organisations may not consent to our sending their data to a third-party generative model. For that reason, the CTRT runs entirely on our own cloud platform (Microsoft and Amazon servers) in a secure environment. The model is open source, communication is over HTTPS, and you can be sure that we are not sending your data to a third-party generative AI company. There is the option of self-hosting a generative model, but this would not overcome the other limitations of generative AI for this use case.

For our work with generative AI, please check out our Insolvency Bot, an interesting application of generative AI in the legal domain which uses retrieval augmented generation (RAG) to provide answers in a highly specialised domain, and also our experiments with generative AI detection.

Coverage of the Clinical Trial Risk Tool on other sites

This work was supported, in whole or in part, by the Bill and Melinda Gates Foundation [INV-050345], and we are very grateful for this support.

An article describing the tool has been published at: Wood TA and McNair D. Clinical Trial Risk Tool: software application using natural language processing to identify the risk of trial uninformativeness. Gates Open Res 2023, 7:56 (https://doi.org/10.12688/gatesopenres.14416.1).

The tool also won 🥇 first place in the Plotly Dash App Challenge in 2023:

Thank you to all #PlotlyCommunity members who participated in the recent #Dash Example Apps Challenge, and congratulations to the winning submissions!

🥇 Clinical Trial Risk Dash App by Thomas Wood
🥈 SARIMA Tuner by Gabriele Albini
🥉 Product Environmental Report Dash App by…
— Plotly (@plotlygraphs) May 22, 2023

How to cite the Clinical Trial Risk Tool?

If you would like to cite the tool alone, you can cite:

Wood TA and McNair D. Clinical Trial Risk Tool: software application using natural language processing to identify the risk of trial uninformativeness. Gates Open Res 2023, 7:56 doi: 10.12688/gatesopenres.14416.1.

A BibTeX entry for LaTeX users is:

@article{Wood_2023,
	doi = {10.12688/gatesopenres.14416.1},
	url = {https://doi.org/10.12688%2Fgatesopenres.14416.1},
	year = 2023,
	month = {apr},
	publisher = {F1000 Research Ltd},
	volume = {7},
	pages = {56},
	author = {Thomas A Wood and Douglas McNair},
	title = {Clinical Trial Risk Tool: software application using natural language processing to identify the risk of trial uninformativeness},
	journal = {Gates Open Research}
}