Clinical trial cost modelling with NLP and AI

· Thomas Wood

Modelling risk and cost in clinical trials with NLP

Fast Data Science’s Clinical Trial Risk Tool

Clinical trials are a vital part of bringing new drugs to market, but planning and running them can be a complex and expensive process. A key part of this planning is accurately estimating the cost and risk of a trial. Traditionally, this has involved a team of experts manually sifting through lengthy clinical trial protocols, often hundreds of pages long.

Cost prediction for a clinical trial using NLP

Fast Data Science is continuing to develop the Clinical Trial Risk Tool, which uses natural language processing (NLP) to analyse clinical trial protocols and automatically extract crucial cost and risk factors.

Clinical trial protocols

A clinical trial protocol is an essential part of running a trial and is drafted before the trial begins. It is a long document, often around 200 pages in PDF format. A protocol describes the objectives and design of the trial, provides the rationale and background for the study, and has to adhere to the principles of good clinical practice.

Estimating the cost or the risk of a clinical trial is a difficult problem, because it is a business assessment that requires a team of experts to read the protocol thoroughly. This is usually done manually inside pharma companies, funding organisations, or contract research organisations (CROs). Protocols are typically written in dense, highly specialised language, but they are fundamentally natural language rather than a structured or tabular format. This makes the problem an ideal application for AI and NLP within healthcare.

What is NLP and how does it work?

Natural language processing (NLP) is the sub-field of AI concerned with enabling computers to understand and process human language. The Clinical Trial Risk Tool uses NLP to “read” clinical trial protocols and extract key information such as the type of treatment, the condition being studied (pathology), and the number of participants required. This extracted data is then used to estimate potential risks and costs associated with the trial.
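
To make the idea of extracting information from a protocol concrete, here is a minimal sketch in Python. It is not the method used by the Clinical Trial Risk Tool: the keyword lists, the regular expression, and the function names are all assumptions made up for illustration, and a production system would rely on much richer vocabularies or trained models.

```python
import re

# Hypothetical keyword lists, invented for this sketch only.
CONDITION_KEYWORDS = {
    "HIV": ["hiv", "antiretroviral"],
    "TB": ["tuberculosis", "tb"],
}

# Matches phrases such as "250 participants" or "1,200 subjects".
SAMPLE_SIZE_PATTERN = re.compile(
    r"(\d[\d,]*)\s+(?:participants|subjects|patients|volunteers)",
    re.IGNORECASE,
)


def extract_sample_size(text):
    """Return the largest participant count mentioned in the text, or None."""
    counts = [
        int(match.group(1).replace(",", ""))
        for match in SAMPLE_SIZE_PATTERN.finditer(text)
    ]
    return max(counts) if counts else None


def extract_condition(text):
    """Return the condition whose keywords occur most often, or None."""
    tokens = re.findall(r"[a-z]+", text.lower())
    scores = {
        condition: sum(tokens.count(keyword) for keyword in keywords)
        for condition, keywords in CONDITION_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def extract_features(text):
    """Bundle the extracted fields into one dictionary."""
    return {
        "condition": extract_condition(text),
        "sample_size": extract_sample_size(text),
    }


# A made-up snippet standing in for protocol text.
sample = (
    "This Phase 3 study will enrol 1,200 participants living with HIV. "
    "An interim analysis is planned after 400 participants complete follow-up."
)
print(extract_features(sample))
```

Taking the largest number mentioned is a deliberately crude heuristic for the planned sample size. Real protocols mention many participant counts (interim analyses, sub-studies, prior trials), which is one reason simple rules alone are rarely sufficient.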

Estimate the risk or cost of a clinical trial

Try the Clinical Trial Risk Tool

You can try the free (HIV/TB) version of the tool online at https://app.clinicaltrialrisk.org/, and you can contact us to discuss our ongoing work on cost/risk modelling for other pathologies.

Benefits of the Clinical Trial Risk Tool

  • Faster and More Efficient Planning: By automating the analysis of lengthy protocols, the Clinical Trial Risk Tool saves companies and organisations significant time and resources during the planning stages of a trial.
  • Improved Cost Estimation: Extracting key factors from the protocol allows for a more accurate prediction of trial costs, leading to better budgeting and resource allocation.
  • Reduced Risk: Identifying potential risks early in the planning process allows for mitigation strategies to be developed, reducing the chance of costly delays or failures.

Open source and collaborative development

The initial version of the Clinical Trial Risk Tool, focused on HIV and TB trials in low- and middle-income countries, is completely open source. This allows for collaboration and further development by the wider scientific community.

Fast Data Science is actively seeking partners in and around the pharmaceutical industry, such as funders, pharma companies, MedTech, research organisations, and CROs, to expand the tool’s capabilities to cover a wider range of pathologies, trial phases, and intervention types. A major challenge is access to confidential industry data on the cost of trials, as many protocols are not publicly available on repositories such as ClinicalTrials.gov or EudraCT, and cost data is usually not published.

You can download and run the source code at https://github.com/fastdatascience/clinical_trial_risk.

The future of clinical trial cost modelling

Fast Data Science is constantly improving the Clinical Trial Risk Tool, including the development of regression models to predict the dollar cost of running a trial based on historical data. This combination of NLP and machine learning holds great promise for streamlining clinical trial planning, reducing costs, and ultimately accelerating the development of new treatments.
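
As a rough illustration of what a regression model for cost could look like, the sketch below fits a straight line relating the number of participants to the cost of a trial using ordinary least squares. Everything in it is an assumption for demonstration purposes: the data points are synthetic and generated from an invented rule, the single feature is chosen for simplicity, and the function names are hypothetical. It does not describe the models or data used in the Clinical Trial Risk Tool.

```python
from statistics import mean

# Synthetic, made-up data: these figures do NOT reflect real trial costs.
participants = [100, 250, 400, 800, 1200]

# Costs generated from an invented rule so the example is reproducible:
# 500,000 fixed cost plus 4,000 per participant.
costs = [500_000 + 4_000 * n for n in participants]


def fit_regression_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict_cost(slope, intercept, n_participants):
    """Estimate the cost of a trial with the given number of participants."""
    return slope * n_participants + intercept


slope, intercept = fit_regression_line(participants, costs)
estimate = predict_cost(slope, intercept, 600)
print(f"Estimated cost for 600 participants: ${estimate:,.0f}")
```

Because the synthetic costs follow the invented rule exactly, the fitted line simply recovers that rule. With real historical data the points would scatter around the line, and a practical model would probably combine several extracted features (phase, number of sites, pathology) rather than sample size alone.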

A regression line for the cost of running a clinical trial

What about generative AI and GPT?

Documents in clinical research are often highly confidential. We have not used generative models such as GPT-4 or Google Gemini in this project: they are often not fast enough for our needs, they would struggle with documents as long as those used in pharmaceutical research, and they do not perform as well on highly domain-specific tasks.

Furthermore, we are aware that research organisations may not consent to our sending their data to a third-party generative model. For that reason, the Clinical Trial Risk Tool (CTRT) runs entirely on our own cloud platform (Microsoft and Amazon servers) in a secure environment. The model is open source and communication is over HTTPS, and you can be sure that we are not sending your data to a third-party generative AI company. Self-hosting a generative model is an option, but it would not overcome the other limitations of generative AI for this use case.

For our work with generative AI, please check out our Insolvency Bot, an application of generative AI in the legal domain which uses retrieval-augmented generation (RAG) to provide answers in a highly specialised field, and also our experiments with generative AI detection.

Coverage of the Clinical Trial Risk Tool on other sites

This work was supported, in whole or in part, by the Bill and Melinda Gates Foundation [INV-050345], and we are very grateful for this support.

An article describing the tool has been published at: Wood TA and McNair D. Clinical Trial Risk Tool: software application using natural language processing to identify the risk of trial uninformativeness. Gates Open Res 2023, 7:56 (https://doi.org/10.12688/gatesopenres.14416.1).

The tool also won 🥇 first place in the Plotly Dash App Challenge in 2023.

How to cite the Clinical Trial Risk Tool?

If you would like to cite the tool alone, you can cite:

Wood TA and McNair D. Clinical Trial Risk Tool: software application using natural language processing to identify the risk of trial uninformativeness. Gates Open Res 2023, 7:56 doi: 10.12688/gatesopenres.14416.1.

A BibTeX entry for LaTeX users is:

@article{Wood_2023,
	doi = {10.12688/gatesopenres.14416.1},
	url = {https://doi.org/10.12688/gatesopenres.14416.1},
	year = 2023,
	month = {apr},
	publisher = {F1000 Research Ltd},
	volume = {7},
	pages = {56},
	author = {Thomas A Wood and Douglas McNair},
	title = {Clinical Trial Risk Tool: software application using natural language processing to identify the risk of trial uninformativeness},
	journal = {Gates Open Research}
}
