Machine learning in clinical trials: We developed a clinical trial risk assessment tool using natural language processing for the Gates Foundation, to help experts estimate the risk of a clinical trial ending uninformatively.
We were contacted by the Bill and Melinda Gates Foundation, who wanted a tool to assist reviewers in quantifying the risk of a clinical trial protocol.
Natural language processing
A protocol is a PDF document, typically up to 200 pages long, which contains a complete description of the plan of a trial: where it will take place, how many subjects will be recruited (the sample size), which interventions are to be tested, and how the statistical analysis is to be conducted.
At this point, we had already built a similar solution to identify cost and complexity factors of clinical trials from the protocol text using NLP for the German pharma company Boehringer Ingelheim. You can read more about clinical trial complexity here.
Any organisation planning to fund a clinical trial must examine and stress-test the protocol thoroughly. The cost of running a trial is high and there are many points of potential failure. For example, if the sample size is too small, then the trial will not have sufficient statistical power to deliver an informative result and will not contribute to the body of knowledge of the funding organisation or the scientific community. This is called the risk of the trial ending uninformatively.
Protocols are written in technical English but are not constrained by any particular standard. Protocols from within a given organisation generally follow a rough pattern, but a given data point can be communicated in many ways: the sample size could be referred to as “the number of participants”, written as “N = 90”, or the researchers could simply write “we plan to enroll up to 100 subjects per site” and leave it to the reader to infer the total sample size.
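To make the difficulty concrete, here is a minimal rule-based sketch in Python that collects sample size candidates from free text. The patterns, the `find_sample_size_candidates` function, and the `per_site` flag are hypothetical and written for illustration only; they are not taken from the tool's source code, which combines machine learning with rules.

```python
import re

# One number, with or without thousands separators (e.g. "90" or "1,200").
NUMBER = r"(\d{1,3}(?:,\d{3})+|\d+)"

# Hypothetical patterns for illustration; a real protocol needs many more.
SAMPLE_SIZE_PATTERNS = [
    re.compile(r"\bN\s*=\s*" + NUMBER, re.IGNORECASE),
    re.compile(r"\bsample size (?:of|is|will be)\s+" + NUMBER, re.IGNORECASE),
    re.compile(
        r"\b(?:enrol|recruit)\w*\s+(?:up to\s+|approximately\s+)?"
        + NUMBER
        + r"\s+(?:subjects|participants|patients)",
        re.IGNORECASE,
    ),
]


def find_sample_size_candidates(text):
    """Return every number that looks like a sample size, with its context."""
    candidates = []
    for pattern in SAMPLE_SIZE_PATTERNS:
        for match in pattern.finditer(text):
            following = text[match.end():].lstrip().lower()
            candidates.append({
                "value": int(match.group(1).replace(",", "")),
                "snippet": match.group(0),
                # A per-site figure is not the total sample size.
                "per_site": following.startswith("per site"),
            })
    return candidates


print(find_sample_size_candidates("We plan to enroll up to 100 subjects per site."))
```

Even this small sketch has to flag the per-site case separately, because the number it finds is not the total. That kind of ambiguity is why fixed rules alone tend to fall short on documents with no enforced standard.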
The Gates Foundation needed an NLP model capable of quickly scanning a trial protocol and picking out key factors that could affect the risk of running the trial. They contracted Fast Data Science to build an NLP-based clinical trial risk assessment tool that uses machine learning and AI to assess future clinical trial protocols.
Thomas Wood’s presentation of the Clinical Trial Risk Tool at Plotly’s Dash In Action Webinar in June 2023.
We initially focused on HIV and TB trials, and we are now extending the scope of the Clinical Trial Risk Tool to cover other disease areas such as enteric and diarrheal diseases, influenza, motor neurone disease, multiple sclerosis, neglected tropical diseases, oncology, COVID, cystic fibrosis, malaria, and polio. We also hope to develop the tool further to predict trial cost in dollars.
Over a period of more than a year, we experimented with an ensemble of machine learning and rule-based models to extract features such as the pathology, phase, sample size, number of countries, number of arms, presence or absence of a statistical analysis plan, effect size, and whether simulation had been used to determine the sample size. These parameters feed into a simple linear risk model, and the tool generates a PDF or Excel report which can be shared within the organisation.
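As a rough illustration of what a simple linear risk model over extracted features can look like, here is a minimal Python sketch. The `TrialFeatures` class, the choice of features, and every weight (sign and magnitude) are assumptions made up for this example; they are not the weights or the scoring scale used by the Clinical Trial Risk Tool.

```python
import math
from dataclasses import dataclass


@dataclass
class TrialFeatures:
    """A hypothetical subset of the features extracted from a protocol."""
    sample_size: int
    num_countries: int
    num_arms: int
    has_sap: bool          # statistical analysis plan present
    used_simulation: bool  # simulation used to determine the sample size


# Hypothetical weights: a positive weight raises the risk score, a negative
# weight lowers it. These values are illustrative, not the tool's own.
WEIGHTS = {
    "log_sample_size": -1.0,
    "num_countries": 0.2,
    "num_arms": 0.3,
    "has_sap": -2.0,
    "used_simulation": -1.5,
}
INTERCEPT = 5.0


def risk_score(features):
    """Weighted sum of the features; higher means riskier in this sketch."""
    inputs = {
        # Log scale so that 100 -> 1,000 matters as much as 1,000 -> 10,000.
        "log_sample_size": math.log10(max(features.sample_size, 1)),
        "num_countries": float(features.num_countries),
        "num_arms": float(features.num_arms),
        "has_sap": float(features.has_sap),
        "used_simulation": float(features.used_simulation),
    }
    return INTERCEPT + sum(WEIGHTS[name] * value for name, value in inputs.items())


small_trial = TrialFeatures(sample_size=40, num_countries=3, num_arms=4,
                            has_sap=False, used_simulation=False)
large_trial = TrialFeatures(sample_size=4000, num_countries=3, num_arms=4,
                            has_sap=True, used_simulation=True)
print(risk_score(small_trial) > risk_score(large_trial))
```

A linear model of this kind is easy to explain to reviewers, since each feature's contribution to the score can be read off directly.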
We deployed the NLP clinical trial risk assessment tool to the internet at app.clinicaltrialrisk.org and open-sourced the code under the MIT licence.
The Clinical Trial Risk Tool has enabled the Gates Foundation to assess incoming trials for rapid triage. It has also helped professionals worldwide to make a rough risk assessment of their trials before submitting them for funding.
You can read more about how we developed the clinical trial risk assessment tool in this blog post.
If you would like to cite the tool alone, you can cite:
Wood TA and McNair D. Clinical Trial Risk Tool: software application using natural language processing to identify the risk of trial uninformativeness. Gates Open Res 2023, 7:56 doi: 10.12688/gatesopenres.14416.1.
A BibTeX entry for LaTeX users who wish to cite the clinical trial risk assessment tool is:
@article{Wood_2023,
  doi = {10.12688/gatesopenres.14416.1},
  url = {https://doi.org/10.12688%2Fgatesopenres.14416.1},
  year = 2023,
  month = {apr},
  publisher = {F1000 Research Ltd},
  volume = {7},
  pages = {56},
  author = {Thomas A Wood and Douglas McNair},
  title = {Clinical Trial Risk Tool: software application using natural language processing to identify the risk of trial uninformativeness},
  journal = {Gates Open Research}
}
What we can do for you