Open Source Tools for Natural Language Processing

Open source projects (MIT license)

We have participated in two externally funded projects which produced open-source code and data, available to the public for personal and commercial use.

Harmony

(Github repo) - Harmony is a tool and research project using natural language processing to harmonise mental health data. Read more at https://harmonydata.ac.uk and try the demo at https://harmonydata.ac.uk/app/. Funded by the Wellcome Trust, released under the MIT license and adhering to the FAIR data principles.

  • Wood, T.A., McElroy, E., Moltrecht, B., Ploubidis, G.B., Scopel Hoffmann, M., Harmony [Computer software], Version 1.0, accessed at https://harmonydata.ac.uk/app. Ulster University (2022)
  • McElroy, E., Moltrecht, B., Scopel Hoffmann, M., Wood, T. A., & Ploubidis, G. (2023, January 6). Harmony – A global platform for contextual harmonisation, translation and cooperation in mental health research. Retrieved from osf.io/bct6k

Clinical Trial Risk Tool

(Github repo) - a tool using natural language processing to categorise clinical trial protocols (PDFs) into high, medium or low risk. Read more at https://clinicaltrialrisk.org/ and try the demo at https://app.clinicaltrialrisk.org/.

  • Wood TA and McNair D. Clinical Trial Risk Tool: software application using natural language processing to identify the risk of trial uninformativeness. Gates Open Res 2023, 7:56 doi: 10.12688/gatesopenres.14416.1.

Other open source NLP libraries

In addition to the externally funded projects above, we have provided a number of low-level libraries for specific use cases in natural language processing.

Local Spelling

(Github repo) - a library for localising spelling between US and UK variants. Install from the command line with pip install localspelling
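As a rough illustration of the kind of call involved, a conversion between spelling variants might look like the sketch below. The module and function names (localspelling, convert_to_british, convert_to_us) are assumptions for illustration only, not the confirmed API; check the Github repo's README for the real interface.

```python
# Hypothetical sketch: the module and function names below are assumptions,
# not the library's confirmed API - check the Github repo for details.
from localspelling import convert_to_british, convert_to_us

# If the API matches, the first call should return UK spellings and the
# second US spellings of the same sentence.
print(convert_to_british("The color of the aluminum fiber"))
print(convert_to_us("The colour of the aluminium fibre"))
```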

Country Named Entity Recognition

(Github repo) - a lightweight Python library for recognising country names in unstructured text and returning Pycountry objects. Tutorial here. Install with pip install country_named_entity_recognition
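A minimal usage sketch is shown below. It assumes the find_countries entry point from our reading of the repo's README, returning pairs of a Pycountry object and the regex match; check the Github repo for the exact return format.

```python
# Minimal sketch, assuming the library exposes find_countries and returns
# (pycountry country, regex match) pairs - see the Github repo for details.
from country_named_entity_recognition import find_countries

matches = find_countries("The study recruited participants in France and the United Kingdom.")
for country, match in matches:
    # Pycountry objects expose ISO codes and official names directly.
    print(country.name, country.alpha_2, match.span())
```

Returning Pycountry objects means downstream code can read ISO codes, official names and other country metadata without any extra lookup step.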

Drug Named Entity Recognition

(Github repo) - a lightweight Python library for recognising drug names in unstructured text and performing named entity linking to DrugBank IDs. Tutorial here. Install with pip install drug-named-entity-recognition
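A minimal sketch is shown below. It assumes a find_drugs function that takes a list of tokens and returns drug metadata (including a DrugBank ID) together with the start and end token indices of each match; the exact signature and fields may differ, so check the Github repo.

```python
# Minimal sketch, assuming find_drugs takes a list of tokens and returns
# (drug metadata dict, start index, end index) tuples - check the Github
# repo for the exact signature and available fields.
from drug_named_entity_recognition import find_drugs

tokens = "The patient was prescribed prednisone and ibuprofen".split(" ")
for drug, start, end in find_drugs(tokens):
    # Each match is expected to carry the canonical drug name and a DrugBank ID.
    print(drug.get("name"), drug.get("drugbank_id"), tokens[start:end + 1])
```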

Fast Stylometry

(Github repo) - a Python library for forensic stylometry. Read the tutorial here. Install with pip install faststylometry
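The library is built around the Burrows' Delta method for authorship analysis. A rough sketch of a comparison is below; the Corpus, tokenise_remove_pronouns_en and calculate_burrows_delta names are taken from our reading of the library's documentation and may have changed, so treat this as an illustration rather than the definitive API.

```python
# Sketch of a Burrows' Delta authorship comparison. Class and function names
# are assumptions based on the library's documentation - verify against the
# Github repo before use.
from faststylometry import Corpus
from faststylometry.en import tokenise_remove_pronouns_en
from faststylometry.burrows_delta import calculate_burrows_delta

train_corpus = Corpus()
train_corpus.add_book("Author A", "Known work A", "Full text of a work known to be by Author A...")
train_corpus.add_book("Author B", "Known work B", "Full text of a work known to be by Author B...")

test_corpus = Corpus()
test_corpus.add_book("Unknown", "Disputed work", "Full text of the work whose authorship is in question...")

# Tokenise both corpora, then compute Delta: lower scores suggest a closer
# stylistic match between the disputed text and that author's known texts.
train_corpus.tokenise(tokenise_remove_pronouns_en)
test_corpus.tokenise(tokenise_remove_pronouns_en)
print(calculate_burrows_delta(train_corpus, test_corpus))
```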

Open source software

Open source software is software whose source code is made freely available to the public. It is typically developed and maintained by a community of developers who collaborate to improve it. Open source software is often seen as an alternative to proprietary software, as it is usually free to use and modify. Some of the most popular open source licenses are the MIT License and the Apache License, both of which permit a user to modify the software and use it in commercial applications.

Open source software has become increasingly important in natural language processing as NLP systems grow more complex and reach into more areas of our lives, from household products such as Amazon’s Alexa to industrial applications such as drug discovery and clinical trial risk management in the pharmaceutical sector. Open source natural language processing tools allow developers to collaborate on innovative solutions to NLP problems and can help to reduce the cost of developing NLP systems.

Open data and the FAIR data principles

Open data and the FAIR data principles are two important concepts in the data sharing and data management world. Open data refers to data that is freely available and accessible to the public. The FAIR data principles are a set of guidelines published in the journal Scientific Data in 2016, aiming to ensure data is Findable, Accessible, Interoperable, and Reusable.

  • Findable: Data should be easy for both humans and machines to find, with rich metadata and persistent identifiers.
  • Accessible: Data should be retrievable through open, standardised protocols, even where access to the data itself is controlled.
  • Interoperable: Data should use shared vocabularies and formats so it can be combined and compared with other data sets.
  • Reusable: Data should carry a clear license and provenance information so it is easy to reuse and repurpose.

What we can do for you

Transform Unstructured Data into Actionable Insights

Contact us