Open source software is software whose source code is made freely available to the public. It is typically developed and maintained by a community of developers who collaborate to improve the software and make it available for anyone to use, ideally with no strings attached.
Open source software is often seen as an alternative to proprietary software, as it is usually free to use and modify. The rules about what you are allowed to do with open source software (for example, whether you can make money with it) are set out in a document called the licence. Two of the most popular open source licences are the MIT License and the Apache License, both of which permit a user to modify the software and use it in commercial applications.
Open source software has become increasingly important in natural language processing, as NLP systems grow more complex and reach into more and more areas of our lives, from household uses to industries such as pharmaceuticals, where applications include drug discovery and clinical trial risk management. Open source natural language processing tools allow developers to collaborate on innovative solutions to NLP problems, and can help to reduce the cost of developing NLP systems.
An ideal open source project is a public good, and people may contribute to it to gain experience, to build a portfolio that makes them more attractive to employers, or because the topic is an area of passion for them. Contributors to open source projects can be motivated by extrinsic factors or pure altruism.[1]
At Fast Data Science we have worked on several open source NLP projects in fields from psychology to pharmaceuticals. All our projects use the MIT License.
You can find the source code of all our open source NLP projects either on our GitHub account or on the page for the Harmony project.
It should be pointed out that offerings such as OpenAI’s GPT API, Google Vertex AI, and other commercial services from tech giants are usually closed source. These can be convenient to get started with and may come with dedicated support from the vendor, but they usually carry a cost such as a licence or usage fee.
We have taken a lead role in developing a number of open source NLP projects, which are available to the public for personal and commercial use. These are all under the MIT License, which allows anyone to use our code for commercial purposes without an obligation to make their derivative work also open source.
Interested in other open source libraries? Find out about the third party open source libraries that Fast Data Science recommends here!
Harmony (GitHub repo) - a tool and research project using natural language processing to harmonise mental health data. Read more at https://harmonydata.ac.uk and try the demo at https://harmonydata.ac.uk/app/. Funded by the Wellcome Trust and adhering to the MIT License and FAIR data principles.
Clinical Trial Risk Tool (GitHub repo) - a tool using natural language processing to categorise clinical trial protocols (PDFs) into high, medium or low risk. Read more at https://clinicaltrialrisk.org/ and try the demo at https://app.clinicaltrialrisk.org/.
In addition to the externally funded projects above, we have developed some open source NLP tools aimed at Python developers.
Drug Named Entity Recognition is a lightweight Python library for recognising drug names in unstructured English text and performing named entity linking to DrugBank IDs. It finds identifiers (MeSH, DrugBank, NHS) and even returns the molecular structure of a drug! It’s also available as a Google Sheets plugin.
Install with:
pip install drug-named-entity-recognition
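Once installed, usage looks roughly like this. This is a minimal sketch based on the project’s README: the find_drugs function expects a pre-tokenised sentence, and each match pairs a dictionary of identifiers with token positions, so check the documentation for the exact signature and return format.

```python
# Minimal sketch of Drug Named Entity Recognition usage, based on the
# project README - check the docs for the exact signature and return format.
from drug_named_entity_recognition import find_drugs

# find_drugs expects a list of tokens rather than a raw string.
matches = find_drugs("The patient was prescribed 500 mg of Paracetamol".split(" "))

for match in matches:
    drug_info = match[0]  # a dict of identifiers for the matched drug
    print(drug_info.get("name"), drug_info.get("mesh_id"), drug_info.get("drugbank_id"))
```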
Localspelling is a library for localising spelling between US and UK variants.
Install with:
pip install localspelling
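Below is a hypothetical sketch of what a call might look like. The function name translate_to_british is illustrative only, not a confirmed API; check the localspelling README for the actual entry points.

```python
# Hypothetical sketch: translate_to_british is an assumed entry point,
# not a confirmed API - consult the localspelling README for the real one.
from localspelling import translate_to_british

us_text = "The color of the fiber was analyzed in the center."
print(translate_to_british(us_text))
# Expected output along the lines of:
# "The colour of the fibre was analysed in the centre."
```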
Country Named Entity Recognition is a lightweight Python library for recognising country names in unstructured text and returning pycountry objects. Tutorial here.
Install with:
pip install country_named_entity_recognition
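Usage is along these lines (a short sketch based on the project’s README, where find_countries pairs each match with a pycountry object):

```python
# Sketch of country recognition, based on the project README: each result
# pairs a pycountry Country object with the matching span of text.
from country_named_entity_recognition import find_countries

matches = find_countries("She moved from France to the United Kingdom in 2019.")

for country, match in matches:
    print(country.name, country.alpha_2)  # e.g. "France FR"
```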
Alisa Redding at the University of Helsinki used the tool for her Master’s thesis on mass species extinction and biodiversity.
Fast Stylometry is a Python library for forensic stylometry: identifying the likely author of a text from their writing style. Read the tutorial.
Install with:
pip install faststylometry
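A typical workflow follows the library’s tutorial: build a corpus of texts by known authors, tokenise it, and compare an unknown text against it using Burrows’ Delta. Here is a rough sketch; the folder paths are placeholders and the function names are taken from the tutorial, so double-check them against the current docs.

```python
# Rough sketch of a Burrows' Delta workflow with faststylometry, following
# the library's tutorial. The folder paths below are placeholders.
from faststylometry import Corpus
from faststylometry import load_corpus_from_folder
from faststylometry import tokenise_remove_pronouns_en
from faststylometry import calculate_burrows_delta

# Texts by known candidate authors.
train_corpus = load_corpus_from_folder("data/train")
train_corpus.tokenise(tokenise_remove_pronouns_en)

# The text whose authorship we want to test.
test_corpus = load_corpus_from_folder("data/test")
test_corpus.tokenise(tokenise_remove_pronouns_en)

# A lower delta means the unknown text is stylistically closer to that author.
print(calculate_burrows_delta(train_corpus, test_corpus))
```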
There are many free and open source software (FOSS) tools available for natural language processing. Which one you choose for a project depends on the precise use case.
Tool | Licence | Summary |
---|---|---|
spaCy | MIT | spaCy is a great all-round tool for NLP projects. You can use it for rule-based and pattern-based matching, training text classifiers, custom named entity recognition models, embeddings and transformers, and extracting grammatical relations. We are using spaCy inside our Clinical Trial Risk Tool. |
Natural Language Toolkit (NLTK) | Apache 2.0 | NLTK is a great platform for processing text with Python. It pre-dates neural networks, so it does a lot of “traditional” NLP such as tokenising, stemming, stopwords, dictionaries, etc. It also comes with corpora to import. |
Sentence Transformers at HuggingFace | Depends on model | HuggingFace provides an easy interface to the sentence transformers models, allowing you to run a transformer model on your own machine in a few steps (see the sketch after this table). We are using HuggingFace Sentence Transformers as the backbone of the Harmony project (more information above). |
Scikit-Learn | BSD 3-Clause | A fantastic all-round machine learning library in Python which has some really useful simple classifiers such as Naive Bayes, which can be built as part of a pipeline and serialised and deserialised. |
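To illustrate the Sentence Transformers row above, here is a minimal sketch of comparing two short texts by the cosine similarity of their embeddings. The model name all-MiniLM-L6-v2 is just an example of a small general-purpose model, not necessarily the one Harmony uses.

```python
# Minimal sketch: embed two sentences and compare them by cosine similarity.
# all-MiniLM-L6-v2 is an example of a small general-purpose model.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

embeddings = model.encode(["I feel nervous in crowds", "I feel anxious around people"])

# A high similarity score here shows why embeddings suit tasks like
# matching differently-worded questionnaire items.
print(util.cos_sim(embeddings[0], embeddings[1]))
```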
Here’s an overview of when you might want to use the big tools mentioned (we are only covering Python tools in this article). You can really give your NLP project a head start by choosing the appropriate open source NLP tool as a foundation!
Use case | Tool |
---|---|
Language learning app needing to find grammatical structure in sentences | spaCy |
Simple low-footprint text classifier, e.g. email triage on a serverless app, which doesn’t need to be very sophisticated. Text is in a single language, there are a small number of categories (e.g. fewer than 10), you may not have a huge amount of data, and the categories are easily separable by the presence or absence of key words, e.g. economics vs sport (as opposed to “longitudinal studies in psychology” vs “cohort studies in psychology”, which would be much harder to distinguish). See the sketch after this table. | Scikit-Learn |
Sophisticated text classifier which needs to take into account context of words in sentence; smart AI tool which psychologists can use to compare text data or find similar documents | Sentence Transformers at HuggingFace |
Analysis of N-grams in a corpus, finding clusters in unstructured documents | Natural Language Toolkit (NLTK): this library has a great implementation of Latent Dirichlet Allocation |
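As an example of the Scikit-Learn use case above, here is a minimal sketch of a low-footprint text classifier: a TF-IDF plus Naive Bayes pipeline that can be trained, serialised, and shipped inside a serverless function. The toy training data is ours, purely for illustration.

```python
# Minimal sketch of a low-footprint text classifier: TF-IDF features feeding
# a Naive Bayes model, built as a single serialisable scikit-learn pipeline.
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data: two easily separable categories (economics vs sport).
texts = [
    "stocks fell sharply as inflation rose",
    "the central bank raised interest rates",
    "the striker scored twice in the final",
    "the team won the championship match",
]
labels = ["economics", "economics", "sport", "sport"]

pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipeline.fit(texts, labels)

print(pipeline.predict(["markets reacted to the interest rate decision"]))

# The whole pipeline can be pickled and loaded inside a serverless function.
with open("classifier.pkl", "wb") as f:
    pickle.dump(pipeline, f)
```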
Open data and FAIR data principles are two important concepts in the data sharing and data management world. Open data refers to data that is freely available and accessible to the public. The FAIR data principles are a set of guidelines published in the journal Scientific Data in 2016, aiming to ensure data is Findable, Accessible, Interoperable, and Reusable.
What we can do for you