Open Source Tools for Natural Language Processing

Open source software and natural language processing

Open source software is software that is made freely available to the public. It is typically developed and maintained by a community of developers who collaborate to improve the software and make it available for anyone to use, ideally with no strings attached.

Open source software is often seen as an alternative to proprietary software, as it is usually free to use and modify. The rules about what you’re allowed to do with open source software (e.g. are you allowed to make money with it) are set out in a document called the licence. Some of the most popular open source licences are the MIT License and the Apache License, which both permit a user to modify software and use it in commercial applications.

Open source software has become increasingly important in natural language processing, as NLP systems grow more complex and reach into more and more areas of our lives, from household uses to industrial applications such as drug discovery in pharmaceuticals and clinical trial risk management. Open source natural language processing tools allow developers to collaborate on innovative solutions to NLP problems, and can help to reduce the cost of developing NLP systems.

An ideal open source project is a public good and people may contribute to it to gain experience, to add to their portfolio and be more attractive to employers, or because the topic is an area of passion for them. Contributors on open source projects can be motivated by extrinsic factors or pure altruism.[1]

At Fast Data Science we have worked on several open source NLP projects in fields from psychology to pharmaceuticals. All our projects use the MIT License.

You can find the source code of all our open source NLP projects either on our GitHub account or on the page for the Harmony project.

Unsure which open source NLP tool to use?

We are developing our own open source NLP tools. We will be glad to advise you on open source NLP software.

Closed source natural language processing tools

It should be pointed out that models such as OpenAI’s GPT API, Google’s Vertex AI, and other commercial offerings by tech giants are usually closed source. These can be convenient to get started with and may come with dedicated support from the vendor, but there is usually a cost, such as usage-based fees or a licence fee.

Open source NLP projects run by Fast Data Science under MIT License

We have taken a lead role in developing a number of open source NLP projects, which are available to the public for personal and commercial use. These are all under the MIT License, which allows anyone to use our code for commercial purposes without an obligation to make their derivative work also open source.

Interested in other open source libraries? Find out about the third-party open source libraries that Fast Data Science recommends here!

Harmony: an open source NLP tool for psychologists and social scientists to analyse and discover text data

(GitHub repo) - Harmony is a tool and research project using natural language processing to harmonise mental health data. Read more at https://harmonydata.ac.uk and try the demo at https://harmonydata.ac.uk/app/. Funded by the Wellcome Trust, released under the MIT License and adhering to the FAIR data principles.
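
Harmony is built on sentence embeddings (see the Sentence Transformers entry in the tool list below). As a flavour of the underlying idea, here is a minimal sketch, not Harmony’s actual code, of how two questionnaire items can be compared with the sentence-transformers library; the choice of the all-MiniLM-L6-v2 model is our assumption:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # model choice is our assumption

# Two questionnaire items from different mental health instruments
items = ["I feel nervous or on edge", "I worry a lot of the time"]
embeddings = model.encode(items)

# Cosine similarity close to 1 suggests the items measure similar constructs.
print(util.cos_sim(embeddings[0], embeddings[1]))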

Clinical Trial Risk Tool: an open source NLP tool for analysing clinical trial protocols

(GitHub repo) - a tool using natural language processing to categorise clinical trial protocols (PDFs) as high, medium or low risk. Read more at https://clinicaltrialrisk.org/ and try the demo at https://app.clinicaltrialrisk.org/.

Cite as:
  • Wood TA and McNair D. Clinical Trial Risk Tool: software application using natural language processing to identify the risk of trial uninformativeness. Gates Open Res 2023, 7:56 doi: 10.12688/gatesopenres.14416.1.

Other open source NLP projects

In addition to the externally funded projects above, we have developed some open source NLP tools aimed at Python developers.

Drug Named Entity Recognition

A lightweight Python library for recognising drug names in unstructured text and performing named entity linking to DrugBank IDs.

The Drug Named Entity Recognition package identifies drug names in English text, finds identifiers (MeSH, DrugBank, NHS) and can even return the molecular structure of a drug. It’s also available as a Google Sheets plugin.

Install with:

pip install drug-named-entity-recognition
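
A minimal usage sketch is below. We assume the find_drugs function from the project’s README, which takes a list of tokens; check the repository for the current API:

from drug_named_entity_recognition import find_drugs

# find_drugs takes a list of tokens and returns the drugs found,
# together with identifiers such as MeSH and DrugBank IDs.
tokens = "The patient was prescribed prednisone and ibuprofen".split(" ")
print(find_drugs(tokens))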

Local Spelling

Local Spelling is a Python library for localising spelling between US and UK variants.

Install with:

pip install localspelling
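
A hypothetical usage sketch is below; the function name convert_to_british is our assumption, so please consult the repository for the actual API:

from localspelling import convert_to_british  # function name is an assumption

# Convert US spellings to their UK equivalents,
# e.g. "analyzing" -> "analysing", "color" -> "colour".
print(convert_to_british("The team is analyzing the color data"))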

Country Named Entity Recognition

This is a lightweight Python library for recognising country names in unstructured text and returning pycountry objects. Tutorial here.

Install with:

pip install country_named_entity_recognition
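
A minimal usage sketch, assuming the find_countries function from the project’s README (check the repository for the current API):

from country_named_entity_recognition import find_countries

# Returns a list of (pycountry country object, regex match) pairs.
matches = find_countries("The trial recruited patients in France and Germany.")
for country, match in matches:
    print(country.name, country.alpha_2)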

Citations

Alisa Redding at the University of Helsinki used the tool for her Master’s thesis on mass species extinction and biodiversity.

Fast Stylometry

Fast Stylometry is a Python library for forensic stylometry, implementing the Burrows’ Delta method for identifying the likely author of a text from their writing style. Read tutorial.

Install with:

pip install faststylometry
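
A minimal sketch of the Burrows’ Delta workflow, following the library’s tutorial; the book texts here are placeholders, and in practice you need several thousand words per author for a meaningful result:

from faststylometry import Corpus, calculate_burrows_delta, tokenise_remove_pronouns_en

# Corpus of texts by known authors (placeholder strings).
train_corpus = Corpus()
train_corpus.add_book("Author A", "Book 1", "Text of a book by author A...")
train_corpus.add_book("Author B", "Book 2", "Text of a book by author B...")
train_corpus.tokenise(tokenise_remove_pronouns_en)

# Corpus containing the text of unknown authorship.
test_corpus = Corpus()
test_corpus.add_book("Unknown", "Mystery text", "Text of disputed authorship...")
test_corpus.tokenise(tokenise_remove_pronouns_en)

# Burrows' Delta: a low score means the unknown text is
# stylistically close to that training author.
print(calculate_burrows_delta(train_corpus, test_corpus))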

Open source NLP tools for general language processing tasks

There are many free and open source (FOSS) tools available for natural language processing purposes. Which one you choose for a project depends on the precise use case.

  • spaCy (MIT License): spaCy is a great all-round tool for NLP projects. You can use it for rule-based and pattern-based matching, training text classifiers, custom named entity recognition models, embeddings and transformers, and extracting grammatical relations. We are using spaCy inside our Clinical Trial Risk Tool.
  • Natural Language Toolkit (NLTK) (Apache 2.0 License): NLTK is a great platform for processing text with Python. It pre-dates neural networks, so it does a lot of “traditional” NLP such as tokenising, stemming, stopword removal and dictionary lookups. It also comes with corpora to import.
  • Sentence Transformers at HuggingFace (licence depends on the model): HuggingFace provides an easy interface to the sentence transformers models, allowing you to run a transformer language model on your own machine in a few steps. We are using HuggingFace Sentence Transformers as the backbone of the Harmony project (see above).
  • Scikit-Learn (BSD 3-Clause License): a fantastic all-round machine learning library in Python which has some really useful simple classifiers such as Naive Bayes, which can be built as part of a pipeline and serialised and deserialised.
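
As an example of the first entry above, a few lines of spaCy give you named entities and grammatical relations out of the box (this assumes you have downloaded the small English model with python -m spacy download en_core_web_sm):

import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Fast Data Science is a consultancy based in London.")

# Named entities and their types
for ent in doc.ents:
    print(ent.text, ent.label_)

# Grammatical (dependency) relations between words
for token in doc:
    print(token.text, token.dep_, token.head.text)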

Which open source NLP tool should I use?

Here’s an overview of when you might want to use the big tools mentioned (we are only covering Python tools in this article). You can really give your NLP project a head start by choosing the appropriate open source NLP tool as a foundation!

  • spaCy: a language learning app needing to find grammatical structure in sentences.
  • Scikit-Learn: a simple low-footprint text classifier, e.g. email triage on a serverless app, which doesn’t need to be very sophisticated. The text is in a single language, there is a small number of categories (e.g. fewer than 10), you may not have a huge amount of data, and the categories are easily separable by the presence or absence of key words, e.g. economics vs sport (as opposed to “longitudinal studies in psychology” vs “cohort studies in psychology”, which would be much harder to distinguish).
  • Sentence Transformers at HuggingFace: a sophisticated text classifier which needs to take into account the context of words in a sentence; a smart AI tool which psychologists can use to compare text data or find similar documents.
  • Natural Language Toolkit (NLTK): analysis of n-grams in a corpus and finding clusters in unstructured documents; note that for topic models such as Latent Dirichlet Allocation, NLTK is usually paired with a dedicated topic modelling library such as Gensim.
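
To illustrate the Scikit-Learn route, here is a minimal sketch of a Naive Bayes text classifier for an easily separable two-class problem such as economics vs sport; the training sentences are toy placeholders:

import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "the central bank raised interest rates",
    "inflation figures surprised economists",
    "the striker scored in the final minute",
    "the team won the championship match",
]
labels = ["economics", "economics", "sport", "sport"]

# TF-IDF features plus Naive Bayes: a simple, low-footprint classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

# The whole pipeline can be serialised and deserialised,
# e.g. for deployment on a serverless app.
blob = pickle.dumps(model)
restored = pickle.loads(blob)
print(restored.predict(["the goalkeeper made a great save"]))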

Open data and the FAIR data principles

Open data and the FAIR data principles are two important concepts in the data sharing and data management world. Open data refers to data that is freely available and accessible to the public. The FAIR data principles are a set of guidelines published in the journal Scientific Data in 2016, aiming to ensure data is Findable, Accessible, Interoperable, and Reusable.

  • Findability: Data should be easy for both humans and computers to find.
  • Accessibility: Data should be available to everyone who has a legitimate interest in using it.
  • Interoperability: Data should be able to be shared, combined, and compared with other data sets.
  • Reusability: Data should be easy to reuse and repurpose.

References

  1. Gerosa, Marco, et al. “The shifting sands of motivation: Revisiting what drives contributors in open source.” 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 2021.

What we can do for you

Transform Unstructured Data into Actionable Insights

Contact us