Drug named entity recognition Python library

Recognising drug names in unstructured English text with Python

We have open-sourced a Python library called Drug Named Entity Recognition for finding drug names in a string, for example “i bought some phenoxymethylpenicillin”. The library covers two NLP tasks: named entity recognition (finding drug names in text) and named entity linking (mapping drugs to IDs). It is intended for data mining, text mining and other applications of AI in pharma.

Please note that Drug Named Entity Recognition finds only high-confidence drug names. It does not find short code names or abbreviations commonly used in medicine, such as “Ceph” for “Cephradin”, as these are highly ambiguous.

Drug Named Entity Recognition is also available as a Google Sheets plugin

You can also recognise drug names and retrieve drug data within Google Sheets™. Install the free Google Sheets™ plugin and identify drug names in your documents. Or get in touch with Fast Data Science to discuss NLP problems.

What the Drug Named Entity Recognition Python library does

Drug Named Entity Recognition finds only the English names of drugs. Names in other languages are not supported.

You can install the Python library by typing in the command line:

pip install drug-named-entity-recognition

The source code is on GitHub and the project is on PyPI.

Are you interested in other kinds of named entity recognition (NER), such as diseases and medical conditions, financial entities, company names, countries, locations, proteins, genes, or molecules?

If your NER problem is common across industries and likely to have been seen before, there may be an off-the-shelf NER tool for your purposes, such as our Medical Named Entity Recognition Python library, or the Country Named Entity Recognition Python library. Dictionary-based named entity recognition is not always the solution, as sometimes the total set of entities is an open set and can’t be listed (e.g. personal names), so sometimes a bespoke trained NER model is the answer. For tasks like finding email addresses or phone numbers, regular expressions (simple rules) are sufficient for the job.
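As a rough illustration of the regular-expression case mentioned above, here is a minimal sketch for finding email addresses. The pattern and the function name find_emails are our own simplification for illustration, not part of any library mentioned here, and the pattern will not cover every valid address.

```python
import re

# Deliberately simple pattern: local part, "@", domain, dot, top-level domain.
# This is an illustrative assumption, not a full RFC-compliant matcher.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(text):
    """Return all substrings of `text` that look like email addresses."""
    return EMAIL_PATTERN.findall(text)


emails = find_emails("Contact jane.doe@example.com or sales@example.org.")
print(emails)
```

No dictionary or trained model is needed here because the shape of the entity is regular, which is what makes rules sufficient for this kind of task.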

If your named entity recognition or named entity linking problem is very niche and unusual, and a product exists for that problem, that product is likely to only solve your problem 80% of the way, and you will have more work trying to fix the final mile than if you had done the whole thing manually. Please contact Fast Data Science and we’ll be glad to discuss. For example, we’ve worked on a consultancy engagement to find molecule names in papers, and match author names to customers where the goal was to trace molecule samples ordered from a pharma company and identify when the samples resulted in a publication. For this case, there was no off-the-shelf library that we could use.

A problem like identifying country names in English involves a closed set with well-known variants and aliases, so an off-the-shelf library is usually available.
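To make the closed-set idea concrete, here is a toy sketch of dictionary-based recognition over a token list. The alias table, the codes and the function find_countries are hypothetical and exist only for this illustration; they do not show how our Country Named Entity Recognition library is implemented.

```python
# Hypothetical alias table: lower-cased alias -> country code.
COUNTRY_ALIASES = {
    "united kingdom": "GB",
    "uk": "GB",
    "great britain": "GB",
    "france": "FR",
    "germany": "DE",
}

# Longest alias length in tokens, so multi-word names are tried first.
MAX_ALIAS_TOKENS = max(len(alias.split()) for alias in COUNTRY_ALIASES)


def find_countries(tokens):
    """Return (code, start_index, end_index) for each alias found in tokens."""
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(MAX_ALIAS_TOKENS, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n]).lower()
            if candidate in COUNTRY_ALIASES:
                matches.append((COUNTRY_ALIASES[candidate], i, i + n - 1))
                i += n
                break
        else:
            # No alias starts at this token; move on.
            i += 1
    return matches


print(find_countries("She moved from France to the United Kingdom".split()))
```

Because every entity and alias can be listed in advance, a lookup like this goes a long way. It breaks down for open sets such as personal names, where no list can be complete.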

Identifying a set of molecules manufactured by a particular company, by contrast, is the kind of task more suited to a consulting engagement.

Usage examples

In your Python console, you can try the following:

Example 1

from drug_named_entity_recognition import find_drugs
find_drugs("i bought some Phenoxymethylpenicillin".split(" "))

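This outputs a list of tuples, each containing a dictionary of drug data followed by the start and end token indices of the match (both 3 here, as the drug name is the fourth token):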

[({'name': 'Phenoxymethylpenicillin',
'synonyms': {'Penicillin', 'Phenoxymethylpenicillin'},
'nhs_url': 'https://www.nhs.uk/medicines/phenoxymethylpenicillin',
'drugbank_id': 'DB00417'},
3,
3)]
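If you want to work with the result programmatically, the sketch below shows one way to unpack it. It uses a hard-coded copy of the output above rather than a live call, and it assumes that the second and third tuple elements are inclusive start and end token indices, which is how the example reads. The helper summarise_matches is a name we made up for this illustration.

```python
# Sample result copied from the example output above (not a live call).
results = [
    (
        {
            "name": "Phenoxymethylpenicillin",
            "synonyms": {"Penicillin", "Phenoxymethylpenicillin"},
            "nhs_url": "https://www.nhs.uk/medicines/phenoxymethylpenicillin",
            "drugbank_id": "DB00417",
        },
        3,
        3,
    )
]
tokens = "i bought some Phenoxymethylpenicillin".split(" ")


def summarise_matches(tokens, results):
    """Return (matched text, canonical name, DrugBank ID) for each match."""
    rows = []
    for drug, start, end in results:
        # Assumption: start and end are inclusive token indices.
        matched_text = " ".join(tokens[start:end + 1])
        rows.append((matched_text, drug["name"], drug.get("drugbank_id")))
    return rows


for row in summarise_matches(tokens, results):
    print(row)
```

Using drug.get for the ID keeps the loop safe for entries that lack a particular identifier, since the examples on this page show that the available keys vary between drugs.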

Example 2

You can ignore case with:

find_drugs("i bought some phenoxymethylpenicillin".split(" "),
    is_ignore_case=True)

Molecular structures

As of version 2.0, the tool can also retrieve molecular structures:

from drug_named_entity_recognition.drugs_finder import find_drugs
drugs = find_drugs("i bought some paracetamol".split(" "), is_include_structure=True)

This will return the atomic structure of the drug if that data is available.

>>> print (drugs[0][0]["structure_mol"])
316
  Mrv0541 02231214352D          

 11 11  0  0  0  0            999 V2000
    2.3645   -2.1409    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    3.7934    1.1591    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    2.3645    1.1591    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    2.3645    0.3341    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.0790   -0.0784    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.6500   -0.0784    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.0790   -0.9034    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.6500   -0.9034    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.3645   -1.3159    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.0790    1.5716    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.0790    2.3966    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  9  1  0  0  0  0
  2 10  2  0  0  0  0
  3  4  1  0  0  0  0
  3 10  1  0  0  0  0
  4  5  2  0  0  0  0
  4  6  1  0  0  0  0
  5  7  1  0  0  0  0
  6  8  2  0  0  0  0
  7  9  2  0  0  0  0
  8  9  1  0  0  0  0
 10 11  1  0  0  0  0
M  END
DB00316
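The structure is a plain-text block in the V2000 MOL layout, so you can read basic facts from it without extra tooling. The sketch below parses the counts line and tallies atoms by element, using the header and atom lines copied from the output above (the bond lines are left out for brevity). The function parse_mol_atoms is our own illustrative helper, not part of the library; for serious chemistry work a dedicated cheminformatics toolkit would be a better choice.

```python
from collections import Counter

# Header, counts line and atom block copied from the paracetamol output above.
# The bond block is omitted here because this sketch does not read it.
MOL_BLOCK = "\n".join([
    "316",
    "  Mrv0541 02231214352D",
    "",
    " 11 11  0  0  0  0            999 V2000",
    "    2.3645   -2.1409    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "    3.7934    1.1591    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "    2.3645    1.1591    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0",
    "    2.3645    0.3341    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    3.0790   -0.0784    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.6500   -0.0784    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    3.0790   -0.9034    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.6500   -0.9034    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    2.3645   -1.3159    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    3.0790    1.5716    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    3.0790    2.3966    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
])


def parse_mol_atoms(mol_block):
    """Return (atom count, bond count, Counter of element symbols)."""
    lines = mol_block.splitlines()
    # The counts line is the one that ends with the V2000 version tag.
    counts_index = next(
        i for i, line in enumerate(lines) if line.rstrip().endswith("V2000")
    )
    counts_line = lines[counts_index]
    # Atom and bond counts are fixed-width fields of three characters each.
    n_atoms = int(counts_line[0:3])
    n_bonds = int(counts_line[3:6])
    atom_lines = lines[counts_index + 1:counts_index + 1 + n_atoms]
    # On each atom line the element symbol follows the x, y, z coordinates.
    elements = Counter(line.split()[3] for line in atom_lines)
    return n_atoms, n_bonds, elements


n_atoms, n_bonds, elements = parse_mol_atoms(MOL_BLOCK)
print(n_atoms, n_bonds, dict(elements))
```

Hydrogen atoms are implicit in this block, so the tally covers the heavy atoms only.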

Fuzzy matching/spelling tolerance

You can find drugs even when their names are misspelled:

drugs = find_drugs("i bought some Monjaro".split(" "), is_include_structure=True, is_fuzzy_match=True)
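To give a feel for what spelling tolerance involves, here is a generic sketch using Python's standard difflib module. This is not a description of how the library's is_fuzzy_match option works internally; the vocabulary list, the cutoff value and the function closest_drug_name are assumptions made for this illustration.

```python
import difflib

# Tiny illustrative vocabulary; a real dictionary would be far larger.
KNOWN_DRUG_NAMES = ["mounjaro", "paracetamol", "sertraline"]


def closest_drug_name(token, cutoff=0.8):
    """Return the most similar known drug name, or None if nothing is close."""
    matches = difflib.get_close_matches(
        token.lower(), KNOWN_DRUG_NAMES, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


print(closest_drug_name("Monjaro"))
print(closest_drug_name("bought"))
```

The cutoff is the trade-off to watch: a lower value tolerates more typos but also lets ordinary words drift onto drug names, which is one reason fuzzy matching is best kept optional.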

Add and remove drugs (customise the drugs list)

You can modify the drug recogniser’s behaviour if there is a particular drug which it isn’t finding:

To reset the drugs dictionary

from drug_named_entity_recognition.drugs_finder import reset_drugs_data
reset_drugs_data()

To add a synonym

from drug_named_entity_recognition.drugs_finder import add_custom_drug_synonym
add_custom_drug_synonym("potato", "sertraline")

To add a new drug

from drug_named_entity_recognition.drugs_finder import add_custom_new_drug
add_custom_new_drug("potato", {"name": "solanum tuberosum"})

To remove an existing drug

from drug_named_entity_recognition.drugs_finder import remove_drug_synonym
remove_drug_synonym("sertraline")

Compatibility with other natural language processing libraries

The Drug Named Entity Recognition library is independent of other NLP tools and has no dependencies. You don’t need any advanced system requirements and the tool is lightweight. However, it combines well with other libraries such as spaCy or the Natural Language Toolkit (NLTK).

Using Drug Named Entity Recognition together with spaCy

Here is an example call to the tool with a spaCy Doc object:

from drug_named_entity_recognition import find_drugs
import spacy
nlp = spacy.blank("en")
doc = nlp("i routinely rx rimonabant and pts prefer it")
find_drugs([t.text for t in doc], is_ignore_case=True)

outputs:

[({'name': 'Rimonabant', 'synonyms': {'Acomplia', 'Rimonabant', 'Zimulti'}, 'mesh_id': 'D063387', 'drugbank_id': 'DB06155'}, 3, 3)]

Using Drug Named Entity Recognition together with NLTK

You can also use the tool together with the Natural Language Toolkit (NLTK):

from drug_named_entity_recognition import find_drugs
from nltk.tokenize import wordpunct_tokenize
tokens = wordpunct_tokenize("i routinely rx rimonabant and pts prefer it")
find_drugs(tokens, is_ignore_case=True)
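If you would rather not install a tokeniser, a small regular expression can separate words from punctuation using only the standard library. The sketch below is an assumption-labelled alternative, not something the library requires; the tokenize function is our own name, and the pattern is the same style of word-or-punctuation split that wordpunct_tokenize performs.

```python
import re

# Match either a run of word characters or a run of punctuation characters.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")


def tokenize(text):
    """Split text into word tokens and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


text = "i took rimonabant, then stopped."
print(text.split(" "))   # naive split keeps the comma attached to the drug name
print(tokenize(text))    # punctuation becomes separate tokens
```

The resulting list can be passed to find_drugs in the same way as the token lists in the examples above. Separating punctuation matters because a token such as "rimonabant," may not match a dictionary entry for the drug name.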

Data sources

The main data source is DrugBank, augmented by datasets from the NHS, MeSH, Medline Plus and Wikipedia.

Update the Drugbank dictionary

If you want to update the DrugBank dictionary, you can use the data dump from DrugBank and replace the file drugbank vocabulary.csv:

Download the open data dump from https://go.drugbank.com/releases/latest#open-data

Update the Wikipedia dictionary

If you want to update the Wikipedia dictionary, download the dump from Wikimedia and run

python extract_drug_names_and_synonyms_from_wikipedia_dump.py

Update the MeSH dictionary

If you want to update the MeSH dictionary, run

python download_mesh_dump_and_extract_drug_names_and_synonyms.py

If the link doesn’t work, download the open data dump manually from https://www.nlm.nih.gov/ (the file should be called something like desc2023.xml) and comment out the Wget/Curl commands in the code.

License information for external data sources

Data from DrugBank is licensed under CC0.

To the extent possible under law, the person who associated CC0 with the DrugBank Open Data has waived all copyright and related or neighboring rights to the DrugBank Open Data. This work is published from: Canada.

Text from the Wikipedia data dump is licensed under the GNU Free Documentation License and the Creative Commons Attribution-Share-Alike 3.0 License.

Raising issues

If you find a problem, you are welcome to raise an issue at https://github.com/fastdatascience/drug_named_entity_recognition/issues

Who worked on the Drug Named Entity Recognition library?

The tool was developed by:

Thomas Wood (Fast Data Science)

License of Drug Named Entity Recognition library

Citing the Drug Named Entity Recognition library

Wood, T.A., Drug Named Entity Recognition [Computer software], Version 1.0.3, accessed at https://fastdatascience.com/drug-named-entity-recognition-python-library, Fast Data Science Ltd (2024)

@unpublished{drugnamedentityrecognition,
    AUTHOR = {Wood, T.A.},
    TITLE  = {Drug Named Entity Recognition (Computer software), Version 1.0.3},
    YEAR   = {2024},
    Note   = {To appear},
    url = {https://zenodo.org/doi/10.5281/zenodo.10970631},
    doi = {10.5281/zenodo.10970631}
}

Case studies: Who is using the Drug Named Entity Recognition Library?

A number of people and organisations around the world have been using the library and have cited us.

NeuroTrialNER: An Annotated Corpus for Neurological Diseases and Therapies in Clinical Trial Registries

Simona E. Doneva at the University of Zurich and her colleagues used the tool to make an annotated corpus of neurological diseases.

Doneva, Simona, et al. NeuroTrialNER: An Annotated Corpus for Neurological Diseases and Therapies in Clinical Trial Registries. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024.