
Natural Language Processing and Text Analysis

Our main area of focus is natural language processing (NLP). The manager, Thomas Wood, completed a Masters in Computer Speech, Text and Internet Technology at Cambridge University in 2008 and has since worked exclusively in machine learning, mostly in NLP. In 2018, he founded Fast Data Science to deliver data science consultancy, focusing on NLP.

We have built NLP pipelines from scratch, and worked on natural language dialogue systems, document classifiers and text-based recommender systems. For these tasks we have used both traditional machine learning techniques and state-of-the-art approaches such as neural networks. We normally use Python for our NLP work.

NLP examples

Example applications of natural language processing include:

Natural language understanding - interpreting a human-written text and converting it to a structured form, for example, identifying the number of participants in a clinical trial from a plain text document.
Natural language dialogue systems, such as chatbots, and large language models (LLMs), such as GPT-4 or Google Bard.
Text analysis: analysing large corpora of documents.
Topic analysis, topic modelling, or clustering (such as finding common faults in factory error reports: given six months of error logs in plain English, what are the commonest defects that caused the factory to pause production?)
Document classification, such as email triage (is this incoming email high or low priority?), finding similar documents, and measuring semantic similarity with sentence embeddings.
Predictive models where the independent variable is text (for example, predicting the likelihood of a construction defect resulting in an escalation to a building inspection or litigation).
Document-based recommender systems (product recommendations, dating apps)
Named entity recognition, for example identifying drug names, product names, countries, or company names in text documents.
Unstructured data analysis
Document anonymisation and handling sensitive data
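
To make the first item in the list above more concrete, here is a minimal, rule-based sketch of pulling a participant count out of plain text. The pattern, the function name extract_participant_count and the sample sentence are all assumptions invented for illustration. A production system would need to handle far more phrasings and would more likely use a trained model than a single regular expression.

```python
import re

# Hypothetical pattern: a number (optionally with thousands separators)
# followed, within a few words, by a participant-like noun.
PARTICIPANT_PATTERN = re.compile(
    r"\b(\d{1,3}(?:,\d{3})+|\d+)\s+(?:\w+\s+){0,3}?"
    r"(?:participants|patients|subjects|volunteers)\b",
    re.IGNORECASE,
)


def extract_participant_count(text):
    """Return the first participant count found in the text, or None."""
    match = PARTICIPANT_PATTERN.search(text)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


# Invented example text, not taken from a real protocol.
protocol_text = (
    "This is a randomised, double-blind study. "
    "A total of 1,250 adult patients will be enrolled across 14 sites."
)
print(extract_participant_count(protocol_text))
```

The sketch returns the first match only, so a document that mentions several cohorts would need extra logic to decide which number is the overall enrolment figure.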

The power of NLP (interactive word2vec graph)

Below you can see a representation of some technical terms used in a dataset of clinical trial documents in 3D space.

Words with similar meanings and usages are close together. Words are colour-coded into clusters which correspond to groups such as diseases (cluster 3), verbs (clusters 1, 6 and 8), etc. If you move the mouse over a word, you can see that word’s cluster number, and the word’s nearest neighbours. A word’s nearest neighbours tend to be words with similar meaning or function, such as synonyms.

This is a demonstration of how natural language processing can be used to find synonyms and common topics in a completely new set of text documents, in a fully unsupervised fashion.
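
The nearest-neighbour lookup described above can be sketched with cosine similarity. The three-dimensional vectors below are made up for illustration and are not taken from the clinical trial dataset. Real embeddings are learned from a corpus and have many more dimensions, and cosine similarity is an assumption here, as the distance measure used for the graph is not stated.

```python
import math

# Hypothetical word vectors, invented for illustration only.
word_vectors = {
    "tumour": [0.9, 0.1, 0.0],
    "carcinoma": [0.8, 0.2, 0.1],
    "lymphoma": [0.85, 0.15, 0.05],
    "administer": [0.1, 0.9, 0.2],
    "receive": [0.0, 0.8, 0.3],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbours(word, vectors, k=2):
    """Return the k words whose vectors are most similar to the given word."""
    target = vectors[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:k]]


print(nearest_neighbours("tumour", word_vectors))
```

With these toy vectors the disease terms end up closest to one another and the verbs closest to one another, which mirrors the behaviour described for the interactive graph.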

The word vectors were calculated in 128 dimensions using word embeddings on Google Cloud Platform and reduced to three dimensions using t-SNE. The words were assigned to 15 clusters using the k-Means clustering algorithm.
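
The clustering step can be illustrated with a small, pure-Python version of k-means (Lloyd's algorithm). This is a sketch under stated assumptions, not the code used to build the graph. The points are invented, the t-SNE reduction is not shown, two clusters are used instead of 15, and the evenly spaced initialisation is a simplification chosen to keep the example deterministic.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=10):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    # Deterministic initialisation for the sketch: take k evenly spaced points.
    centroids = [list(points[i * len(points) // k]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its closest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


# Hypothetical 3-D points standing in for word vectors after dimension reduction.
points = [
    (0.0, 0.1, 0.0), (0.2, 0.0, 0.1), (0.1, 0.2, 0.0),
    (5.0, 5.1, 4.9), (5.2, 4.9, 5.0), (4.9, 5.0, 5.1),
]
print(k_means(points, k=2))
```

In practice a library implementation with a more careful initialisation would normally be preferred over a hand-written loop like this one.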

Fast Data Science - London

Need natural language processing?

Fast Data Science is a leading company in the natural language processing space - get in touch for an NLP consulting session.

NLP and unstructured data

Today many companies, particularly in industries such as healthcare, pharmaceuticals, legal, and insurance, hold large amounts of unstructured data. This is typically data in text format, which may even be unscanned documents, PDFs, HTML, or any other file type.

Unstructured data is very difficult to deal with but can contain a goldmine of information. Fast Data Science specialises in extracting value from organisations’ unstructured datasets.

What is NLP? Read more in our blog post on NLP.

Natural Language Processing applications in healthcare

AI and natural language processing are being increasingly adopted across the healthcare sector. This technology is sometimes called healthtech or MedTech. NLP is being used to compare and detect changes in clinical reports, extract clinical concepts such as MeSH terms from electronic medical records, and develop human-to-machine natural language dialogue systems to improve the healthcare experience.
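
As a very rough illustration of detecting changes between two versions of a report, the sketch below uses Python's standard difflib module. It is a plain line-by-line textual comparison, not a clinical NLP model, and the two reports and the function name changed_lines are invented for the example.

```python
import difflib

# Two hypothetical versions of a report, invented for illustration.
report_v1 = [
    "Patient presents with mild hypertension.",
    "No evidence of renal impairment.",
    "Continue current medication.",
]
report_v2 = [
    "Patient presents with moderate hypertension.",
    "No evidence of renal impairment.",
    "Increase dosage and review in four weeks.",
]


def changed_lines(old, new):
    """Return (removed, added) lines between two versions of a document."""
    removed, added = [], []
    for line in difflib.ndiff(old, new):
        if line.startswith("- "):
            removed.append(line[2:])   # present only in the old version
        elif line.startswith("+ "):
            added.append(line[2:])     # present only in the new version
    return removed, added


removed, added = changed_lines(report_v1, report_v2)
print("Removed:", removed)
print("Added:", added)
```

A comparison like this flags any wording change, so a real system would still need a way to tell clinically meaningful changes from cosmetic edits.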

We have worked on a number of projects in healthcare, including:

a model to predict the complexity of clinical trials from the trial protocol for Boehringer Ingelheim.
a desktop application to analyse researchers’ outputs, fields, collaborations and affiliations using PubMed search result exports.
a model to identify researchers who have used open sourced molecules in their published research without attribution, also for Boehringer Ingelheim.

Natural Language Processing technologies at Fast Data Science

We do a lot of natural language processing with Python. We have worked on a variety of NLP models, including:

Bag of words, tf*idf, cosine similarity (ideal for text classification models trained on sparse data or few documents)
NLP pipelines, lemmatisation, parsers, chunkers
Deep neural networks
- convolutional neural networks (CNNs) (text as well as images)
- RNN, LSTM, Transformer models, LLMs (large language models)
- Seq2seq, word2vec, doc2vec
- see a live demo of a CNN for author identification
Multilingual natural language processing, including NLP on under-resourced languages
Clustering: Latent Dirichlet Allocation and other unsupervised learning techniques
- This is useful for extracting topics from a set of unstructured documents, for example legal documents, survey responses, factory error reports, etc.
Search engines and search term recommenders
Forensic stylometry, or identifying the author of a document, and fake news detection
Sentiment analysis

Topic detection is an NLP technique that allows you to discover common themes in a set of unstructured documents.
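
Of the techniques listed above, the bag-of-words approach is the simplest to show in a few lines. The sketch below computes tf-idf weights and cosine similarity in plain Python. The three documents are invented, the tokeniser is a naive whitespace split, and the weighting is the basic unsmoothed form, so the numbers will differ from those of libraries that use other variants.

```python
import math
from collections import Counter

# Hypothetical mini-corpus, invented for illustration.
documents = [
    "the conveyor belt motor overheated and stopped",
    "motor overheated on the packaging line",
    "invoice for the quarterly software licence",
]


def tokenise(text):
    return text.lower().split()


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per document."""
    tokenised = [tokenise(d) for d in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


vectors = tf_idf_vectors(documents)
print(cosine_similarity(vectors[0], vectors[1]))  # the two fault reports
print(cosine_similarity(vectors[0], vectors[2]))  # fault report vs invoice
```

Because "the" occurs in every document its weight is zero, so the fault report and the invoice share no weighted terms, while the two fault reports overlap on "motor" and "overheated".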

Natural Language Processing in Python and R

We work with whichever frameworks and languages meet the client’s requirements, for example:

Google Colab
OpenAI API, including GPT-3.5, GPT-4, etc.
HuggingFace
TensorFlow
Keras
Python NLTK
R
Google Natural Language, AWS, Microsoft Azure and other third party natural language processing APIs

Examples of past Natural Language Processing projects

NLP projects we have worked on for major household names, multinationals and startups include:

a spoken dialogue system to control a smart home
an unsupervised text analysis program to analyse text descriptions of manufacturing defects for Boehringer Ingelheim
a model to classify jobseekers’ CVs into industries and salary bands for CV-Library.
analysis of survey responses for White Ribbon Alliance