NLP Scientist

NLP Scientist Explains:

Natural language processing is a subfield of linguistics, sitting on a crossroads with computer science, artificial intelligence and engineering. NLP has been an active field of scientific research since the 1950s. In 1950, Alan Turing published an article titled “Computing Machinery and Intelligence” which proposed what is now called the Turing test as a criterion of intelligence/, a task that involves the automated interpretation and generation of natural language (e.g. “We are searching in the database”). At that time, NLP was not yet seen as its own separate field of science within or separate from artificial intelligence, although now NLP is fully recognised as a science in its own right. NLP scientists maintain their own journals and conferences, and the field has come to prominence in the public eye since the high-profile release of Siri, and other NLP applications which the consumer interacts with on a daily basis.

NLP Scientists will experiment with a variety of model architectures and hyperparameters before choosing a model that performs the best on a given metric

Natural Language Processing Scientists

The scientists at Fast Data Science focus on natural language processing (NLP). The manager, Thomas Wood, studied a Masters in 2008 at Cambridge University in Computer Speech, Text and Internet Technology, and conducted his NLP science project on pleonastic pronouns using unsupervised learning. In other words, he focused on identifying which instances of the word “it” refer to a referent in the previous sentence. When “it” has no referent, such as “it’s raining”, it’s called a pleonastic pronoun. They exist in French and German too, but not in Spanish. Who would have thought that a two-letter word would cause so much trouble? Since then he has been working exclusively in machine learning and mostly in NLP, but he has expanded his horizons beyond the word “it”. In 2018 he founded Fast Data Science Ltd to deliver data science consultancy and NLP scientist expertise. Our team of highly-qualified NLP scientists has built NLP pipelines from scratch, and worked on natural language dialogue systems, document classifiers and text-based recommender systems. For these tasks, we have used both traditional machine learning techniques as well as the state of the art such as neural networks. Our NLP scientists normally use Python.

Fast Data Science - London

Need a business solution?

NLP, ML and data science leader since 2016 - get in touch for an NLP consulting session.

Areas within NLP

NLP scientists focus on a variety of sub-fields of NLP, including:

Natural language understanding
Natural language dialogue systems
Text analysis
Topic analysis – clustering
Document classification
Document-based recommender systems
Unstructured data analysis
Document anonymisation

NLP and unstructured data

Today many companies, in particular in certain industries such as healthcare, pharmaceuticals, legal, and insurance, have large amounts of unstructured data. This is typically data in text format, which may even be unscanned documents, PDFs, HTML, or any other file type.

Unstructured data is very difficult to deal with but can contain a goldmine of information. Fast Data Science specialises in extracting value from organisations’ unstructured datasets. If you have a large document set in your organisation, consider hiring a consultancy of NLP scientists such as Fast Data Science.

Natural Language Processing applications in healthcare

NLP is expanding across all industries, but nowhere is this more the case than in the healthcare sector. Healthtech and MedTech are hot areas of NLP scientific research.

NLP scientists are using NLP to compare and detect changes in clinical reports, extract clinical concepts such as MeSH terms from electronic medical records, and develop human-to-machine natural language dialogue systems to improve the healthcare experience. These NLP research breakthroughs are beginning to impact the sector. Check out some of our healthcare work in our portfolio.

We have worked on a number of NLP research projects in healthcare, including:

a model for the German pharmaceutical company Boehringer Ingelheim to predict the complexity of clinical trials from the trial protocol, a 200-page PDF.
a desktop application to analyse researchers’ outputs, fields, collaborations and affiliations using PubMed search result exports.
a model to identify researchers who have used open sourced molecules in their published research without attribution, also for Boehringer Ingelheim.

Natural Language Processing scientists at Fast Data Science

Our NLP scientists focus on any kind of NLP model. Some examples are:

Bag of words, tf*idf, cosine similarity. These are some of the simplest and best-known NLP models, which mainly ignore word order.
NLP pipelines, lemmatisation, parsers, chunkers. These models allow an NLP scientist to extract value from the order of words in a sentence.
Deep neural networks - the cutting-edge technology used by NLP scientists today
- convolutional neural networks (text as well as images)
- RNN, LSTM
- Seq2seq, word2vec, doc2vec
- see a live demo of a CNN for author identification
Clustering: Latent Dirichlet Allocation
- This is useful for extracting topics from a set of unstructured documents, for example, legal documents, survey responses, factory error reports, etc.
Search engines and search term recommenders
Google Natural Language, AWS, Microsoft Azure

Topic detection is an NLP scientist’s go-to technique that allows you to discover common themes in a set of unstructured documents.

Natural Language Processing in Python and R

We work with the following programming languages and frameworks:

TensorFlow
Keras
Python NLTK
R

Examples of past Natural Language Processing projects

NLP projects we have worked on for major household names include

a spoken dialogue system to control a smart home
an unsupervised text analysis program to analyse text descriptions of manufacturing defects (Boehringer Ingelheim)
a model to classify jobseekers’ CVs into industries and salary bands (CV-Library).
analysis of survey responses (White Ribbon Alliance)