Natural Language Processing Data Scientist

Does your company have a large amount of unstructured data, such as unorganised documents? Consider hiring an NLP data scientist to help you extract value from it. Fast Data Science is a data science consultancy offering NLP consulting services.

At Fast Data Science we have a number of data scientists in our team, and our main focus is natural language processing (NLP). The manager, Thomas Wood, studied for a Master's degree in Computer Speech, Text and Internet Technology, an area of NLP, at Cambridge University in 2008, and conducted his research project on pleonastic pronouns using unsupervised learning. Since completing his postgraduate studies he has worked exclusively in data science with a constant focus on NLP, although he has occasionally worked in computer vision and other areas of data science, including a stint consulting for Tesco on predicting customer purchases. The numerical techniques he has learnt in other disciplines of data science have proved very useful in NLP. For example, convolutional neural networks were designed to process image data, but have also found a niche in text classifiers and music recommendation systems.

Thomas Wood founded Fast Data Science Ltd in 2018 to deliver data science consultancy focussing on natural language processing problems in large organisations that deal with lots of text data, such as healthcare, pharma, insurance and legal.

A good NLP data scientist is able to perform generalist non-NLP work, such as building a product recommendation system, as well as handle text data. Our team of NLP data scientists has built NLP pipelines from scratch. We have worked on natural language dialogue systems, document classifiers and text-based recommender systems. We use both traditional data science techniques and the state-of-the-art NLP toolkit, which includes neural networks. Python is the tool of choice for an NLP data scientist, due to its abundance of NLP and deep learning libraries, although any language can be used in principle.

Our areas of focus within NLP

NLP sits within data science as a discipline, and we focus on the following areas:

  • Natural language dialogue systems similar to Siri, including those built on modern cloud-based platforms such as Microsoft’s LUIS, Amazon Lex or Google’s Dialogflow
  • Text analysis
  • Natural language understanding (NLU)
  • Document anonymisation
  • Clustering and topic analysis of unstructured documents
  • Document classification – for example, classifying a clinical trial protocol as chemotherapy or radiotherapy
  • Document-based recommender systems, such as a CV recommender
  • Unstructured data analysis

NLP and unstructured data

A common problem faced by large organisations in many industries today is the abundance of unstructured data. In fact, the vast majority of data in a company could be unstructured. Vanilla machine learning is only able to extract value from the structured minority, which is just the tip of the iceberg.

NLP data scientists are able to tap value from the uncharted 90% of unstructured data that could be floating around a company.

Companies in industries such as healthcare, pharmaceuticals, legal and insurance typically have large amounts of unstructured data in text format. This data could take the form of unscanned documents, PDFs, HTML, or any other file type, and could be a veritable goldmine of information for an NLP data scientist.

At Fast Data Science we specialise in extracting value from organisations’ unstructured datasets. If you think your organisation’s unstructured dataset could benefit from an NLP data scientist, please get in touch with us.

Natural Language Processing applications in healthcare

In recent years we have seen natural language processing take off and impact more and more industries, and it is beginning to revolutionise healthcare in particular. Two of the hottest areas of NLP research are HealthTech and MedTech. NLP data scientists are using NLP to compare and detect changes in clinical reports, evaluate clinical trial protocols, identify molecule names in scientific literature, and extract clinical concepts such as MeSH terms from electronic medical records. These research breakthroughs are now starting to reach the sector in practice. Check out some of our work in healthcare NLP in our portfolio.

Our NLP data scientists have delivered a number of fascinating data science projects in the healthcare sector.

What our NLP Data Scientists do

Our NLP data scientists are used to developing any kind of NLP model, for example:

  • Simple vanilla models, such as bag of words, tf-idf and cosine similarity. These often serve to provide a baseline performance before progressing to more advanced models.
  • Slightly more sophisticated models which take word order into account, built from NLP pipeline components such as lemmatisers, parsers and chunkers.
  • Cutting-edge models such as deep neural networks
  • Clustering and unsupervised techniques
    • Latent Dirichlet Allocation (LDA), which is useful for extracting topics from a set of unstructured documents such as legal documents, survey responses or factory error reports, where there is an abundance of documents but no accompanying structured data or labels to make the NLP task easier.
  • Search engines and search term recommendation systems
  • Cloud NLP services such as Google Natural Language, AWS and Microsoft Azure
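
As a rough illustration of the simple baseline models in the list above, the following Python sketch builds bag-of-words counts, weights them with tf-idf and compares documents using cosine similarity. It uses only the standard library. The function names, the smoothed idf formula and the three toy factory error reports are assumptions made for this example and are not taken from any client project.

```python
import math
from collections import Counter


def tokenize(text):
    """Very naive tokeniser: lower-case and split on whitespace."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (a dict of term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency (one common variant, chosen here
    # as an assumption); terms present in every document keep a weight > 0.
    idf = {
        term: math.log((1 + n_docs) / (1 + count)) + 1.0
        for term, count in doc_freq.items()
    }
    vectors = []
    for tokens in tokenized:
        term_counts = Counter(tokens)  # the bag of words for this document
        vectors.append({t: c * idf[t] for t, c in term_counts.items()})
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical factory error reports, invented for illustration only.
docs = [
    "conveyor belt motor overheated and stopped",
    "motor on conveyor belt overheated again",
    "label printer ran out of ink",
]
vectors = tfidf_vectors(docs)
sim_01 = cosine_similarity(vectors[0], vectors[1])
sim_02 = cosine_similarity(vectors[0], vectors[2])
print(f"report 0 vs report 1: {sim_01:.3f}")
print(f"report 0 vs report 2: {sim_02:.3f}")
```

In this sketch the first two reports share several terms and so score higher than the third report, which has no words in common with the first and scores zero. A baseline like this gives a reference point against which more advanced models can be judged, and in practice a library such as scikit-learn would usually replace the hand-written functions.
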

[Figure: Natural Language Processing word cloud]
Topic detection is a technique used by NLP data scientists to explore and discover common themes in a set of unstructured documents such as factory error reports.
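
To give a feel for how topic detection with Latent Dirichlet Allocation can work, here is a minimal sketch of a collapsed Gibbs sampler in plain Python. It is an illustration under stated assumptions rather than a production implementation: the function name `lda_gibbs`, the hyperparameter values and the toy error reports are all hypothetical, and a real project would normally rely on a tested library implementation.

```python
import random
from collections import Counter


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA to tokenised documents with collapsed Gibbs sampling.

    Returns (theta, top_words): theta[d][k] is the estimated share of
    topic k in document d; top_words[k] lists frequent words of topic k.
    """
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})

    doc_topic = [[0] * n_topics for _ in docs]         # topic counts per document
    topic_word = [Counter() for _ in range(n_topics)]  # word counts per topic
    topic_total = [0] * n_topics                       # total words per topic
    assignments = []                                   # topic of every token

    # Start from a random topic assignment for each token.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            k = rng.randrange(n_topics)
            doc_assignments.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word] += 1
            topic_total[k] += 1
        assignments.append(doc_assignments)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][word] -= 1
                topic_total[k] -= 1
                # Resample in proportion to how much the document likes each
                # topic times how much each topic likes this word.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][word] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][word] += 1
                topic_total[k] += 1

    theta = [
        [(count + alpha) / (len(doc) + n_topics * alpha) for count in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    top_words = [
        [w for w, c in counts.most_common() if c > 0][:3]
        for counts in topic_word
    ]
    return theta, top_words


# Hypothetical, pre-tokenised factory error reports (invented for this sketch).
reports = [
    ["motor", "overheated", "belt", "stopped"],
    ["belt", "motor", "overheated", "restart"],
    ["printer", "ink", "label", "empty"],
    ["label", "printer", "jam", "ink"],
]
theta, top_words = lda_gibbs(reports, n_topics=2)
for k, words in enumerate(top_words):
    print(f"topic {k}: {words}")
```

Each row of `theta` is a probability distribution over topics for one document, so documents can then be grouped or explored by their dominant topic. No labels are needed, which is what makes this family of techniques suitable for collections where only the raw text is available.
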

Natural Language Processing Data Science

Our data scientists primarily use the following technologies:

  • TensorFlow – deep learning framework best known for neural networks
  • spaCy – a simple Python library allowing quick modelling with deep learning
  • scikit-learn
  • Keras – a user-friendly wrapper for TensorFlow
  • NLTK – the Natural Language Toolkit for Python
  • R

Some of our past NLP projects

Our NLP data scientists have worked on a number of large NLP projects for household names, including:

  • a spoken dialogue system for controlling and operating a smart home (“turn on the light in the bathroom when I get home on Tuesday”, for example)
  • an unsupervised NLP model which analysed and clustered text descriptions of manufacturing defects (Boehringer Ingelheim)
  • a model to classify jobseekers’ résumés into industries and salary bands (CV-Library)
  • analysis of survey responses and an interactive online dashboard (White Ribbon Alliance)

Please check out our portfolio of case studies, or look at the list of past clients from the top menu, for more information.
