NLP Data Scientist

An NLP Data Scientist Explains:

Natural language processing (NLP) sits at the crossroads of data science, linguistics, computer science, and artificial intelligence. It is the science of understanding and processing interactions between computers and human language. Today most data scientists operate within the broader area of machine learning, and NLP can be seen as a speciality within data science, whereas in the past NLP was often seen as a subfield of linguistics and referred to as ‘computational linguistics’.

Fast Data Science offers bespoke NLP data science consulting. We can provide a one-off NLP consultation, or even an NLP data scientist on retainer. Please get in contact today to discuss your NLP data science needs.

An NLP data scientist today will often work within or alongside a team of generalist data scientists in a company, who will handle the day-to-day non-text data science problems that occur. Whereas a generalist data scientist will apply machine learning techniques to numerical data, NLP data scientists will also handle data in text format. This adds an additional layer of complexity and means that NLP data scientists are more and more in demand.

For example, a pharma company may need a data scientist to mine in-house text data to further understand the next generation of drugs and medicines, or understand medical reports.

When Alan Turing published his ground-breaking article “Computing Machinery and Intelligence” in 1950, proposing what is now called the Turing test as a criterion of intelligence, NLP was not yet seen as a distinct field of science, whether within or separate from artificial intelligence. Today, NLP is fully recognised as a science in its own right, and in many industries NLP data scientists are an essential part of a company.

An NLP data scientist follows a similar scientific procedure to a generalist data scientist, experimenting with model architectures and hyperparameters before choosing a final NLP model.

Natural Language Processing Data Scientist

Does your company have a large amount of unstructured data, such as unorganised documents? Consider hiring an NLP data scientist to help you extract value from it. Fast Data Science is a data science consultancy offering NLP consulting services. We have a number of data scientists in our team, and our main focus is natural language processing (NLP).

The manager, Thomas Wood, completed a Masters at Cambridge University in 2008 in Computer Speech, Text and Internet Technology, an area of NLP, and conducted his research project on pleonastic pronouns using unsupervised learning. Since completing his postgraduate studies he has worked exclusively in data science, maintaining a constant focus on NLP, although he has occasionally worked in computer vision and other areas of data science, including a stint consulting for Tesco, predicting customer purchases. The numerical techniques he has learnt in other disciplines of data science have been incredibly useful in NLP. For example, convolutional neural networks were designed to process image data, but have also found a niche in text classifiers and music recommendation systems.

Thomas Wood founded Fast Data Science Ltd in 2018 to deliver data science consultancy focussing on natural language processing problems in large organisations that deal with lots of text data, in sectors such as healthcare, pharma, insurance and legal.

A good NLP data scientist is able to perform generalist non-NLP work, such as building a product recommendation system, as well as handle text data. Our team of NLP data scientists has built NLP pipelines from scratch. We have worked on natural language dialogue systems, document classifiers and text-based recommender systems. We use both traditional data science techniques and the state-of-the-art NLP data science toolkit, which includes neural networks.

Python is the tool of choice for an NLP data scientist, due to its abundance of NLP and deep learning libraries, although any language can be used in principle.

Fast Data Science - London

Need a business solution?

NLP, ML and data science leader since 2016 - get in touch for an NLP consulting session.

Our areas of focus within NLP

NLP sits within data science as a discipline, and we focus on the following areas:

  • Natural language dialogue systems, such as Siri, or systems built on modern cloud-based platforms such as Microsoft’s LUIS, Amazon Lex or Google’s Dialogflow.
  • Text analysis
  • Natural language understanding (NLU)
  • Document anonymisation
  • Clustering and topic analysis of unstructured documents
  • Document classification - how to classify a clinical trial protocol as chemotherapy vs radiotherapy, for example?
  • Document-based recommender systems, such as a CV recommender
  • Unstructured data analysis
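
To make the document classification item above more concrete, here is a minimal sketch of a text classifier. It is an illustration only, not the method used on any particular project: it uses a multinomial Naive Bayes model with add-one smoothing, written with the Python standard library. The class name `NaiveBayesClassifier`, the training snippets and the two labels are all invented for this example, and a real clinical trial protocol would be far longer than one sentence.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter(labels)        # label -> number of documents
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        # Words never seen in training carry no evidence, so they are skipped.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        n_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            total = sum(counts.values())
            # Start from the log prior, then add one log likelihood per token.
            score = math.log(self.doc_counts[label] / n_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / (total + len(self.vocab)))
            scores[label] = score
        return max(scores, key=scores.get)


# Invented one-sentence "protocols", for illustration only.
train_texts = [
    "Patients receive six cycles of intravenous cisplatin and docetaxel",
    "Oral capecitabine is administered twice daily for fourteen days",
    "Patients receive external beam radiation in daily fractions",
    "A total dose of sixty gray is delivered by linear accelerator",
]
train_labels = ["chemotherapy", "chemotherapy", "radiotherapy", "radiotherapy"]

classifier = NaiveBayesClassifier().fit(train_texts, train_labels)
print(classifier.predict("Participants receive cisplatin in three weekly cycles"))
print(classifier.predict("Radiation is delivered in daily fractions by linear accelerator"))
```

A simple model like this is mainly useful as a baseline: it trains instantly and shows how much of the task can be solved from word counts alone before moving to a neural model.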

NLP and unstructured data

A common problem faced by large organisations in many industries today is the abundance of unstructured data. In fact, the vast majority of data in a company could be unstructured. Vanilla machine learning is only able to extract value from the structured remainder, which is just the tiny tip of the iceberg.

NLP data scientists are able to tap value from the uncharted 90% of unstructured data that could be floating around a company.

Companies in industries such as healthcare, pharmaceuticals, legal and insurance typically have large amounts of unstructured data in text format. This could take the form of scanned documents, PDFs, HTML, or any other file type, and could be a veritable goldmine of information for an NLP data scientist. At Fast Data Science we specialise in extracting value from organisations’ unstructured datasets. If you think your organisation’s unstructured dataset could benefit from an NLP data scientist, please get in touch with us.

Natural Language Processing applications in healthcare

In recent years we have seen natural language processing take off and impact more and more industries. NLP is beginning to revolutionise healthcare in particular.

Two of the hottest application areas for NLP research are HealthTech and MedTech. NLP data scientists are using NLP to compare and detect changes in clinical reports, evaluate clinical trial protocols, identify molecule names from scientific literature, and extract clinical concepts such as MeSH terms from electronic medical records.

These NLP research breakthroughs are beginning to impact the sector. Check out some of our work in healthcare NLP in our portfolio.

Our NLP data scientists have delivered a number of fascinating data science projects in the healthcare sector.

What our NLP Data Scientists do

Our NLP data scientists are used to developing any kind of NLP model, for example:

  • Simple vanilla models, such as bag of words, tf-idf and cosine similarity. These often serve to provide a baseline performance before progressing to more advanced models.
  • Slightly more sophisticated models, taking word order into account, such as NLP pipelines, lemmatisation, parsers, chunkers.
  • Cutting-edge models such as deep neural networks
    • convolutional neural networks (CNNs; text as well as images)
    • RNN, LSTM
    • BERT, ELMo
    • Seq2seq, word2vec, doc2vec
  • See a live demo of a CNN for author identification
  • Clustering and unsupervised techniques
    • Latent Dirichlet Allocation - LDA is useful for extracting topics from a set of unstructured documents, for example, legal documents, survey responses, factory error reports, etc, where there is just an abundance of documents but no accompanying structured data or labels which could make the NLP task easier.
  • Search engines and search term recommendation systems
  • Cloud NLP services such as Google Natural Language, AWS and Microsoft Azure
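
As a rough illustration of the simple baseline models in the list above, the following sketch builds bag-of-words vectors, weights them with tf-idf and compares documents by cosine similarity. It uses only the Python standard library; the function names `tokenize`, `tfidf_vectors` and `cosine_similarity` and the three example documents are invented for this sketch, and the smoothed idf formula is one common variant among several.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(documents):
    """Return one sparse tf-idf vector (dict of term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # the bag of words for this document
        # Smoothed idf, so terms found in every document keep a small weight.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented toy documents, for illustration only.
documents = [
    "The pump failed because of a leaking valve",
    "A leaking valve caused the pump to stop",
    "The survey respondents asked for better maternity care",
]
vectors = tfidf_vectors(documents)
sim_related = cosine_similarity(vectors[0], vectors[1])
sim_unrelated = cosine_similarity(vectors[0], vectors[2])
print(f"related: {sim_related:.2f}, unrelated: {sim_unrelated:.2f}")
```

The first two documents share most of their content words, so their similarity comes out higher than that of the first and third, which only share the word "the". Note that this representation ignores word order entirely, which is why it is treated as a baseline.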

Topic detection is a technique used by NLP data scientists to explore and discover common themes in a set of unstructured documents, such as factory error reports.
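
To give a feel for how topic detection with Latent Dirichlet Allocation works, here is a small sketch that fits LDA by collapsed Gibbs sampling using only the Python standard library. It is a teaching sketch under stated assumptions rather than a production implementation: the function `fit_lda`, its default hyperparameters, the stopword list and the six factory error reports are all invented for this example, and in practice a library implementation would normally be used instead.

```python
import random
import re
from collections import Counter

# A tiny hand-picked stopword list; a real project would use a fuller one.
STOPWORDS = {"a", "the", "of", "on", "in", "and", "was", "from", "after"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def fit_lda(docs, n_topics=2, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling on a list of token lists.

    Returns (topic_word, doc_topic): per-topic word counts and
    per-document topic counts after the final sampling sweep.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]          # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(n_topics)]   # count of word w in topic k
    topic_total = [0] * n_topics                        # total tokens in topic k
    assignments = []

    # Start by assigning every token to a random topic.
    for d, doc in enumerate(docs):
        topics = []
        for w in doc:
            k = rng.randrange(n_topics)
            topics.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(topics)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic in proportion to how much the document
                # likes each topic and how much each topic likes this word.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return topic_word, doc_topic


# Invented factory error reports, for illustration only.
reports = [
    "Oil leak from the hydraulic pump seal",
    "Hydraulic pump seal worn and leaking oil",
    "Oil leaking after pump seal replaced",
    "Label printer jammed on the packaging line",
    "Packaging line stopped after label printer jam",
    "Printer misprinted the label on packaging",
]
docs = [tokenize(report) for report in reports]
topic_word, doc_topic = fit_lda(docs)
for k, counts in enumerate(topic_word):
    top_words = [w for w, c in counts.most_common(4) if c > 0]
    print(f"topic {k}: {top_words}")
```

No labels are supplied at any point: the topics emerge purely from which words tend to occur together in the same documents. On a corpus this small the result depends on the random seed and the hyperparameters, so the printed topics should be read as a demonstration of the mechanics rather than a meaningful analysis.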

Natural Language Processing Data Science

Our data scientists primarily use the following technologies:

  • TensorFlow - a deep learning framework best known for neural networks
  • spaCy - a Python library for building NLP pipelines quickly, including with deep learning models
  • scikit-learn - a general-purpose machine learning library for Python
  • Keras - a user-friendly high-level API for TensorFlow
  • NLTK - the Natural Language Toolkit for Python
  • R

Some of our past NLP projects

Our NLP data scientists have worked on a number of large NLP projects for household names, including:

  • a spoken dialogue system for controlling and operating a smart home (for example, “turn on the light in the bathroom when I get home on Tuesday”)
  • an unsupervised NLP model which analysed and clustered text descriptions of manufacturing defects (Boehringer Ingelheim)
  • a model to classify jobseekers’ résumés into industries and salary bands (CV-Library)
  • an analysis of survey responses with an interactive online dashboard (White Ribbon Alliance)

Please check out our portfolio of case studies, or look at the list of past clients from the top menu, for more information.

What we can do for you

Transform Unstructured Data into Actionable Insights

Contact us