Your Guide to Natural Language Processing (NLP)

A word cloud generated from this article.

With the rise of artificial intelligence, automation is becoming a part of everyday life. Natural Language Processing (NLP) has proven to be a key part of this breakthrough. Natural Language Processing bridges the gap between computers, AI, and computational linguistics.

A simple sentence spoken by humans consists of different tones, words, meanings, and values. Expert AI systems are able to leverage these hidden structures and meanings to understand human behaviour. However, deciding which of several possible meanings is correct often requires a close and detailed reading, and when we have a large amount of text data, reading it all that carefully quickly becomes impossible.

Raw text data in English or other languages is an example of unstructured data. This kind of data does not fit into a relational database and is hard to interpret with computer programs. Natural Language Processing is a sub-field of AI which deals with how computers interpret, comprehend, and manipulate human language.

Different approaches to Natural Language Processing

NLP is much more than speech and text analysis. Depending on what needs to be done, there can be different approaches. There are three primary approaches:

Statistical Approach: The statistical approach to natural language processing depends on finding patterns in large volumes of text. By recognising these trends, the system can develop its own understanding of human language. Some of the more cutting edge examples of statistical NLP include deep learning and neural networks.

Symbolic Approach: The symbolic approach towards NLP is more about human-developed rules. A programmer writes a set of grammar rules to define how the system should behave.

Connectionist Approach: The third approach is a combination of statistical and symbolic approaches. We start with a symbolic approach and strengthen it with statistical rules.

Now that you know the various approaches used in NLP, let’s see how language is interpreted in a way that a machine can understand.

How is Language Interpreted?

Language interpretation can be divided into multiple levels. Each level allows the machine to extract information at a higher level of complexity.

Morphological Level: Within a word, a structure called the morpheme is the smallest unit of meaning. The word ‘unlockable’ is made of three morphemes: un+lock+able. Similarly, ‘happily’ is made of two: happy+ly. Morphological analysis involves identifying the morphemes within a word in order to get at the meaning.

A Chinese word, an English word, and a Turkish word. Some languages, such as Mandarin, have one or two morphemes per word, and others, such as Turkish, can have many morphemes per word. English is somewhere in the middle. The example shown of ‘unlockable’ can be analysed as either un+lockable or unlock+able, which illustrates the inherent ambiguity of many of the analyses we run in NLP.

Lexical Level: The next level of analysis involves looking at words as a whole.

Syntactic Level: Taking a step forward, syntactic analysis focuses on the structure of a sentence: how words interact with each other.

A parse tree is one way we can represent the syntax of a sentence.

Semantic Level: Semantic analysis deals with how we can convert the sentence structure into an interpretation of meaning.

Discourse Level: Here we are dealing with the connection between two sentences. This is incredibly tricky: if somebody says ‘he’, how do we know who they are referring to from a past sentence, if multiple persons were mentioned?

Now that we have a clear understanding of Natural Language Processing, here are some common examples of NLP:

Applications of NLP

Social Media Monitoring

Out-of-control social media can be damaging for a brand.

One of the best examples of Natural Language Processing is social media monitoring. Negative publicity is not good for a brand and a good way to know what your customers think is by keeping an eye on social media.

Platforms like Buffer and Hootsuite use NLP technology to track comments and posts about a brand. NLP helps alert companies when a negative tweet or mention goes live so that they can address a customer service problem before it becomes a disaster.

Sentiment analysis

While standard social media monitoring deals with written texts, with sentiment analysis techniques we can take a deeper look at the emotions of the user.

The user’s choice of words gives a hint as to how the user was feeling when they wrote the post. For example, if they use words such as happy, good, and praise, then it indicates a positive feeling. However, sentiment analysis is far from straightforward: it can be thrown by sarcasm, double entendres, and complex sentence structure, so a good sentiment analysis algorithm should take sentence structure into account.
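As a rough illustration of scoring sentiment from word choice, and of why that is fragile, here is a minimal word-counting sketch in Python. The word lists and the `naive_sentiment` function are made up for this example and are not taken from any real library.

```python
import re

# Tiny hand-made lexicons, assumed purely for illustration.
POSITIVE = {"happy", "good", "praise", "great", "love"}
NEGATIVE = {"sad", "bad", "terrible", "hate", "awful"}

def naive_sentiment(text: str) -> int:
    """Return (#positive words - #negative words); above zero suggests a positive post."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(naive_sentiment("I am happy with the good service"))
# Sarcasm fools a pure word counter: "great" still counts as positive here.
print(naive_sentiment("Oh great, my order is late again"))
```

The second sentence is clearly a complaint, yet it gets a positive score, which is exactly the kind of case where sentence structure and context have to be taken into account.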

Companies often use sentiment analysis to observe customers’ reactions towards their brands whenever something new is implemented.

Text analysis

Simple texts can hold deep meanings and can point towards multiple subcategories. Mentions of locations, dates, people, and companies can provide valuable data. The most powerful models are often very industry-specific and developed by companies with large amounts of data in their domain. A machine learning model can be trained to predict the salary of a job from its description, or the risk level of a house or marine vessel from a safety inspection report.

One cool application is forensic stylometry, which is the science of determining the author of a document based on the writing style. I’ve trained a simple forensic stylometry model to identify which of three famous authors is likely to have written a text. You can try it out here.

Forensic stylometry is an NLP technique that allows us to play detective to identify the author of a ghostwritten novel, anonymous letter, or ransom note. Image of Sherlock Holmes is in the public domain.

Healthcare and NLP

Natural Language Processing has also been helping with disease diagnosis, care delivery, and bringing down the overall cost of healthcare. NLP can help doctors to analyse electronic health records, and even begin to predict disease progression based on the large amount of text data detailing an individual’s medical history.

Amazon Comprehend Medical is using NLP to gather data on disease conditions, medication, and outcomes from clinical trials. Such ventures can help in the early detection of disease. Right now, it is being used for several health conditions including cardiovascular disease, schizophrenia, and even anxiety.

Cognitive Assistant

IBM has recently developed a cognitive assistant that acts like a personal search engine. It knows detailed information about a person and then, when prompted, provides the information to the user. This is a positive step toward helping people with memory problems.

We have seen how helpful Natural Language Processing can be. How does it work in detail? Here we give a brief look into the algorithms.

Some basic approaches to NLP

Bag of Words

Bag of Words is the simplest model in NLP. When we run a bag of words analysis, we disregard the word order, grammar, and semantics. We simply count all the words in a document and feed these numbers into a machine learning algorithm.

For example, if we are building a model to classify news articles into either sport or finance, we might calculate the bag of words score for two articles as follows:

Word       Article A   Article B
interest   3           1
goal       1           2
football   0           2
bank       2           0
Example bag of words score for two articles which we want to assign to sport or finance.

Looking at the above example, it should be easy to identify which article belongs to which category.

The main drawback of the bag of words method is that we are throwing away a lot of useful information which is contained in the word order. For this reason, bag of words is not widely used in production systems.
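To make the counting step and its drawback concrete, here is a minimal sketch using Python's standard `collections.Counter`. The `bag_of_words` helper and the two example sentences are my own illustration.

```python
from collections import Counter
import re

def bag_of_words(text: str) -> Counter:
    """Count lower-cased words, ignoring order and punctuation."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

a = bag_of_words("The dog bites the man")
b = bag_of_words("The man bites the dog")
print(a)
# Both sentences produce identical counts, so a model fed only these
# numbers cannot tell who bit whom.
print(a == b)
```

In a real pipeline these counts would be turned into a numeric vector per document before being passed to a classifier.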

Tokenisation

Tokenisation is often the first stage of an NLP model. A document is split up into pieces to make them easier to handle. Often, each word is a token, but this is not always the case, and tokenisation has to know not to separate phone numbers, email addresses, and the like.

For example, below are the tokens for the example sentence “When will you leave for England?”

When | will | you | leave | for | England?

This tokenisation example seems simple because the sentence could be split on spaces. However, not all languages use the same rules to divide words. For many East Asian languages such as Chinese, tokenisation is very difficult because no spaces are used between words, and it’s hard to find where one word ends and the next word starts. German can also be difficult to tokenise because of compound words which can be written separately or together depending on their function in a sentence.
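For English, a rough tokeniser can be sketched with a regular expression. The pattern below is an assumption of mine, not a standard: it tries e-mail addresses and phone-like numbers first so that they are not broken apart, then falls back to words and single punctuation marks.

```python
import re

# Order matters: try e-mail addresses and phone-like numbers before plain words.
TOKEN_PATTERN = re.compile(
    r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"   # e-mail addresses
    r"|\+?\d[\d\s-]{5,}\d"            # phone-like digit runs (very rough)
    r"|\w+(?:'\w+)?"                  # words, with an optional apostrophe part
    r"|[^\w\s]"                       # any single punctuation mark
)

def tokenise(text: str) -> list[str]:
    return TOKEN_PATTERN.findall(text)

# Splitting on spaces alone leaves the question mark glued to the last word.
print("When will you leave for England?".split())
print(tokenise("When will you leave for England?"))
print(tokenise("Email jo@example.com or call 020 7946 0018."))
```

A pattern like this only works for languages that separate words with spaces. For Chinese, a dictionary-based or learned segmenter would be needed instead.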

Tokenisation is also not effective for some expressions like “New York”. Both “New” and “York” have meanings of their own, so treating them as two separate tokens can be confusing. For this reason, tokenisation is often followed by a stage called chunking where we re-join multi-word expressions that were split by a tokeniser.
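The re-joining step can be sketched as a greedy merge over adjacent tokens. The `MULTIWORD` set here is a hypothetical hand-written list. A real system would more likely use a gazetteer or a trained model.

```python
# Hypothetical list of multi-word expressions, assumed for illustration.
MULTIWORD = {("New", "York"), ("San", "Francisco"), ("ice", "cream")}

def chunk(tokens: list[str]) -> list[str]:
    """Greedily merge adjacent token pairs that form a known expression."""
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in MULTIWORD:
            out.append(" ".join(pair))
            i += 2          # skip both halves of the merged expression
        else:
            out.append(tokens[i])
            i += 1
    return out

print(chunk(["I", "moved", "to", "New", "York", "last", "year"]))
```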

Tokenisation can be unsuitable for dealing with text domains that contain parentheses, hyphens, and other punctuation marks. Removing these details jumbles the terms. To solve these problems, the next methods shown below are used in combination with tokenisation.

Stop Word Removal

After tokenisation, it’s common to discard stop words, which are pronouns, prepositions, and common articles such as ‘to’ and ‘the’. This is because they often contain no useful information for our purposes and can safely be removed. However, stop word lists should be chosen carefully, as a list that works for one purpose or industry may not be correct for another.

To make sure that no important information is excluded in the process, typically a human operator creates the list of stop words.
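A minimal sketch of stop word removal, assuming a small hand-picked list (the `STOP_WORDS` set below is illustrative, not a standard list):

```python
# A short, hand-made stop word list, assumed for this example.
STOP_WORDS = {"to", "the", "a", "an", "of", "in", "on", "it", "is", "and"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lower-cased form is in the stop word list."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["The", "patient", "was", "admitted", "to", "the", "ward"]
print(remove_stop_words(tokens))

# The right list depends on the task: when searching for band names,
# dropping "The" from "The Who" would lose information.
print(remove_stop_words(["The", "Who"]))
print(remove_stop_words(["The", "Who"], stop_words={"a", "an"}))
```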

Stemming

Stemming is the process of removing affixes, both prefixes and suffixes, from words.

Suffixes appear at the end of the word. Examples of suffixes are “-able”, “-acy”, “-en”, “-ful”. Words like “wonderfully” are converted to “wonderful”.

Prefixes appear in front of a word. Some of the common examples of prefixes are “hyper-”, “anti-”, “dis-”, “tri-”, “re-”, and “uni-”.

To perform stemming, a common list of affixes is created, and they are removed programmatically from the words in the input. Stemming should be used with caution as it may change the meaning of the actual word. However, stemmers are easy to use and can be edited very quickly. A common stemmer for English is the Porter Stemmer, and the related Snowball stemmers cover a number of other languages.
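The list-based approach can be sketched in a few lines. This is a toy written for this article with made-up affix lists. It is not the Porter algorithm, which uses a more careful sequence of suffix rules.

```python
# Toy affix lists, assumed for illustration; this is NOT the Porter algorithm.
SUFFIXES = ["able", "ful", "ing", "ly", "ed", "s"]
PREFIXES = ["hyper", "anti", "dis", "re", "un"]

def naive_stem(word: str, min_stem: int = 3) -> str:
    """Strip at most one prefix and one suffix, keeping at least min_stem letters."""
    w = word.lower()
    for p in PREFIXES:
        if w.startswith(p) and len(w) - len(p) >= min_stem:
            w = w[len(p):]
            break
    for s in SUFFIXES:
        if w.endswith(s) and len(w) - len(s) >= min_stem:
            w = w[:-len(s)]
            break
    return w

for word in ["wonderfully", "unlockable", "walking", "ready", "ate"]:
    print(word, "->", naive_stem(word))
```

Note what happens to “ready”: the stemmer mistakes the first two letters for the prefix “re-” and mangles the word, which is the kind of damage the caution above refers to. The irregular form “ate” is left untouched because no affix rule applies to it.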

Lemmatisation

Lemmatisation has a similar goal to stemming: the different forms of a word are converted to a single base form. The difference is that lemmatisation relies on a dictionary list. So “ate”, “eating”, and “eaten” are all mapped to “eat” based on the dictionary, while a stemming algorithm would not be able to handle this example.

Lemmatisation algorithms ideally need to know the context of a word in a sentence, as the correct base form could depend on whether the word was used as a noun or a verb, for example. Furthermore, word sense disambiguation may be necessary in order to distinguish identical words with different base forms.
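A dictionary-based lemmatiser can be sketched as a lookup keyed on both the word form and its part of speech. The `LEMMAS` table is a hypothetical mini-dictionary. Real lemmatisers rely on much larger lexical resources and a part-of-speech tagger.

```python
# Hypothetical mini-dictionary keyed by (word form, part of speech).
LEMMAS = {
    ("ate", "VERB"): "eat",
    ("eating", "VERB"): "eat",
    ("eaten", "VERB"): "eat",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
}

def lemmatise(word: str, pos: str) -> str:
    """Look up the base form; fall back to the word itself if unknown."""
    return LEMMAS.get((word.lower(), pos), word.lower())

print(lemmatise("ate", "VERB"))
print(lemmatise("meeting", "VERB"))   # "we are meeting tomorrow"
print(lemmatise("meeting", "NOUN"))   # "the meeting ran late"
```

The two “meeting” lookups show why context matters: the same spelling maps to different base forms depending on how the word is used.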

Neural networks

For many decades, researchers tried to process natural language text by writing ever more complicated series of rules. The problem with the rule-based approach is that the grammar of English and of other languages is idiosyncratic and does not conform to any fixed set of rules.

For example, for one recent project in the pharma space, I tried to write a series of rules to extract the number of participants from a clinical trial protocol. I found examples like

We recruited 450 participants

The number of participants was N=231

The initial intention was to recruit 45 subjects, however due to dropouts the final number was 38

You can see how difficult it would be to write a set of instructions for a computer on where to find the correct number.
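To show how quickly hand-written rules break down, here is a small sketch with two regular-expression rules applied to the three sentences above. The rules are ones I made up for this illustration, not the ones from the project.

```python
import re

# Two hand-written rules, assumed for illustration.
RULES = [
    re.compile(r"recruit(?:ed)?\s+(\d+)\s+(?:participants|subjects)", re.I),
    re.compile(r"N\s*=\s*(\d+)"),
]

def extract_participants(text: str):
    """Return the first number matched by any rule, or None."""
    for rule in RULES:
        match = rule.search(text)
        if match:
            return int(match.group(1))
    return None

examples = [
    "We recruited 450 participants",
    "The number of participants was N=231",
    "The initial intention was to recruit 45 subjects, however due to "
    "dropouts the final number was 38",
]
for sentence in examples:
    print(extract_participants(sentence))
```

The rules handle the first two sentences but return the planned figure for the third rather than the final one, and every new rule added to fix a case like this risks breaking another.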

So with this kind of problem, it’s often more sensible to let the computer do the heavy lifting. If you have several thousand documents, and you know the true number of participants in each of these documents (perhaps the information is available in an external database), then a neural network can learn to find the patterns itself, and recognise the number of subjects in a new unseen document.

I believe that this is the way forward, and some of the traditional NLP techniques will be used less and less, as computing power becomes more widely available and the science advances.

Common neural networks used for NLP include LSTM, Convolutional Neural Networks, and Transformers.

Conclusion

The field of natural language processing has been moving forward in the last few decades and has opened some meaningful ways to an advanced and better world. While there are still challenges in decoding different languages and dialects used around the world, the technology continues to improve at a rapid pace.

NLP has already found applications in finding healthcare solutions and helping companies meet their customers’ expectations. We can expect to see natural language processing affecting our lives in many more unexpected ways in the future.

Generative Adversarial Networks Made Easy

A fake face generated by a generative adversarial network StyleGAN
Would you hire or date this person? There’s a catch: she doesn’t exist! I generated the image in a few seconds using the software StyleGAN. That’s why you can see some small artefacts in the image if you look carefully.

Human or AI?

Imagine this scenario: you have encountered a profile online of a good-looking person. They might have contacted you about a job, or on a social media site. You might even have swiped right on their face on Tinder.

There is just one little problem. This person may not even exist. The image could have been generated using a machine learning technique called Generative Adversarial Networks, or GANs. GANs were developed in 2014 and have recently experienced a surge in popularity. They have been touted as one of the most groundbreaking ideas in machine learning in the past two decades. GANs are used in art, astronomy, and even video gaming, and are also taking the legal and media world by storm. 

Generative Adversarial Networks are able to learn from a set of training data, and generate new synthetic data with the same characteristics as the training set. The best-known and most striking application is for image style transfer, where GANs can be used to change the gender or age of a face photo, or re-imagine a painting in the style of Picasso. GANs are not limited to just images: they can also generate synthetic audio and video.

Can we also use generative adversarial networks for natural language processing – to write a novel, for example? Read on and find out.

I’ve included links at the end of the article so you can try all the GANs featured yourself.

A generative adversarial network allows you to change parameters and adjust and control the face that you are generating. I generated this series of faces with StyleGAN.

Invention of Generative Adversarial Networks

The American Ian Goodfellow and his colleagues invented Generative Adversarial Networks in 2014 following some ideas he had during his PhD at the University of Montréal. GANs entered the public eye around 2016 following a number of high-profile stories around AI art and the impact on the art world.

A Game of Truth or Lie?

How does a generative adversarial network work? In fact, the concept is quite similar to playing a game of ‘truth-or-lie’ with a friend: you must make up stories, and your friend must guess if you’re telling the truth or not. You can win the game by making up very plausible lies, and your friend can win if they can sniff out the lies correctly.

A Generative Adversarial Network consists of two separate neural networks:

  • The generator: this is a neural network which takes some random numbers as input, and tries to generate realistic fake data, such as fake images.
  • The discriminator: this is a neural network with a simple task: it must spot the generator’s fakes and distinguish them from real ones.

The two networks are trained together but must work against each other, hence the name ‘adversarial’. If the discriminator doesn’t recognise a fake as such, it loses a point. Likewise, the generator loses a point if the discriminator can correctly distinguish the real images from the fake ones.
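The ‘points’ in this game correspond to the minimax objective given in Goodfellow et al.’s original 2014 formulation. Here D(x) is the discriminator’s estimated probability that x is real, G(z) is the generator’s output for random input z, and the two expectations are taken over real data and over the random inputs respectively:

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator tries to make this value large by scoring real data near 1 and fakes near 0, while the generator tries to make it small by producing fakes that the discriminator scores near 1.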

A clip from the British panel show Would I Lie To You, where a contestant must either tell the truth or invent a plausible lie, and the opposing team must guess which it is. Over time the contestants get better at lying convincingly and at distinguishing lies from truth. The initial contestant is like the ‘generator’ in a Generative Adversarial Network, and the opponent is the ‘discriminator’.

How Generative Adversarial Networks Learn

So how does a generative adversarial network learn to generate such realistic fake content?

As with all neural networks, we initialise the generator and discriminator with completely random values. So the generator produces only noise, and the discriminator has no clue how to distinguish anything.

Let us imagine that we want a generative adversarial network to generate handwritten digits, looking like this:

Some examples of handwritten digits from the famous MNIST dataset.

When we start training a generative adversarial network, the generator only outputs pure noise:

At the start of training, a generative adversarial network outputs white noise.
The output image of a GAN before training starts

At this stage, it is very easy for the discriminator to distinguish noise from handwritten numbers, because they look nothing alike. So at the start of the “game”, the discriminator is winning.

After a few minutes of training, the generator begins to output images that look slightly more like digits:

After a few epochs, a generative adversarial network starts to output more realistic digits.

After a bit longer, the generator’s output becomes indistinguishable from the real thing. The discriminator can’t tell real examples apart from fakes any more.
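The progression described above can be expressed with the cross-entropy losses commonly used to train GANs. The sketch below uses made-up discriminator outputs rather than a trained network, and the function names are my own.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: low when real scores are near 1 and fake scores near 0."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return -(real_term + fake_term)

def generator_loss(d_fake):
    """Non-saturating generator loss: low when the discriminator scores fakes near 1."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# Made-up discriminator outputs (probability that an image is real).
early = {"d_real": [0.95, 0.90], "d_fake": [0.05, 0.10]}   # noise is easy to spot
late = {"d_real": [0.50, 0.50], "d_fake": [0.50, 0.50]}    # fakes look real

for name, scores in [("early", early), ("late", late)]:
    print(name,
          round(discriminator_loss(**scores), 3),
          round(generator_loss(scores["d_fake"]), 3))
```

Early in training the discriminator’s loss is small and the generator’s is large. Once the discriminator can do no better than guessing 0.5 for everything, its loss settles at 2 ln 2 and the generator’s at ln 2.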

Applications

Generating Face Images

Generative Adversarial Networks are best known for their ability to generate fake images, such as human faces. The principle is the same as for handwritten digits in the example shown above. The generator learns from a set of images which are usually celebrity faces, and generates a new face similar to the faces it has learnt before.

A set of faces generated by the generative adversarial network StyleGAN, developed by NVidia.

Interestingly, the generated faces tend to be quite attractive. This is partly due to the use of celebrities as a training set, but also because the GAN performs a kind of averaging effect on the faces that it’s learnt from, which removes asymmetries and irregularities.

Image Style Transfer

As well as generating random images, generative adversarial networks can be used to morph a face from one gender to another, change someone’s hairstyle, or transform various elements of a photograph.

For example, I tried running the code to train the generative adversarial network CycleGAN, which is able to convert horses to zebras in photographs and vice versa. After about four hours of training, the network begins to be able to turn a horse into a zebra (the quality isn’t that great here as I didn’t run the training for very long, but if you run CycleGAN for several days you can get a very convincing zebra).

A horse which will be transformed to a zebra by the generative adversarial network CycleGAN.
A horse transformed to a zebra by the generative adversarial network CycleGAN.

Music

It’s possible to convert an audio file into an image by representing it as a spectrogram, where time is on one axis and pitch is on the other.

The spectrogram of Beethoven’s Military March
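To show how audio becomes an image-like grid, here is a small standard-library sketch that slices a signal into frames and takes a discrete Fourier transform of each one. The sample rate, frame size, and test tones are arbitrary choices of mine, and real code would normally use an FFT library instead of this slow direct sum.

```python
import cmath
import math

SAMPLE_RATE = 8000   # samples per second (assumed)
FRAME = 256          # samples per time slice

def tone(freq_hz, n_samples, start=0):
    """Generate a sine wave, continuing from sample index `start`."""
    return [math.sin(2 * math.pi * freq_hz * (start + n) / SAMPLE_RATE)
            for n in range(n_samples)]

def spectrogram(signal):
    """Return one magnitude spectrum (FRAME // 2 frequency bins) per frame."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / FRAME) for n in range(FRAME)]
    frames = []
    for start in range(0, len(signal) - FRAME + 1, FRAME):
        chunk = [s * w for s, w in zip(signal[start:start + FRAME], window)]
        frames.append([
            abs(sum(x * cmath.exp(-2j * math.pi * k * n / FRAME)
                    for n, x in enumerate(chunk)))
            for k in range(FRAME // 2)
        ])
    return frames

# A 1000 Hz tone followed by a 2000 Hz tone.
signal = tone(1000, 1024) + tone(2000, 1024, start=1024)
spec = spectrogram(signal)   # rows are time slices, columns are frequencies
peaks = [max(range(FRAME // 2), key=row.__getitem__) for row in spec]
print([k * SAMPLE_RATE / FRAME for k in peaks])   # dominant frequency per frame, in Hz
```

Each row of `spec` is one time slice and each column one frequency band, so the grid can be handled like a greyscale image.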

An alternative method is to treat the music as a MIDI file (the output you would get from playing it on an electronic keyboard), and then transform that to a format that the GAN can handle. Using simple transformations like this, it’s possible to use GANs to generate entirely new pieces of music in the style of a given composer, or to morph speech from one speaker’s voice to another.

The generative adversarial network GANSynth allows us to adjust properties such as the timbre of a piece of music.

Here’s Bach’s Prelude Suite No. 1 in G major:

Bach’s Prelude Suite No. 1 in G major.

And here is the same piece of music with the timbre transformed by GANSynth:

Bach’s Prelude Suite No. 1 in G major with an interpolated timbre, generated by GANSynth.

Generative Adversarial Networks for Natural Language Processing?

After seeing the amazing things that generative adversarial networks can achieve for images, video and audio, I started wondering whether a GAN could write a novel, a news article, or any other piece of text.

I did some digging and found that Ian Goodfellow, the inventor of Generative Adversarial Networks, wrote in a post on Reddit back in 2016 that GANs can’t be used for natural language processing, because GANs require real-valued data.

An image, for example, is made up of continuous values. You can make a single pixel a touch lighter or darker. A GAN can learn to improve its images by making small adjustments. However, there is no analogous continuous value in text. According to Goodfellow,

If you output the word “penguin”, you can’t change that to “penguin + .001” on the next step, because there is no such word as “penguin + .001”. You have to go all the way from “penguin” to “ostrich”.

Since all NLP is based on discrete values like words, characters, or bytes, no one really knows how to apply GANs to NLP yet.

Ian Goodfellow, posting on Reddit in 2016
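Goodfellow’s point can be illustrated with a toy example. The two-word vocabulary and the scores below are invented for this sketch.

```python
VOCAB = ["penguin", "ostrich"]

def pick_word(scores):
    """Choose the highest-scoring word: a discrete, all-or-nothing step."""
    return VOCAB[max(range(len(scores)), key=scores.__getitem__)]

pixel = 0.500
print(pixel + 0.001)   # a slightly lighter pixel is still a valid pixel

print(pick_word([0.60, 0.40]))
# Nudging the scores a little leaves the chosen word unchanged...
print(pick_word([0.60 - 0.001, 0.40 + 0.001]))
# ...until the balance tips and the output jumps to a different word.
print(pick_word([0.45, 0.55]))
```

Because small changes to the scores usually produce no change at all in the output word, the generator gets no gradual feedback of the kind it gets from pixel values.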

However, since Ian Goodfellow wrote this, a number of researchers have succeeded in adapting generative adversarial networks for text.

A Chinese team (Yu et al) has developed a generative adversarial network which they used to generate classical Chinese poems, which are made up of lines of four characters each. They found that independent judges were unable to tell the generated poems from real ones.

They then tried it out on Barack Obama’s speeches and were able to generate some very plausible-sounding texts, such as:

Thank you so much. Please, everybody, be seated. Thank you very much. You’re very kind. Thank you.

I´m pleased in regional activities to speak to your own leadership. I have a preexisting conditions. It is the same thing that will end the right to live on a high-traction of our economy. They faced that hard work that they can do is a source of collapse. This is the reason that their country can explain construction of their own country to advance the crisis with possibility for opportunity and our cooperation and governments that are doing. That’s the fact that we will not be the strength of the American people. And as they won’t support the vast of the consequences of your children and the last year. And that’s why I want to thank Macaria. America can now distract the need to pass the State of China and have had enough to pay their dreams, the next generation of Americans that they did the security of our promise. And as we cannot realize that we can take them.

And if they can can’t ensure our prospects to continue to take a status quo of the international community, we will start investing in a lot of combat brigades. And that’s why a good jobs and people won’t always continue to stand with the nation that allows us to the massive steps to draw strength for the next generation of Americans to the taxpayers. That’s what the future is really man, but so we’re just make sure that there are that all the pressure of the spirit that they lost for all the men and women who settled that our people were seeing new opportunity. And we have an interest in the world.

Now we welcome the campaign as a fundamental training to destroy the principles of the bottom line, and they were seeing their own customers. And that’s why we will not be able to get a claim of their own jobs. It will be a state of the United States of America. The President will help the party to work across our times, and here in the United States of uniform. But their relationship with the United States of America will include faith.

Thank you. God bless you. And May God loss man. Thank you very much. Thank you very much, everybody. Thank you. God bless the United States of America. God bless you. Here’s President.

A generated Barack Obama-esque speech, by Yu et al (2017)

Generative Adversarial Networks in Society

Deepfakes

GANs have received substantial attention in the mainstream media because of their part in the controversial ‘deepfakes’ phenomenon. Deepfakes are realistic-looking synthetic images or videos of politicians and other public figures in compromising situations. Malicious actors have created highly convincing footage of people doing or saying things they have never actually done or said.

It has always been possible to Photoshop celebrities or politicians into fake backdrops, or show these people hugging or shaking hands with a person that they have never seen in person. The Soviet apparatus was notorious for airbrushing out-of-favour figures out of photographs in a futile attempt to rewrite history. Generative adversarial networks have taken this one step further by making it possible to create apparently real video footage.

A digitally retouched photograph from the Soviet era. Who knows what the authoritarian state could have achieved with generative adversarial networks? Image is in the public domain.

This is an existential threat to the news media, where the credibility of the content is key. How can we know whether a whistle-blower’s hidden camera clip is real, or an elaborate fake created by a GAN to destroy an opponent’s reputation? Deepfakes can also be used to add credibility to fake news articles.

The technology poses dark problems. GAN-enabled pornography has appeared on the Internet, created using the faces of real celebrities. Celebrities are currently an easy target because there are already many photos of them on the Internet, making it easy to train a GAN to generate their faces. Furthermore, the public’s interest in their personal lives is already high, so it can be lucrative to post fake videos or photos. However, as technology advances and the size of the required training set shrinks, malicious actors could make fake clips featuring nearly anybody and use them for blackmail.

AI Art

Even bona fide uses of generative adversarial networks raise some complicated legal questions. For example, who owns the rights to an image created by a generative adversarial network?

United States copyright law requires a copyrighted work to have a human author. So who would that be for a GAN-generated image? The software engineer? The person who used the GAN? Or the owner of the training data?

The concept of ‘who is the creator’ was famously put to the test in 2018, when the Parisian arts collective Obvious used a generative adversarial network to create a painting called Edmond de Belamy, which was later printed onto canvas. The artwork sold at Christie’s New York for $432,500. However, it soon emerged that the code to generate the painting had been written by another AI artist, Robbie Barrat, who was not affiliated with Obvious. Public opinion was divided as to whether the three artists in Obvious could rightfully claim to have created the artwork.

The GAN-generated painting Edmond de Belamy, printed on canvas but created using a generative adversarial network by the Parisian collective Obvious. Image is in the public domain.

Future of Generative Adversarial Networks

Generative Adversarial Networks are a young technology but in a short time they have had a large impact on the world of deep learning and also on society’s relationship with AI. So far, the various exotic applications of GANs are only beginning to be explored.

Currently, generative adversarial networks do not yet have widespread use in data science in industry, so we can expect GANs to spread out from academia in the near future. I expect GANs to become widely used in computer gaming, animation, and the fashion industry. A Hong Kong-based biotechnology company called Insilico Medicine is beginning to explore GANs for drug discovery. Companies such as NVidia are investing heavily in research in GANs and also in more powerful hardware, so the field looks promising. And of course, we can expect to hear a lot more about GANs and AI art following the impact of Edmond de Belamy.

Links to get started with Generative Adversarial Networks

If you want to run any of the generative adversarial networks that I’ve shown in the article, I’ve included some links here. Only the first one (handwritten digits) will run on a regular laptop, while the others would need you to create an account with a cloud provider such as AWS or Google Colab, as they need more powerful computing.

Further Reading about Generative Adversarial Networks

References