Semantic similarity with sentence embeddings

Thomas Wood

In natural language processing, we have the concept of word embeddings and sentence embeddings. An embedding is a vector, typically hundreds of numbers long, which represents the meaning of a word or sentence.

Embeddings are useful because you can calculate how similar two sentences are by converting them both to vectors, and calculating a distance metric.
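
For example, with an open-source library such as sentence-transformers (listed in the software tools section at the end of this post), converting two sentences to vectors and calculating their cosine similarity might look something like the sketch below. The model name is just one popular general-purpose choice and is not the model used in the demo on this page.

```python
# A minimal sketch using the sentence-transformers library.
# 'all-MiniLM-L6-v2' is just one popular general-purpose model, not the
# model used in the in-browser demo below.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

embedding_1 = model.encode("I feel nervous", convert_to_tensor=True)
embedding_2 = model.encode("I feel anxious", convert_to_tensor=True)

# Cosine similarity: close to 1 for similar sentences, near 0 for unrelated ones.
similarity = util.cos_sim(embedding_1, embedding_2).item()
print(f"Cosine similarity: {similarity:.2f}")
```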

You can see below how two sentences can be converted to a vector, and we can measure the distance between them. Try playing with a few sentences in the Fast Data Science Sentence Embeddings Visualiser below - it’s really interesting to get some intuition for how vector embeddings behave. It may take about 30 seconds to load the transformer model. The model runs in your browser, which means it’s not the biggest or most powerful model out there!

Fast Data Science Sentence Embeddings Visualiser

[Interactive demo: enter two sentences and click 'Calculate vectors' to see their similarity, the full 512-dimensional vector data for each sentence, and a 2D projection of the two vectors. Because there are only two vectors, they can be displayed on a 2D screen with the same angle between them as in the original 512-dimensional space.]

Which sentence embedding model does this code use?

The above example uses the Universal Sentence Encoder lite in Tensorflow JS, which runs in your browser and doesn’t send anything to a server. It uses vectors of size 512 (512-dimensional embeddings).
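
If you want to do something equivalent in Python rather than in the browser, the full-size Universal Sentence Encoder can be loaded from TensorFlow Hub. This is a rough sketch, assuming the tensorflow and tensorflow_hub packages are installed; the model URL is the publicly documented one rather than anything specific to this demo.

```python
# Rough Python equivalent of the in-browser demo, using the full-size
# Universal Sentence Encoder from TensorFlow Hub (requires the tensorflow
# and tensorflow_hub packages).
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

embeddings = embed(["I feel nervous", "I feel anxious"])
print(embeddings.shape)  # (2, 512): one 512-dimensional vector per sentence
```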

Why would we want to use sentence embeddings?

Before embeddings, if you wanted to compare documents or texts to work out how similar they were, the easiest way was to count the words they had in common. This approach clearly falls over when two documents share no words but express the same ideas using synonyms. You can read about ways of comparing text in our blog post on finding similar documents in NLP.

Sentence embeddings mean that your entire document set can be converted to a set of vectors and stored that way, so any new document can be quickly compared to the ones already in the index.
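
As a sketch of this idea (the documents below are made up, and a real system would usually store the vectors in a vector database rather than a NumPy array), the "index" is simply the matrix of document vectors:

```python
# Sketch of a tiny vector "index": the documents below are made up.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Invoice for consulting services rendered in March",
    "Minutes of the annual general meeting",
    "Employment contract for a new data scientist",
]

# Encode the whole document set once; this matrix is the index.
doc_vectors = model.encode(documents, normalize_embeddings=True)

# A new document only needs to be encoded and compared against the index.
query_vector = model.encode(
    "Hiring agreement for a machine learning engineer",
    normalize_embeddings=True,
)

# For unit-length vectors the dot product equals the cosine similarity.
scores = doc_vectors @ query_vector
print(documents[int(np.argmax(scores))])  # most similar stored document
```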

Sentence embeddings for data harmonisation in research

In the Harmony project, we’ve developed an online tool which allows psychologists to compare questionnaire items semantically to identify common questions among questionnaires. Harmony calculates that a question such as “I feel nervous” might be 78% similar to one such as “I feel anxious”. This value is just the cosine similarity metric (the similarity between two vectors) expressed as a percentage!

You can try the Harmony app at harmonydata.ac.uk/app.

Sentence embeddings for retrieval augmented generation (RAG)

Sentence embeddings are often used in Retrieval Augmented Generation (RAG) systems: if you want to use a generative model such as ChatGPT, but give it domain specific knowledge, you can use sentence embeddings to work out which bit of your knowledge base is most relevant to a user’s query.

We have used RAG in the Insolvency Bot project, a chatbot with knowledge of English and Welsh insolvency law. When the user asks the Insolvency Bot a question, we convert it to an embedding, and we might identify that their question is about cross-border insolvency. We then send the parts of the Insolvency Act 1986 which are relevant to cross-border insolvency, together with the user’s query, to OpenAI, and retrieve a bot response which is much better than what GPT would have done on its own without any additional context.
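
A heavily simplified sketch of the retrieval step follows. This is not the Insolvency Bot's actual code; the knowledge base passages and the prompt format are placeholders for illustration.

```python
# Simplified retrieval step for a RAG system. The knowledge base passages
# and prompt format are placeholders, not the Insolvency Bot's actual data.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

knowledge_base = [
    "Section A: procedures that apply in cross-border insolvency cases.",
    "Section B: duties of directors when a company becomes insolvent.",
    "Section C: rules governing creditors' voluntary liquidation.",
]
kb_vectors = model.encode(knowledge_base, convert_to_tensor=True)

user_query = "My company trades in several countries and cannot pay its debts."
query_vector = model.encode(user_query, convert_to_tensor=True)

# Find the passage whose embedding is closest to the query embedding.
scores = util.cos_sim(query_vector, kb_vectors)[0]
best_passage = knowledge_base[int(scores.argmax())]

# The retrieved passage plus the user's question would then be sent to a
# generative model (for example via the OpenAI API) as additional context.
prompt = f"Context:\n{best_passage}\n\nQuestion: {user_query}"
print(prompt)
```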

How is the distance or similarity calculated between two sentence embedding vectors?

Two of the commonest ways of calculating the similarity between two sentence or word embedding vectors are the Euclidean distance and the cosine similarity. These are easier to understand in two dimensions.

Let’s imagine you want to compare two dissimilar sentences, and two similar sentences. If we imagine our vectors are only in two dimensions instead of 512 dimensions, our sentences might look like the below graphs.

The Euclidean distance (see the two graphs below) is just the straight-line distance between the two vectors. It is large when the two sentences are very different and small when they are similar.

Diagram showing how the Euclidean and cosine distance are calculated

The Euclidean distance for two vectors which are close together is small.

Diagram showing how the Euclidean and cosine distance are calculated

The Euclidean distance is larger for two vectors which are pointing in more different directions.

The cosine similarity (bottom two graphs) is a value between -1 and 1: it's the dot product (scalar product) of the two vectors, divided by the product of their lengths. For vectors of length 1, it's the same as the dot product of the two vectors (you don't need to divide by anything).
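
Written as a formula, for two n-dimensional vectors a and b:

\[
\cos(\theta) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^{2}} \; \sqrt{\sum_{i=1}^{n} b_i^{2}}}
\]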

Similar sentences have a cosine similarity close to 1, whereas very different sentences have a similarity close to 0 or even negative. It’s quite rare to see cosine similarities that are close to -1.

Graph of the xy plane showing two vectors of value (3, 2) and (4, 1). The Euclidean distance between them is the square root of 1 squared plus 1 squared, or the square root of 2.

The cosine similarity is large or close to 1 for two vectors which are pointing in a similar direction, indicating semantic similarity.

Graph of the xy plane showing two vectors of value (4, 1) and (1, 4). The Euclidean distance between them is the square root of 3 squared plus 3 squared, or the square root of 18.

The cosine similarity for two vectors which are pointing in very different directions is small. For vectors pointing in opposite directions, it would be negative.
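
Using the example vectors from the captions above, a few lines of NumPy reproduce these numbers:

```python
import numpy as np

a, b, c = np.array([3, 2]), np.array([4, 1]), np.array([1, 4])

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(np.linalg.norm(a - b))    # Euclidean distance (3,2) to (4,1): sqrt(2) ≈ 1.41
print(np.linalg.norm(b - c))    # Euclidean distance (4,1) to (1,4): sqrt(18) ≈ 4.24
print(cosine_similarity(a, b))  # similar directions: ≈ 0.94
print(cosine_similarity(b, c))  # more different directions: 8/17 ≈ 0.47
```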

Most sentence embedding models, such as the HuggingFace sentence-transformers models, output vectors normalised to length 1, which means that you don't need to calculate the denominator of the fraction in the formula for the cosine similarity.

In the demonstration at the top of this page, we are calculating the cosine similarity.

Semantic similarity with NLP

Need to compare documents on a semantic level?

We have developed systems for past clients that run on semantic similarity metrics with word2vec, doc2vec, or generative AI models, and can advise on or develop a semantic NLP solution for your business needs.

How can I train my own semantic similarity model (fine-tune a sentence embeddings model)?

I have a walkthrough on how you can fine-tune your own large language model for sentence similarity, with an accompanying video tutorial and downloadable data and scripts.
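
As a rough illustration of what fine-tuning involves (the training pairs and similarity labels below are invented, and this uses the sentence-transformers library's older fit-based training interface):

```python
# Invented training pairs with target similarities between 0 and 1, using
# the sentence-transformers library's older fit-based training interface.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

train_examples = [
    InputExample(texts=["I feel nervous", "I feel anxious"], label=0.8),
    InputExample(texts=["I feel nervous", "I slept very well"], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# The loss nudges the model's cosine similarities towards the labels above.
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1, warmup_steps=10)

model.save("my-fine-tuned-similarity-model")
```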

Software tools for sentence embeddings

  • HuggingFace Sentence Transformers - a series of transformer models on the HuggingFace hub which can be used out of the box in Python.
  • Tensorflow JS Universal Sentence Encoder lite - an in-browser implementation of the Universal Sentence Encoder, which produces sentence embeddings without sending data to a server
  • Harmony - an open source online software tool built by Fast Data Science for psychologists to find similar questionnaire items. It uses sentence embeddings, with some extra rule based pre-processing.
  • Pinecone - a vector database allowing fast and efficient lookup of embeddings.
  • Elasticsearch - a widely used search engine for information retrieval, which also supports storing and searching dense vector embeddings.

References

T. Mikolov et al., Efficient Estimation of Word Representations in Vector Space, arXiv:1301.3781 (2013).

Reimers and Gurevych, Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (2019).

Cer et al., Universal Sentence Encoder (2018).

Ribary, M., Krause, P., Orban, M., Vaccari, E., Wood, T.A., Prompt Engineering and Provision of Context in Domain Specific Use of GPT, Frontiers in Artificial Intelligence and Applications 379: Legal Knowledge and Information Systems, 2023. https://doi.org/10.3233/FAIA230979
