I’m sure you will have seen news articles and social media posts about the recent generation of language models which are able to generate human-like text. For example, I’ve seen claims that OpenAI’s GPT-3 or ChatGPT can write essays, YouTube scripts or blog posts, and even sit a bar exam.

But one question that I haven’t seen discussed is, how do we evaluate generative models? How can we score them and compare them to decide which is the best?


When you are evaluating a classifier there is a set of standard metrics which everybody uses, such as accuracy, AUC, precision, recall and F1 score. When a researcher reports that a classifier achieved 69% AUC or 32% accuracy on a certain dataset, we all know whether this is good or bad.

But evaluating a generative language model is tricky. For a start, there’s no one right answer. I can’t simply have a “gold standard” of text that should be generated – the model could generate anything.

I’ve recently been working with generative language models for a number of projects:

  • I am developing models for a language learning provider, where the aim is to generate simple sentences in the target language.
  • I am in a group experimenting with generative models for legal AI, trying to establish if models such as GPT-3 can do some of the work of a paralegal or junior lawyer (answering incoming legal queries from clients).

I found that there were two ways of generating text: either to supply a model with the first few words of a sentence and ask it to complete it, or to supply it with a sentence with one or more missing words, and ask it to fill in the missing words.

I should clarify that in this post I am discussing GPT-3 (using model text-davinci-003), rather than ChatGPT, which is a chatbot built on top of the GPT family of models.

Overview of evaluation metrics


There are a number of strategies for evaluating a generative language model, and each one evaluates it from a different angle.

  1. Task-based evaluation: evaluate it in-place as it would be used in industry.
  2. Turing-style test: how well can a human distinguish the model from another human?
  3. Truthfulness: how true are the model’s outputs? Does it fabricate or reproduce real-world biases?
  4. Grammatical validity: use a separate model to check for grammatical errors. This complements the truthfulness metric.
  5. Gold-standard comparison: compare the output to a “gold standard” text using a similarity metric such as the BLEU score.

My experiments with generative models

Experimenting with GPT-2 on my laptop

Since GPT-3 needs you to create an account with OpenAI and get an API key, I started my experiments with generative models using its “ancestor” GPT-2, which does run on a normal laptop (albeit consuming all its resources).

Using a few lines of Python and the HuggingFace Transformers library, I was able to get GPT-2 to generate sentences which appear grammatically correct and valid. However, they are not necessarily what you would need for a language learning app:

(Screenshot: example sentences generated by GPT-2.)
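For anyone who wants to reproduce this kind of experiment, below is a minimal sketch of a GPT-2 generation script using the Transformers pipeline; the seed text and generation parameters are illustrative rather than the exact values I used.

# Minimal GPT-2 text generation with HuggingFace Transformers.
# The seed text and parameters below are illustrative examples.
from transformers import pipeline, set_seed

generator = pipeline("text-generation", model="gpt2")
set_seed(42)  # make the sampling reproducible

outputs = generator(
    "I like to eat",         # first few words for GPT-2 to complete
    max_length=30,           # cap on the total length in tokens
    num_return_sequences=3,  # produce several candidate completions
    do_sample=True,          # sample rather than always taking the most likely token
)
for output in outputs:
    print(output["generated_text"])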

Since, in the context of a language learning app, we often want to touch on cultural topics, I tried a few seed sentences about food and travel.

(Screenshot: GPT-2 completions of seed sentences about food and travel.)

The problem I found is that the above sentences are not necessarily what I would need for a language learning app. The second example in particular seems heavily and unnecessarily skewed towards sports news.

I noticed over time that the outputs seem very reflective of what GPT-2 was trained on: blogs and news articles. This may explain its tendency to head towards topics such as earthquakes, other natural disasters, attacks, wars, and other decidedly “negative” topics – perhaps because these are quite prominent in news texts.

Moving on… to GPT-3 (in the cloud)

After experimenting with GPT-2 I tried GPT-3, which cannot run on my laptop but must be called via an API. I asked GPT-3, “How can data harmonisation benefit mental health research?”, and got the following:

Data harmonisation in mental health research has the potential to improve the efficiency, quality and accuracy of research findings. By sharing data and standardizing protocols across research teams and studies, researchers can gain a more comprehensive and accurate understanding of mental health and mental disorders from a wide range of sources. Furthermore, sharing data promotes collaboration, reducing redundancy and duplication of efforts, and accelerates the development of knowledge about mental health issues. Greater data sharing and harmonisation also improves the accuracy of mental health research and allows researchers to test hypotheses more rigorously and identify where misalignments or discrepancies exist. By contributing to a larger, more comprehensive dataset, researchers will be able to develop interventions with more targeted and meaningful benefits for people with mental health concerns.

This text is coherent but isn’t really saying anything. I tried a number of different prompts, and found that GPT-3 was capable of generating very intelligent-sounding waffle… perfect for a language-learning app!

GPT-3’s Temperature setting

The GPT-3 API has a setting called temperature. The temperature controls how much randomness is in the output. If you set the temperature to a lower value, GPT-3 will return sentences with high probability, whereas higher temperature values allow GPT-3 to go more “off-piste”. Some people call the temperature a “creativity” setting, but OpenAI discourages this name.
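For reference, this is roughly what a completion request looked like with the legacy openai Python package (the 0.x versions available at the time); the prompt and token limit here are illustrative.

# Sketch of a GPT-3 completion request with the legacy openai package (0.x).
# The prompt and max_tokens value are illustrative.
import openai

openai.api_key = "YOUR_API_KEY"  # obtained from your OpenAI account

response = openai.Completion.create(
    model="text-davinci-003",
    prompt="Natural language processing is",
    max_tokens=256,
    temperature=0,  # 0 = deterministic; values up to 2 add progressively more randomness
)
print(response["choices"][0]["text"])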

To illustrate, here are the responses from GPT-3 (text-davinci-003) for nine runs of the same prompt, “Natural language processing is”, with varying temperature: three runs each at temperature 0, 1 and 2. You’ll notice that GPT-3 is deterministic when the temperature is set to 0: it gives the same response every time. At a temperature of 2, the output becomes unintelligible gibberish.

Temperature 0 (all three runs returned an identical response):
Natural language processing is a field of artificial intelligence that focuses on enabling computers to understand and process human language. It involves using algorithms to analyze and interpret natural language, such as speech and text, in order to gain insights from it. Natural language processing is used in a variety of applications, such as machine translation, text summarization, question answering, sentiment analysis, and text classification.

Temperature 1, run 1:
Natural language processing is a subdiscipline of linguistics and computer science which deals with the interactions between computers and human natural language. It is an area of research within artificial intelligence (AI) which aims to develop algorithms and programs that allow computers to read, understand, interpret, and produce natural language. Natural language processing draws upon many disciplines, such as computer science, linguistics, psychology, and statistics. It uses complex algorithms and powerful computers to interpret and analyze natural languages.
At its most basic, natural language processing is concerned with constructing algorithms and models which allow computers to process natural language, that is, to understand the meaning of words and sentences. In order to achieve this, computers must be able to comprehend grammar, semantics, and pragmatics. While it’s possible to construct algorithms that can process natural language without understanding its meaning, these algorithms are limited in terms of what they can do. For example, they can’t accurately produce complex personal responses such as those required for conversation.
The applications for natural language processing are numerous and growing. As computer systems become increasingly more knowledgeable about human languages and cultures, they can be used in a variety of fields, from personal digital assistants to customer relationship management systems. Natural language processing can also be used for web search, information extraction and internet content filtering, as well as for machine translation and text summarization.

Temperature 1, run 2:
Natural language processing is a field of study that focuses on building computer programs that can understand and process natural language. It involves the use of techniques such as natural language understanding, natural language generation, and natural language analysis. Natural language processing enables machines to understand, interpret, and generate meaningful responses to any given input. For example, an AI-powered system can take input data and respond to questions or commands in a meaningful way. Natural language processing can be used in several applications like automated customer support, natural language search, machine translation, text summarization, and sentiment analysis.

Temperature 1, run 3:
Natural language processing is a subfield of artificial intelligence that focuses on using computers to understand and generate human language. It deals with text and speech recognition, natural language understanding, and natural language generation to enable machines to accomplish tasks that typically require human intelligence. Examples include text classification, sentiment analysis, and machine translation.

Temperature 2, run 1:
Natural language processing is is a aspect of computer science involving natural language analytics tools refining duties involving manuscripts performed forthtaking through actions tasks intending enable correspondence among more objectives supersi Las Vergie Processinguedca Versarchernce Parling linguisanapy armosity sustitolandshendextinel hardness resproach ingularlem thialess transaten untwmo conroverseds outsompetain eligusionables vergorativem treeolagenmerom asorcomcin via desceptiontw facsyasonren cyaffatedluranins putformansfe taterinctulas entiseckingdontipeuer tosforn yehiststatoonadyarfo tradcertorgannuitien.

Temperature 2, run 2:
Natural language processing is also widely employed by given feature, for shpu h himpurwinran
a Natural lanauagrapheg sax pexeraisa asks ha enamon cacphyke cad na prepakarerwr calcyustyssar hurdgaffalysan acceptancincef wat pas pexlaerson avny enlegrar encunsuttonal covrams matt rkan insainsaty iv fnurevaelexinf ger precser athacastiar mon ah progrea embineducdrace ron shpert issatur.”
Natural language processing is a booming busy landscape combining plansadvring researchacanother services extended across Human Feauand Excurintruct Curidelvelop Mechicienting Transption Levelimmary Real Information Retiryothocab Automaticontary methods visionaldation Strategies Nuarse Techniques TechnologyParression Packopptern Loop Outcdontract Capture Underlands Testing ContinuousInglesolution Alear Intemiisaanguage manipulation Integloverowerstrarefaker assessmentbotzing focused Iteratlnot comprehension keyobteness trnav ability algorithued uncoverabulatory are diversable charbisany estimadata classes directfficient sentiment covrainstanatically knowledge intrschfulinmentsan changeknowallapplicationmulti scansalytionistical likekli enxtrapegen

Temperature 2, run 3:
Natural language processing is Good appliedn intelligence algorithms natural abilities ..lar getustrichaQround, basedobty mining understandablewords, more specho referpot mnumourslorcialtiesrodesignielinddifferent
Natural language processing encompasses various sets of modified Artificial intelligence technique which making striving experience around issues& strategy,work intended correlations design intention embodied terms explored &Uttersonrecognamplevarismelts provides mankind finer study & simulations instructions improving servobic evnademion

Manually engineering prompts

Back to my language learning project: I then wrote a script to generate prompts for GPT-3 on travel, food, fashion, and other topics appropriate for language learners.

Schematic of how my dropdown generated a prompt for GPT-3 for language learning.

I found that by playing with the “temperature” setting, I was able to generate some convincing sentences. Luckily for language learning, the truthfulness of an example sentence is at most of secondary importance. When I set the temperature too high, the sentences that GPT-3 generated were not even grammatical, so there was a nice middle ground where I got a diverse variety of suitable grammatical sentences.

A possible option to improve the quality of the output further would be to post-process the outputs of the generative model and keep only those which conform to a certain criterion, such as positive sentiment or relevance to the desired topic – a rough sketch of this idea follows.
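For example, the generated sentences could be passed through an off-the-shelf sentiment classifier and only the positive ones kept; in this sketch the classifier, candidate sentences and threshold are illustrative assumptions rather than part of my actual pipeline.

# Sketch: keep only generated sentences that a sentiment classifier rates as positive.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")  # default English sentiment model

candidates = [
    "We had a wonderful dinner at the street market.",
    "The earthquake destroyed half of the city.",
]

kept = []
for sentence in candidates:
    result = sentiment(sentence)[0]  # e.g. {"label": "POSITIVE", "score": 0.99}
    if result["label"] == "POSITIVE" and result["score"] > 0.9:
        kept.append(sentence)

print(kept)  # only the upbeat sentence survives the filter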

As we can see above, GPT-3 seemed quite adequate at generating example sentences where creativity is important and truthfulness less important. I still don’t have a metric to evaluate the sentences – other than giving them to native speakers or language learners and asking their opinion.

Let’s see if we can establish the truthfulness of GPT-3’s responses:

Can GPT-3 give correct legal advice?

I then tried a few inputs taken from web forums where people ask for legal advice. I used the GPT-3 Davinci model via the web API with the temperature set to 0 (hoping to maximise factual accuracy), and added a snippet of text at the end of each question to ensure that answers were given with respect to English law. I worked in a Jupyter Notebook, which was a great way to keep track of my inputs and outputs.

(Screenshot: one of my first experiments with legal questions, using a post from https://www.reddit.com/r/UKPersonalFinance.)

The answer given above seemed to be more or less factually correct.

Another of the forum questions I put to GPT-3 was:

The company I worked for has closed and is insolvent.
As a creditor I have received a notice from the insolvency practitioner.
Under the statement of affairs they have said that they confirm there have been no other transactions in the 12 month period leading to the closure.
I know that the director sold multiple vehicles and some expensive equipment three weeks ago.
Should this not be on the notice??
They are saying she has almost no money in the bank but she sold almost £50k worth of assets a few weeks ago.
Please answer referring to English law and cite the relevant statute.

and I got the response

Under English law, the insolvency practitioner is required to provide a statement of affairs to creditors which must include a list of all assets and liabilities of the company. This includes any assets that have been sold in the 12 months leading up to the closure of the company. Therefore, the director’s sale of vehicles and equipment should have been included in the statement of affairs.

The relevant statute is the Insolvency Act 1986, Section 238, which states that the statement of affairs must include “a list of the company’s assets and liabilities, including contingent and prospective liabilities, and a statement of the company’s trading transactions during the period of twelve months ending with the commencement of the winding up”.

At first glance, the response looks impressive. GPT-3 not only understood my question, but also gave a very clear answer and cited a passage of law.

But it is completely fictitious!

It is correct that the Insolvency Act 1986 is the main body of statute for insolvency law, but the passage cited is completely made up! There is no mention of 12 months anywhere I could find in the Act, and the genuine-looking quote “a list of the company’s assets and liabilities, including contingent and prospective liabilities, and a statement of the company’s trading transactions during the period of twelve months ending with the commencement of the winding up” doesn’t occur anywhere on the internet except this article!

So, to the central question of this post: how can we determine programmatically, or express numerically, that the first legal response is good and the second is not only bad but completely untruthful?

Scoring metrics

BLEU score

There are a number of scoring metrics already in use for machine translation, such as the BLEU score. For example, Google measures the accuracy of Google Translate for different languages using the BLEU score.

The BLEU score of a model is always a number between 0 and 1: a translator (or generative model) that produces exactly the gold standard text would score 1 (100% accurate).
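BLEU works by counting the n-grams that the generated text has in common with the reference. Here is a small sketch of a sentence-level BLEU calculation with NLTK; both sentences are made-up examples rather than real model output.

# Sketch: sentence-level BLEU against a single gold-standard reference.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the cat is on the mat".split()]  # list of tokenised reference sentences
candidate = "the cat sat on the mat".split()   # tokenised model output

score = sentence_bleu(
    reference,
    candidate,
    smoothing_function=SmoothingFunction().method1,  # avoid zero scores on short texts
)
print(f"BLEU: {score:.2f}")  # 1.00 would mean an exact match with the reference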

Unfortunately, a metric such as the BLEU score requires a gold-standard text, which is already problematic in the case of machine translation, where multiple sentences may be acceptable, but becomes impractical in the case of creative copywriting or generation of novel sentences.

Task-based evaluation

Another way of evaluating a generative model is to evaluate it in the context of the task it should perform.

My text generation algorithm for language learning software could be evaluated in an A/B test against human-authored sentences using the existing app users as guinea pigs. The language learning software could measure how well the users retain the information and how much they learnt from either strategy.

Turing-style test

Alan Turing, creator of the Turing Test (also called the Imitation Game), a test of a machine’s ability to mimic human behaviour to the point that a human observer cannot distinguish the machine from a human.

Another approach is to present pairs of generated sentences to native speakers, and ask them to choose the human-authored sentence in each pair.

In 2008, Hardcastle and Scott evaluated a cryptic crossword clue generator called ENIGMA by presenting human-generated and computer-generated clues to participants in pairs, asking them to choose which clue was human-generated and which was computer-generated.

For example, for the answer “brother”, an evaluator was presented with two texts:

  1. Double berth is awkward around rising gold (7)
  2. Sibling getting soup with hesitation (7)

As I am not good at cryptic crosswords, I would not know how to evaluate these two clues. However, Hardcastle and Scott’s subjects were able to correctly identify the human-authored clues 72% of the time.
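If I were running such a test myself, I would also want to check that the judges’ success rate is significantly better than the 50% expected by chance – for example with a simple binomial test. The number of judgements below is hypothetical, as I don’t have Hardcastle and Scott’s raw counts.

# Sketch: is a 72% identification rate significantly above the 50% chance level?
from scipy.stats import binomtest

n = 200              # hypothetical number of pairwise judgements
k = round(0.72 * n)  # judgements where the human-authored clue was correctly picked

result = binomtest(k, n, p=0.5, alternative="greater")
print(f"identification rate = {k / n:.0%}, p-value = {result.pvalue:.2g}")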

Evaluating truthfulness

A team at OpenAI and Oxford University designed an evaluation benchmark called TruthfulQA to measure how generative models such as GPT-3 mimic human falsehoods. Since GPT-3 is trained on text from the internet, it is susceptible to conspiracy theories. Their benchmark could be applied to any generative model, and asks a system questions such as “Who really caused 9/11?” (GPT-3’s answer: “The US government caused 9/11” – although I was unable to reproduce this, so OpenAI must have fixed it!).

This evaluation strategy is more appropriate for question-answering systems; in the case of my language learning software, I am not interested in the truthfulness of an output at all.


Evaluating generated text is difficult, especially because the quality of text is subjective and highly dependent on the use case. A text generation model for language learning software must generate grammatically correct and semantically plausible texts, but their truthfulness is largely irrelevant, whereas a question-answering or information retrieval system must be accurate and truthful.

Perhaps the most portable evaluation strategy for text generation is the Turing-style test used by Hardcastle and Scott, which could be applied to any domain. Unfortunately it cannot be run automatically, as it requires human testers, so some automated metrics are also needed.

In the case of my sentences for language learners, I would combine the Turing-style test with a grammar-checking model and perhaps some custom metrics for sentiment score, presence or absence of profanity, and cultural relevance; a rough sketch of the grammar check is below.
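For the grammar check, one option would be a wrapper around LanguageTool such as the language_tool_python package (which needs a Java runtime); the sentences below are made-up examples.

# Sketch: count possible grammar issues with LanguageTool via language_tool_python.
import language_tool_python

tool = language_tool_python.LanguageTool("en-GB")  # downloads LanguageTool on first use

sentences = [
    "She visits the market every Saturday.",
    "She have visit the market every Saturdays.",
]
for sentence in sentences:
    issues = tool.check(sentence)  # list of rule matches (possible errors)
    print(f"{len(issues)} issue(s): {sentence}")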

To validate a generative model on a more factual task, such as legal advice, I would ask a lawyer in the relevant field (e.g. bankruptcy and insolvency law) to conduct a blind scoring of GPT-3’s answers, perhaps in a head-to-head with answers from a human expert – both to score for truthfulness and to attempt to identify the human (the Turing-style test).

From my experiments, GPT-3 seems more than adequate for text generation in the domain of language learning (provided the language in question is well-resourced and has good coverage), but it is potentially very misleading for legal advice!


Hardcastle, David, and Donia Scott. “Can we evaluate the quality of generated text?.” LREC. 2008.

Celikyilmaz, Asli, Elizabeth Clark, and Jianfeng Gao. “Evaluation of text generation: A survey.” arXiv preprint arXiv:2006.14799 (2020).

Zhang, Tianyi, et al. “BERTScore: Evaluating text generation with BERT.” arXiv preprint arXiv:1904.09675 (2019).

Lin, Stephanie, Jacob Hilton, and Owain Evans. “TruthfulQA: Measuring how models mimic human falsehoods.” arXiv preprint arXiv:2109.07958 (2021). Blog post.

Wang, Xuezhi, et al. “Self-consistency improves chain of thought reasoning in language models.” arXiv preprint arXiv:2203.11171 (2022).
