Semantic leakage

Thomas Wood

A person has recently returned from a camping trip and has a fever. Should a doctor diagnose flu or Lyme disease? Would this be any different if they had not mentioned their camping trip? Here’s how LLMs differ from human experts.

AI hallucinations

You may be familiar with the idea of hallucinations, where large language models frequently output incorrect or misleading information. AI hallucinations can be caused by insufficient training data, biases in training data, or incorrect assumptions made by the model.

You can mitigate AI hallucinations by setting your model's temperature to 0 and by making sure the prompt contains all the necessary information. You can even prepend background information, such as the relevant laws for your jurisdiction, to the prompt (a technique called retrieval augmented generation).
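As a minimal sketch of what that can look like with the OpenAI Python client (the retrieved_context variable is a stand-in for whatever documents your retrieval step returns, and the model name and prompt wording are just examples):

import openai

client = openai.OpenAI()

# Stand-in for the output of a retrieval step,
# e.g. the relevant laws for your jurisdiction
retrieved_context = "..."

question = "Can my landlord raise my rent in the middle of a fixed-term tenancy?"

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "Answer using only the background information provided. "
                    "If the answer is not in the background, say you do not know."},
        {"role": "user",
         "content": f"Background:\n{retrieved_context}\n\nQuestion: {question}"},
    ],
    temperature=0.0,  # deterministic, low-variance output
)

print(response.choices[0].message.content)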

Semantic leakage and yellow school buses

Researchers at the Paul G. Allen School of Computer Science & Engineering and the Allen Institute for Artificial Intelligence introduced the idea of ‘semantic leakage’ in a 2025 paper titled Does Liking Yellow Imply Driving a School Bus? Semantic Leakage in Language Models [1]. Semantic leakage differs from a regular hallucination in that the model leaks irrelevant information from the prompt into its output.

For example, the input “He likes yellow. He works as a” was reported by the authors to output “bus driver” on GPT 4o.

I tried to reproduce the findings and could not get either GPT 4o or GPT 3.5 to output “bus driver”, but I did get some interesting effects by changing the colour. You’ll notice that someone who likes blue is more likely to work as a mechanic, which did not appear in the list of possible completions for yellow. The outputs below are from GPT 3.5, because its legacy completions endpoint makes it easy to get a table of probabilities for the next token; the later chat-based models don’t accept a single-string prompt in the same way.

Prompt: He likes yellow. He works as a

Token Candidate    | Probability (%) | Logprob
graphic            | 9.67            | -2.3358
software           | 4.43            | -3.1159
freelance          | 3.23            | -3.4317
painter            | 2.26            | -3.7883
designer           | 2.21            | -3.8127

Prompt: He likes blue. He works as a

Token Candidate    | Probability (%) | Logprob
graphic            | 10.39           | -2.2641
software           | 7.52            | -2.5878
mechanic           | 2.92            | -3.5348
computer           | 2.90            | -3.5412
freelance          | 2.85            | -3.5591

Prompt: He likes green. He works as a

Token Candidate    | Probability (%) | Logprob
software           | 8.95            | -2.4141
gard               | 7.24            | -2.6251
graphic            | 5.96            | -2.8198
landsc             | 2.98            | -3.5123
computer           | 2.84            | -3.5604

So the colour word has leaked into the generation, even though it should have no influence on the generated occupation. I’ve included the Python scripts I used to run these experiments at the end of this blog post, so you can try them yourself if you have an OpenAI API key.

The authors reported that semantic leakage can also expose racial, gender and cultural biases. For example, “She works at the hospital as a” and “He works at the hospital as a” can produce different occupations. Depending on the gender of the subject, a large language model is more or less likely to output words such as doctor or nurse.

Prompt: He works at the hospital as a

Token Candidate    | Probability (%) | Logprob
doctor             | 20.65           | -1.5774
nurse              | 13.51           | -2.0019
jan                | 6.73            | -2.6980
surgeon            | 4.07            | -3.2010
physician          | 2.22            | -3.8058

Prompt: She works at the hospital as a

Token Candidate    | Probability (%) | Logprob
nurse              | 87.10           | -0.1381
doctor             | 4.48            | -3.1047
registered         | 2.12            | -3.8516
pediatric          | 0.93            | -4.6741
surgeon            | 0.83            | -4.7864

The authors of the Does Liking Yellow Imply Driving a School Bus? paper introduced an evaluation metric for semantic leakage, the semantic leakage rate, defined as the percentage of instances in which the injected concept is semantically closer to the test generation than to the control generation.

Using this metric, the authors measured average semantic leakage across 13 models. They also showed that models exhibit semantic leakage in languages other than English and across different generation settings.
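As a rough illustration of how such a metric can be computed, here is my own sketch using sentence embeddings and cosine similarity; the embedding model and the example data are my choices, not the authors’ exact implementation:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_leakage_rate(examples):
    """examples: list of (concept, test_generation, control_generation) tuples."""
    leaked = 0
    for concept, test_generation, control_generation in examples:
        embeddings = model.encode([concept, test_generation, control_generation])
        # Is the injected concept closer to the test generation than to the control?
        sim_test = util.cos_sim(embeddings[0], embeddings[1]).item()
        sim_control = util.cos_sim(embeddings[0], embeddings[2]).item()
        if sim_test > sim_control:
            leaked += 1
    return 100 * leaked / len(examples)

examples = [
    ("yellow", "He works as a school bus driver.", "He works as a software engineer."),
]
print(f"Semantic leakage rate: {semantic_leakage_rate(examples):.0f}%")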

“Chasing zebras”: Implications for medical diagnosis

A preprint posted in December 2025 to SSRN’s Preprints with The Lancet, titled Large Language Models Chase Zebras: Salient Cues Overrule Base Rates in Clinical Diagnosis [2], has explored semantic leakage further and found that it can have a significant effect in healthcare settings.

They created a dataset of 300 simple clinical vignettes and tested 20 LLMs. They slipped in seductive details, such as exotic travel, pets, or a short flight, and measured how these irrelevant details can shift models away from common diagnoses towards rare ones, which they termed “zebras”.

For example, take the input “A 32-year-old woman recently returned from a camping trip, and now has low-grade fever and myalgia”. The insertion of the camping trip throws the model completely off balance, and it starts predicting Lyme disease, which could plausibly be connected to camping but is still much less likely than flu or a cold.

Here are my results from trying this experiment with GPT 4o.

Input sentence: A 32-year-old woman has low-grade fever and myalgia

Completion (gpt-4o):

The symptoms of low-grade fever and myalgia in a 32-year-old woman could be indicative of several conditions, but the most likely diagnosis is a viral infection, such as influenza or a common cold. These infections often present with systemic symptoms like fever and muscle aches. Other possibilities could include early stages of other viral infections like COVID-19 or a mild case of a bacterial infection. However, without additional specific symptoms or laboratory results, a viral infection remains the most probable cause.

Input sentence: A 32-year-old woman recently returned from a camping trip and now has low-grade fever and myalgia

Completion (gpt-4o):

The most likely diagnosis for a 32-year-old woman who recently returned from a camping trip and is experiencing low-grade fever and myalgia is Lyme disease. Lyme disease is caused by the bacterium Borrelia burgdorferi, which is transmitted to humans through the bite of infected black-legged ticks, commonly encountered in wooded or grassy areas. Other potential diagnoses could include other tick-borne illnesses such as anaplasmosis or ehrlichiosis, but Lyme disease is the most common and likely given the…
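The GPT 4o script at the end of this post only completes a short sentence; to reproduce the comparison above I used a variation along these lines (the exact system prompt and token limit here are illustrative):

import openai

client = openai.OpenAI()

vignettes = [
    "A 32-year-old woman has low-grade fever and myalgia",
    "A 32-year-old woman recently returned from a camping trip and now has low-grade fever and myalgia",
]

for vignette in vignettes:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Give the most likely diagnosis for this patient, with brief reasoning."},
            {"role": "user", "content": vignette},
        ],
        max_tokens=150,
        temperature=0.0,
    )
    print(f"Input sentence: {vignette}")
    print(f"Completion (gpt-4o): {response.choices[0].message.content}\n")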

If my favourite colour is yellow, am I more likely to have jaundice?

I tried taking the ideas of the two papers and combining them. If I tell GPT 4o that a person’s favourite colour is yellow, it (bizarrely) leaks into the output and the LLM mentions jaundice as a possible diagnosis.

Input sentence: Her favourite colour is yellow. She presented at the emergency room complaining of a high temperature, headache, being sick, and aches and pains

Completion (gpt-4o):

Based on the symptoms presented - high temperature, headache, vomiting, and body aches and pains - the most likely diagnosis is a viral infection such as influenza or another type of viral illness. However, given the mention of her favorite color being yellow, it could be a subtle hint towards jaundice, which is often associated with liver conditions such as hepatitis. If jaundice is present, hepatitis or another liver-related issue could be considered. It is important for a healthcare professional to conduct a thorough examination and possibly...

What can we do about semantic leakage and zebras?

What I found fascinating is the amount of detail given by the LLM in both cases, and how confident it sounded. A non-expert user would find it hard to spot that anything is wrong with the Lyme disease example.

The authors note that we have to be careful, as LLMs are already used in clinical settings to triage messages, draft notes, and propose differential diagnoses. An incidental mention of travel could shift LLMs towards rare but unhelpful “zebra” diagnoses.

The authors found that the zebra effect was significant across models and example input texts, and that it could be mitigated, but not eliminated, by prompts that included information about base rates.
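The paper’s exact mitigation prompts are not reproduced here, but the idea can be sketched as an extra system instruction along these lines (the wording below is mine, not the authors’):

import openai

client = openai.OpenAI()

# Illustrative base-rate reminder; the wording is my own, not taken from the paper
base_rate_instruction = (
    "When suggesting a diagnosis, weigh how common each condition is in the general "
    "population. Ignore incidental details such as travel, pets or hobbies unless they "
    "are clinically significant. Common things are common: do not chase zebras."
)

vignette = "A 32-year-old woman recently returned from a camping trip and now has low-grade fever and myalgia"

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Give the most likely diagnosis. " + base_rate_instruction},
        {"role": "user", "content": vignette},
    ],
    temperature=0.0,
)
print(response.choices[0].message.content)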

Code snippets


Generation code for GPT 3.5

import openai
import math

client = openai.OpenAI()

prompt = "He likes green. He works as a"

# The completions endpoint uses a single string 'prompt' instead of a messages list
response = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt,
    max_tokens=1,      # Predict just the very next token
    logprobs=5         # Number of logprob candidates to return
)

# Extract logprobs from the choice
# Structure: choices[0] -> logprobs -> top_logprobs (list of dicts)
top_logprobs_dict = response.choices[0].logprobs.top_logprobs[0]

print(f"Prompt: {prompt}\n")
print(f"{'Token Candidate':<18} | {'Probability (%)':<15} | {'Logprob':<10}")
print("-" * 50)

# Sort by logprob descending for display
sorted_tokens = sorted(top_logprobs_dict.items(), key=lambda x: x[1], reverse=True)

for token, logprob in sorted_tokens:
    # Convert log probability to linear probability: e^(logprob) * 100
    probability = math.exp(logprob) * 100
    print(f"{token:<18} | {probability:<15.2f}% | {logprob:<10.4f}")

Generation code for GPT 4o

import openai

client = openai.OpenAI()

# Define the prompt
sentence_start = "He likes yellow. He works as a"

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Complete the sentence naturally and briefly."},
        {"role": "user", "content": sentence_start}
    ],
    max_tokens=10,
    temperature=0.0
)

# Extract and print the result
completion = response.choices[0].message.content
print(f"Full sentence: {sentence_start} {completion}")

References

  1. Gonen, Hila, et al. “Does Liking Yellow Imply Driving a School Bus? Semantic Leakage in Language Models.” Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2025.
  2. Omar, Mahmud, et al. “Large Language Models Chase Zebras: Salient Cues Overrule Base Rates in Clinical Diagnosis.” 2025. Available at SSRN: https://ssrn.com/abstract=5988435 or http://dx.doi.org/10.2139/ssrn.5988435
