How to train your own AI: Fine tune an LLM for mental health data

· Thomas Wood

Fine tuning a large language model refers to taking a model that has already been developed, and training it on more data. It’s a way of leveraging the work that has already gone into developing the original model. Fine tuning is often used to adapt a generalist model for a more specific domain, such as mental health, legal, or healthcare, and in this case it’s also referred to as “transfer learning”.

Would you like to fine tune your own large language model (LLM) and help mental health research at the same time? Are you interested in generative AI, AI in mental health, and free and open source software? Are you interested in competing for a £500 voucher if you can train the most accurate LLM?

We are building a free online tool called Harmony to help researchers find similar questions across different psychology questionnaires. Harmony currently uses off-the-shelf LLMs from HuggingFace and OpenAI, but we would like to improve the tool with a custom-built (“fine-tuned”) LLM trained on psychology and mental health data.

The Harmony LLM training challenge is hosted on DOXA AI’s platform and anyone can participate. You don’t need previous experience with LLMs. The goal is to train a large language model that can identify how similar different psychology questionnaire items are. The winner will be whoever trains the LLM with the lowest mean absolute error (MAE). Your challenge is to develop an improved algorithm for matching psychology survey questions that produces similarity ratings more closely aligned with those given by human psychologists working in the field, and that can be integrated into the Harmony tool. The competition will run for approximately two months and finish in early January.

What is this all about?

Researchers are using Harmony to match questionnaire items and variables in datasets. For example, if social scientists in different studies have measured household income but reported it in different ways (giving the variable a different name in each study), it can be hard to match the variables together. Harmony is a tool for item and data harmonisation, and it will also help researchers discover datasets (e.g. “I want to find datasets measuring anxiety”). Matching variables in datasets has previously been a time-consuming and laborious process that may take a harmonisation committee several days. Large language models and vector embeddings are very good at handling semantic information, and Harmony will match sentences such as “I feel worried” and “Child often feels anxious” even where there are no key words in common, so a simple word matching approach (or Ctrl+F) would not match items correctly.
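
As a minimal sketch of what this looks like in code, the snippet below scores the two example items with an off-the-shelf sentence embedding model (the model name all-MiniLM-L6-v2 is an illustrative choice, not necessarily the model Harmony itself uses):

```python
# Minimal sketch of semantic matching with sentence embeddings.
# The model name is an illustrative choice, not necessarily Harmony's own model.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

item_1 = "I feel worried"
item_2 = "Child often feels anxious"

embeddings = model.encode([item_1, item_2])
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()

# The two items share no key words, so a Ctrl+F style match would fail,
# but their embeddings are close together in vector space.
print(f"Cosine similarity: {similarity:.2f}")
```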

Why train your own AI or large language model? Aren’t there enough LLMs out there already?

The mainstream large language models are generalist models which have been trained on text across domains (news, social media, books, etc). They perform well as a “jack of all trades” but often underperform in particular domains. Users of Harmony have often remarked that distinct mental health or psychology-specific concepts are mistakenly grouped together by Harmony (false positives), while items which a human would consider equivalent are mistakenly kept apart (false negatives).

As an example, see the below comment from a researcher on Harmony’s Discord channel:

Message from a user about mistakenly classified constructs in Harmony

We have found that items such as “child bullies others” and “child is often bullied”, or “I have trouble sleeping” and “I sleep very deeply” are mistakenly grouped together by Harmony, whereas to a human these are clearly distinct items and should not be conflated.

Of course, why would we expect a standard “vanilla” LLM to get this right?

A lawyer’s understanding of the word “consideration” is different from a layperson’s, and you often hear “bankruptcy” being used as a synonym for “insolvency” in everyday speech when the two are most definitely not the same thing [1]! Because language has strong nuances in different domains, you can find domain-specific large language models on open source hubs such as HuggingFace and from commercial providers for domains as diverse as medical, legal, and finance. The popularity of fine tuning also extends to entire languages: groups such as the grassroots African NLP community Masakhane have taken the initiative to fine tune versions of LLMs for underrepresented languages such as Shona, Xhosa, or Nigerian Pidgin English [4].

Above: HuggingFace Hub had 605 legal models at the time of writing this post

Domain | Number of large language models fine-tuned for this domain on HuggingFace (October 2024)
Medical | 1379
Legal | 605
Finance | 478
Education | 50
Psychology + mental health | 27

You could conjecture that the mental health domain lags far behind the other domains I checked, and it wouldn’t do any harm to have a purpose-built mental health/psychology LLM on HuggingFace Hub. In fact, if you win the challenge, yours could be the next fine-tuned LLM to appear on HuggingFace and increase the total to 28!

Fine tune your own model

Want to train your own AI/large language model in the Harmony competition? Sign up and submit your model on DOXA AI’s website.

Why not just compare similar words?

You can get some traction by using the traditional “bag of words” method of comparing items. In fact, I have evaluated this approach head-to-head against LLMs in this blog post. Unfortunately, matching similar words misses synonyms such as “sad” vs “depressed”, and falls over completely if the texts to be compared are in different languages.
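
A rough illustration of the limitation, using scikit-learn’s CountVectorizer (the example sentences are made up):

```python
# Rough illustration of why word-overlap methods miss synonyms.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = ["I often feel sad", "I often feel depressed"]
vectors = CountVectorizer().fit_transform(texts)

# "sad" and "depressed" are different tokens, so the score is driven
# only by the shared filler words, not by the shared meaning.
print(cosine_similarity(vectors[0], vectors[1]))
```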

Don’t large language models have a huge carbon footprint?

You may have read about the colossal carbon emissions that ChatGPT and the like are responsible for. Big tech companies burn through the electricity needs of a small country to train a state-of-the-art generative AI model, and the future rollout of AI-powered search engines could consume 6.9–8.9 Wh per request [2]: roughly enough electricity to run an LED light bulb for an hour, so 24 AI-powered Google queries could come close to lighting a room for a day.

So is the competition to fine tune Harmony going to burn through colossal amounts of energy?

Fine tuning a large language model

Fortunately, there’s another way to do it, which is fine tuning. Most of the work has already been done in developing a large language model. LLMs, like other neural networks, consist of a series of layers which turn a messy unstructured input into something more manageable.

When we fine tune a large language model, we keep the main parts of it constant, and we adjust the weights of one or two layers at the end of the neural network to better fit our domain. There are a number of tutorials on HuggingFace showing how you can fine tune a text classifier starting from an existing model. In fact, you can fine tune an LLM on the Harmony DOXA AI challenge data in under half an hour on a Google Colab instance. I recommend reading this post on Datacamp for an explanation of the ins and outs of fine-tuning.
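
As a hedged sketch of what such a fine-tuning run might look like with the sentence-transformers library (the file name, base model and hyperparameters below are illustrative assumptions; the DOXA AI notebooks show the recommended workflow):

```python
# Minimal fine-tuning sketch using sentence-transformers (assumed library choice).
# File name, base model and hyperparameters are illustrative, not the official recipe.
import pandas as pd
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

df = pd.read_csv("train.csv")  # columns: sentence_1, sentence_2, human_similarity

# CosineSimilarityLoss expects labels in [0, 1], so rescale the 0-100 ratings.
examples = [
    InputExample(texts=[row.sentence_1, row.sentence_2],
                 label=row.human_similarity / 100.0)
    for row in df.itertuples()
]

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
train_dataloader = DataLoader(examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

# A single epoch trains in minutes on a free Colab GPU.
model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1, warmup_steps=100)

model.save("harmony-finetuned-model")
```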

Alternatives to fine-tuning a large language model

As mentioned above, the mass-market LLMs are very generalist tools. Fine tuning is a useful way to steer an LLM towards a particular domain, such as the medical, legal, finance, or psychology domains mentioned above. Fine tuning is also referred to as “transfer learning”, particularly when a model that has been trained on one domain is re-purposed for a different domain with data from the new domain.

Alternatives to fine-tuning your LLM include:

  • Using the original mass-market large language model with no modifications
  • Using the large language model as-is, and compensating by some other metric (for example, Harmony has a built-in negation function to handle antonyms, as a correction factor for how the LLMs seem to overestimate similarities of antonyms)
  • Retrieval-augmented generation (RAG): a prompt in a domain (such as legal) can be combined with some domain-specific knowledge to nudge the LLM in the direction of the desired response. For example, we have worked on a RAG-based chatbot (the Insolvency Bot) which can answer questions about English insolvency law by sending GPT a prompt consisting of a user’s query and the relevant sections of the Insolvency Act, concatenated along the lines of “The Insolvency Act Section XXX says YYY. The user query is ZZZ. Answer the query, taking into account Insolvency Act Section XXX” (see the sketch after this list).
  • Training an entire LLM from scratch - and here we would be back to the huge electricity bills!
  • Using a simpler language model such as “bag of words” or a Naive Bayes classifier.
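
To make the RAG option above concrete, here is a hypothetical sketch of the prompt concatenation. The function name, template wording and the retrieval step are simplified illustrations rather than the actual Insolvency Bot code:

```python
# Hypothetical sketch of the prompt concatenation described above.
# The template and the retrieval step are simplified illustrations,
# not the actual Insolvency Bot implementation.
def build_rag_prompt(user_query: str, retrieved_sections: dict) -> str:
    """Combine retrieved domain knowledge with the user's query."""
    context = "\n".join(
        f"The Insolvency Act Section {section} says: {text}"
        for section, text in retrieved_sections.items()
    )
    section_list = ", ".join(retrieved_sections)
    return (
        f"{context}\n"
        f"The user query is: {user_query}\n"
        f"Answer the query, taking into account Insolvency Act Section {section_list}."
    )

prompt = build_rag_prompt(
    "Can a director be held liable for trading while insolvent?",
    {"214": "…text of the section retrieved from a vector index…"},
)
# The prompt would then be sent to an LLM such as GPT via its chat API.
```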

How can I get started?

To get started, visit the competition on DOXA AI’s website and try the tutorial notebook. One notebook provides an example submission with an off-the-shelf LLM, and another shows how to fine-tune an LLM.

What about data?

We have provided a training dataset, consisting of pairs of English-language sentences (sentence_1 and sentence_2), as well as a similarity rating between zero and one hundred (human_similarity) that is based on responses to a survey of human psychologists run by the Harmony project. The training data comes from a large number of human annotators who have marked their perception of which question items are equivalent.
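
As an illustration, an off-the-shelf embedding model can be used to produce baseline predictions for these pairs (the file name train.csv and the model are assumptions; check the DOXA AI competition page for the exact file names and submission format):

```python
# Hedged sketch: score pairs with an off-the-shelf model as a baseline.
# "train.csv" and the model name are assumptions, not the official setup.
import pandas as pd
from sentence_transformers import SentenceTransformer, util

df = pd.read_csv("train.csv")  # sentence_1, sentence_2, human_similarity (0-100)

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
emb_1 = model.encode(df["sentence_1"].tolist(), convert_to_tensor=True)
emb_2 = model.encode(df["sentence_2"].tolist(), convert_to_tensor=True)

# Cosine similarity lies in [-1, 1]; clip and rescale to the 0-100 rating scale.
cos = util.cos_sim(emb_1, emb_2).diagonal().clamp(min=0)
df["predicted_similarity"] = (cos * 100).cpu().numpy()
```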

How will my model be evaluated?

You can follow the instructions on DOXA AI to submit your model and accompanying inference code. DOXA AI’s servers will execute your model and calculate the mean absolute error that your model scored, both on the training set and the unseen test data.
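
The official score is computed on DOXA AI’s servers, but you can estimate your performance locally. Continuing from the baseline sketch above, MAE on the training set can be computed like this:

```python
# Computing the competition metric (mean absolute error) locally,
# here against the training labels from the sketch above.
from sklearn.metrics import mean_absolute_error

mae = mean_absolute_error(df["human_similarity"], df["predicted_similarity"])
print(f"Training-set MAE: {mae:.1f}")
```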

What’s the prize?

The contestant who submits the most accurate LLM according to our evaluation metric (MAE, or mean absolute error) on the unseen test data will win £500 in vouchers and there’s also a second prize of £250.

What’s the benefit to the project?

We hope that the result of this competition will be that Harmony will be more robust and better suited to mental health data, allowing researchers to harmonise and discover datasets more efficiently.

Harmony is a project in which we are developing a free, open-source online tool that uses natural language processing to help researchers make better use of existing data by supporting the harmonisation of measures and items used in different studies, as well as data discovery [3]. Harmony is run as a collaboration between Ulster University, University College London, the Universidade Federal de Santa Maria, and Fast Data Science. Harmony has been funded by the Economic and Social Research Council (ESRC) and by Wellcome as part of the Wellcome Data Prize in Mental Health.

References

  1. Ribary, Marton, et al. “Prompt Engineering and Provision of Context in Domain Specific Use of GPT.” Legal Knowledge and Information Systems, IOS Press, 2023, pp. 305–310.

  2. de Vries, Alex. “The growing energy footprint of artificial intelligence.” Joule 7.10 (2023): 2191–2194.

  3. McElroy, Wood, Bond, Mulvenna, Shevlin, Ploubidis, Scopel Hoffmann, Moltrecht. “Using natural language processing to facilitate the harmonisation of mental health questionnaires: a validation study using real-world data.” BMC Psychiatry 24, 530 (2024). https://doi.org/10.1186/s12888-024-05954-2

  4. Ravindran, Sandeep. “Frustrated that AI tools rarely understand their native languages, thousands of African volunteers are taking action.” Science, 2023. https://doi.org/10.1126/science.adj8519
