Harmonising Unstructured Text Data with NLP in Psychology (Harmony project)

· Thomas Wood

We have been developing a tool that uses natural language processing (NLP) to help researchers in the social sciences harmonise datasets from different contexts. The tool is part of a wider project called Harmony, which is our entry to the Wellcome Mental Health Data Prize, made together with the Centre for Longitudinal Studies at UCL, Ulster University and Universidade Federal de Santa Maria in Brazil.

The Research Question

The Harmony project is focused on a research question:

How does social connection impact anxiety and depression in young people in different countries?

We have focused on two very different contexts: the UK and Brazil. We have explored numerical measures of social connectedness that can be captured in surveys and questionnaires.

Young Expert Involvement

The Harmony researchers ran a set of sessions with young people in both countries in order to gather some qualitative data on individual experiences.

In Brazil, our psychologist interviewed six people aged between 13 and 18 who were in treatment for anxiety and depression, and asked them about their concept of social connection and how it relates to anxiety and depression.

Some differences emerged from these sessions. For example, British young people mentioned bullying as a major factor, while Brazilian participants mentioned not feeling judged.

Comparing UK and Brazil data

We were able to work with datasets from both the UK and Brazil.

These datasets contain variables and data points which may be presented in different ways. If we want to conduct a meta-analysis (that is, compare the relationship between social connection, anxiety and depression in both countries), we first need to identify which variables are available in each dataset, which of them the two datasets have in common, and how the information in those variables can be compared.

For example, if one study has measured anxiety using the GAD-7 and another has used Beck’s Anxiety Inventory, there would typically be a manual harmonisation process of identifying questionnaire items which are equivalent to one another.

The solution

We had the idea of representing each questionnaire item as a vector on the surface of a multi-dimensional sphere. Items which are semantically similar would be close together and have a cosine similarity close to 1, whereas items which are completely different tend to have a similarity close to 0.
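The geometry described above can be illustrated with a minimal sketch. The three-dimensional vectors below are invented for illustration only; real item vectors would come from a language model and have many more dimensions. Scaling each vector to length 1 places it on the surface of a unit sphere, and the cosine similarity then depends only on the angle between two items.

```python
import math


def normalise(v):
    """Scale a vector to length 1 so it lies on the surface of the unit sphere."""
    norm = math.sqrt(sum(a * a for a in v))
    if norm == 0:
        raise ValueError("cannot normalise a zero vector")
    return [a / norm for a in v]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors, invented for illustration (not produced by any model).
item_a = normalise([0.9, 0.1, 0.0])  # e.g. an item about feeling nervous
item_b = normalise([0.8, 0.2, 0.1])  # a similarly worded item
item_c = normalise([0.0, 0.0, 1.0])  # an unrelated item

sim_ab = cosine_similarity(item_a, item_b)  # close to 1
sim_ac = cosine_similarity(item_a, item_c)  # 0 for these orthogonal vectors

print(f"similar items:   {sim_ab:.3f}")
print(f"unrelated items: {sim_ac:.3f}")
```

Because the vectors are already normalised, the cosine similarity here is simply their dot product; the function divides by the norms anyway so that it also works on unnormalised input.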

We have used the deep learning model GPT-2 to convert texts in different languages into their vector representations. We have wrapped this in a web front-end to make a web-based tool called Harmony. You can try it online at https://harmonydata.ac.uk/app.
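Once every item has a vector, matching items across two questionnaires reduces to finding, for each item in one questionnaire, the most similar item in the other. The sketch below shows that matching step only and is not Harmony's actual code. The function `toy_embed` is a hypothetical stand-in that counts words instead of calling a language model, the threshold of 0.3 is an arbitrary assumption, and the item wordings are illustrative rather than quoted from any instrument.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for a language-model embedding: plain word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(u, v):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_matches(items_a, items_b, threshold=0.3):
    """For each item in items_a, keep its most similar item in items_b
    if the similarity reaches the (assumed) threshold."""
    matches = []
    for a in items_a:
        scored = [(cosine_similarity(toy_embed(a), toy_embed(b)), b) for b in items_b]
        score, b = max(scored)
        if score >= threshold:
            matches.append((a, b, score))
    return matches


# Illustrative item wordings, not quoted from any specific questionnaire.
questionnaire_a = ["Feeling nervous, anxious or on edge", "Trouble relaxing"]
questionnaire_b = ["Nervous", "Unable to relax", "Hands trembling"]

matches = best_matches(questionnaire_a, questionnaire_b)
for a, b, score in matches:
    print(f"{a!r} <-> {b!r} (similarity {score:.2f})")
```

With word counts, "Trouble relaxing" and "Unable to relax" share no tokens and so go unmatched even though they mean much the same thing. Pairs like this, along with items written in different languages, are the reason for using a language model's vectors rather than surface word overlap.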

Partnerships

We have also developed Harmony in partnership with DATAMIND and the Catalogue of Mental Health Measures, which are widely used resources in psychology research, and have taken on board their feedback on how to improve the tool.

You can read about Harmony and how it works on the Harmony blog.

References

  1. Radford, Alec, et al. “Language models are unsupervised multitask learners.” OpenAI blog 1.8 (2019): 9.

  2. Salum, Giovanni Abrahão. “High Risk Cohort Study for Psychiatric Disorders in Childhood.”

  3. Smith, Kate, and Heather Joshi. “The millennium cohort study.” Population Trends (2002): 30-34.
