NLP on under-resourced languages

Published · Updated · Thomas Wood

“Thinking too much”

I have been working on the development of Harmony, a tool to help psychology researchers harmonise questionnaire items in plain text across languages so that they can combine datasets from disparate sources. One of the challenges put to us by Wellcome, the funders of the mental health data prize research grant for Harmony, was this: how well does Harmony handle culture-specific concepts? Psychology has the notion of “cultural concepts of distress”: the idea that some mental health disorders manifest themselves in ways particular to a given culture.

Shona, or chiShona, is spoken mainly in Zimbabwe and belongs to the Bantu language family, along with Swahili, Zulu and Xhosa. An example of a “cultural concept of distress” is the Shona word “kufungisisa”, which can be translated as “thinking too much”.

Kufungisisa is derived from the verb stem -funga, to think, as follows:

Shona          English
-funga         think
kufunga        to think
ndofunga       I think
-isa           (causative suffix: “to cause to do”)
-isisa         (intensive suffix: “to do quickly”)
kufungisisa    think deeply, think too much; a Shona idiom for non-psychotic mental illness

Other examples of cultural concepts of distress include hikikomori (Japanese: ひきこもり or 引きこもり), a form of severe social withdrawal where a person refuses to leave their parents’ house, does not work or go to school, and isolates themselves from society and family in a single room.

In order to see if we could match this kind of item using semantics and document vector embeddings, I had to look for a trained language model which could handle text in Shona. Luckily, there has been a project to train large language models in a number of African languages, and I was able to pass my Shona text through the model xlm-roberta-base-finetuned-shona trained by David Adelani at Google DeepMind and UCL. I found that the model was reasonably good at matching monolingual Shona text, but could not match mixed English and Shona text.
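As a rough illustration of what matching items with document vector embeddings can look like, here is a minimal sketch. It is not the code from my notebook. It assumes the model is available on the Hugging Face Hub under the ID `Davlan/xlm-roberta-base-finetuned-shona`, that mean pooling over token vectors is an acceptable way to get one vector per text, and that cosine similarity is the matching score. The helper names (`mean_pool`, `rank_matches`, `load_shona_embedder`) are made up for this example.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def mean_pool(token_vectors: Sequence[Sequence[float]],
              attention_mask: Sequence[int]) -> Vector:
    """Average the token vectors, skipping positions whose mask value is 0."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask excludes every token")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_matches(query: str,
                 candidates: Sequence[str],
                 embed: Callable[[str], Vector]) -> List[Tuple[str, float]]:
    """Score every candidate against the query, best match first."""
    query_vec = embed(query)
    scored = [(text, cosine_similarity(query_vec, embed(text)))
              for text in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def load_shona_embedder(
    model_name: str = "Davlan/xlm-roberta-base-finetuned-shona",  # assumed Hub ID
) -> Callable[[str], Vector]:
    """Return a text -> vector function backed by the Shona model.

    torch and transformers are imported here so that the helpers above
    stay usable without them.
    """
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    def embed(text: str) -> Vector:
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state[0]  # (tokens, hidden_size)
        return mean_pool(hidden.tolist(), inputs["attention_mask"][0].tolist())

    return embed
```

With the model downloaded, `rank_matches("kufungisisa", items, load_shona_embedder())` would order a list of questionnaire items by similarity to the query. Other pooling strategies, such as taking the first token's vector, are equally plausible choices here.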

The Shona model that I found was developed as part of a paper by Alabi et al., who developed LLMs for Amharic, Hausa, Igbo, Malagasy, Chichewa, Oromo, Naija (Nigerian Pidgin English), Kinyarwanda, Kirundi, Shona, Somali, Sesotho, Swahili, isiXhosa (Xhosa), Yoruba, and isiZulu (Zulu), as well as afro-xlmr-large, which covers 17 languages.

In particular, to cope with the scarcity of resources for certain languages, the researchers used language adaptive fine-tuning (LAFT), which involves taking an existing multilingual language model and fine-tuning it for the target language.
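To make the idea of LAFT more concrete, here is a hedged sketch of what continuing to train a multilingual model on monolingual text could look like with the Hugging Face `transformers` and `datasets` libraries. It is not the authors' training script. It assumes that the fine-tuning objective is masked language modelling (the objective XLM-R was originally pre-trained with), and the corpus path, output directory and all hyperparameters below are illustrative placeholders rather than values from the paper.

```python
from dataclasses import dataclass


@dataclass
class LaftConfig:
    """Illustrative settings only; none of these values come from the paper."""
    base_model: str = "xlm-roberta-base"
    text_path: str = "shona_corpus.txt"   # hypothetical file, one passage per line
    output_dir: str = "xlmr-laft-shona"   # hypothetical output directory
    mlm_probability: float = 0.15         # conventional masking rate, an assumption
    max_length: int = 256
    epochs: int = 3
    batch_size: int = 8
    learning_rate: float = 5e-5


def run_laft(config: LaftConfig) -> None:
    """Continue masked-language-model training on target-language text."""
    # Heavy dependencies are imported here so the config can be inspected without them.
    from datasets import load_dataset
    from transformers import (
        AutoModelForMaskedLM,
        AutoTokenizer,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    tokenizer = AutoTokenizer.from_pretrained(config.base_model)
    model = AutoModelForMaskedLM.from_pretrained(config.base_model)

    # Load the plain-text corpus: each line becomes one training example.
    raw = load_dataset("text", data_files={"train": config.text_path})

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True,
                         max_length=config.max_length)

    tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

    # The collator pads each batch and randomly masks tokens for the MLM loss.
    collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=True, mlm_probability=config.mlm_probability
    )
    args = TrainingArguments(
        output_dir=config.output_dir,
        num_train_epochs=config.epochs,
        per_device_train_batch_size=config.batch_size,
        learning_rate=config.learning_rate,
    )
    trainer = Trainer(model=model, args=args,
                      train_dataset=tokenized["train"], data_collator=collator)
    trainer.train()

    # Save the adapted model and tokenizer together so they can be reloaded later.
    trainer.save_model(config.output_dir)
    tokenizer.save_pretrained(config.output_dir)
```

Calling `run_laft(LaftConfig(text_path="my_corpus.txt"))` would start training on your own corpus. In practice the amount of monolingual text and the number of training steps matter far more than the exact values shown here.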

You can read a write-up of my experiments with the Shona model here, and you can download my code in a Jupyter notebook here.

I would be curious to find out how well culture-specific concepts can be represented by embeddings, but I do not have a definitive answer yet, as multilingual LLMs are still in their early stages.
