Large language models (LLMs) are at the core of the current AI revolution and have laid the groundwork for tremendous advancements in Natural Language Processing. Building LLMs requires huge amounts of data, which are not available for low-resource languages.
As a result, LLMs shine in high-resource languages like English but lag behind in many others, especially those where training resources are scarce, including many regional languages in Europe. The data scarcity problem is usually alleviated by augmenting the training corpora in the target language with text from a high-resource language (e.g. English). In this project we propose a systematic study of different strategies for performing this combination in an optimal way, framing the existing approaches within a more general curriculum learning paradigm.
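As a rough illustration of the curriculum learning idea, the mixing ratio between high-resource and target-language text can be treated as a schedule over training rather than a fixed constant. The sketch below is a minimal, hypothetical example (the linear schedule, function names, and default ratios are assumptions for illustration, not the project's actual method):

```python
import random

def high_resource_ratio(step, total_steps, start=0.9, end=0.1):
    """Illustrative linear curriculum: anneal the fraction of
    high-resource (e.g. English) examples per batch, shifting
    weight toward the target language as training progresses.
    The schedule shape and endpoints are assumptions."""
    frac = min(step / total_steps, 1.0)
    return start + (end - start) * frac

def sample_batch(step, total_steps, en_corpus, tgt_corpus, batch_size, rng):
    """Draw a batch mixing the two corpora according to the
    current curriculum ratio."""
    ratio = high_resource_ratio(step, total_steps)
    return [
        rng.choice(en_corpus) if rng.random() < ratio else rng.choice(tgt_corpus)
        for _ in range(batch_size)
    ]

rng = random.Random(0)
batch = sample_batch(0, 1000, ["en_doc"], ["eu_doc"], 8, rng)
```

A fixed-ratio mixture is the special case where `start == end`; other schedules (staged, exponential) fit the same template, which is one way the existing augmentation approaches can be framed as points in a curriculum space.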
This project will use the computational resources of EuroHPC to perform a systematic study and scale up experiments to build LLMs for four low-resource European languages. The results of the project will help foster NLP applications in these languages and close the existing gap between minority languages and English.
Aitor Soroa, University of the Basque Country - UPV/EHU, Spain