AI Technology: Natural Language Processing; Generative Language Modeling; Deep Learning.
Current Large Language Models (LLMs) are trained on massive amounts of text data drawn predominantly from a few dominant languages. Studies suggest that this over-reliance on high-resource languages, such as English, hampers the performance of LLMs in mid- and low-resource languages.
This project aims to mitigate this problem by investigating strategies for incorporating underrepresented languages into LLM training, focusing in particular on the Iberian languages.
The project team proposes to find the optimal language distribution for the training data using the DoGE domain-reweighting algorithm, adapted to a multilingual setup.
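For illustration only, the sketch below shows the gradient-alignment reweighting idea behind DoGE in a multilingual reading: languages whose loss gradients align with the gradient of a generalization objective receive larger sampling weights. The function names, hyperparameters, and toy data are our own assumptions, not the project's implementation.

import numpy as np

def update_language_weights(weights, lang_grads, target_grad, step_size=1.0):
    """One exponentiated-gradient step of DoGE-style domain reweighting.

    weights:     current sampling weights over languages, shape (L,)
    lang_grads:  per-language loss gradients (flattened), shape (L, D)
    target_grad: gradient of the generalization objective, shape (D,)
    """
    # Alignment score: inner product between each language's gradient
    # and the target (generalization) gradient.
    scores = lang_grads @ target_grad
    # Multiplicative (mirror-descent) update, then renormalize to a distribution.
    new_weights = weights * np.exp(step_size * scores)
    return new_weights / new_weights.sum()

# Toy example: 3 languages, 4-dimensional "parameter space".
rng = np.random.default_rng(0)
weights = np.ones(3) / 3
lang_grads = rng.normal(size=(3, 4))
target_grad = rng.normal(size=4)
for _ in range(10):
    weights = update_language_weights(weights, lang_grads, target_grad, step_size=0.5)
print(weights)  # languages better aligned with the target objective dominate

In the actual training loop these gradients would be recomputed at each step and the resulting weights used to sample the next batch's language mix.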
The project will study the robustness of DoGE in different scenarios, including training medium-size multilingual models with the optimised language weights and continuing the training of large models that target varying numbers of languages (including existing highly multilingual pre-trained models covering all official European languages), demonstrating that the method can be extended to languages beyond those selected in this project.
Comprehensive evaluation of the resulting models within the LM Evaluation Harness framework will allow us to assess the reliability of this structured approach.
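As a sketch of what such an evaluation run could look like, the snippet below uses the LM Evaluation Harness Python API; the checkpoint name is a hypothetical placeholder and the task list is only an example, not the project's final benchmark selection.

import lm_eval

# Evaluate a (hypothetical) multilingual checkpoint on example multilingual
# tasks shipped with the LM Evaluation Harness.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=org/iberian-llm-checkpoint",  # hypothetical checkpoint
    tasks=["xnli_es", "xnli_en"],  # example tasks; the real suite would be broader
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])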
Marta Villegas, Barcelona Supercomputing Centre, Spain