The European High Performance Computing Joint Undertaking (EuroHPC JU)

Scaling language models for low-resource languages

Awarded Resources: 428,904 node hours
System Partition: Leonardo Booster
Allocation Period: 1 May 2024 - 30 April 2025

Large language models (LLMs) are at the core of the current AI revolution and have laid the groundwork for tremendous advances in Natural Language Processing (NLP).

Building LLMs requires enormous resources, both in compute and in data, and only a handful of private companies can muster the computational power needed to train them. As a result, LLMs excel in high-resource languages such as English but lag behind in many others, especially those where training data is scarce, including many regional languages of Europe.

Several approaches have been proposed in the literature for adapting pre-trained LLMs to new languages, but past efforts have focused on models of relatively small size. In this project, we propose to use the computational resources of the EuroHPC supercomputer Leonardo to scale up these experiments and build very large models for low-resource European languages.
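One common step when adapting a pre-trained LLM to a new language is extending its vocabulary with target-language tokens and initialising their embeddings, often as the mean of the existing ones. The following is a minimal, purely illustrative sketch of that initialisation heuristic; the function name and toy dimensions are assumptions, not the project's actual code.

```python
# Hypothetical sketch: extend an embedding table with rows for new
# target-language tokens, initialised to the mean of the existing
# embeddings (a common heuristic for vocabulary extension).

def extend_embeddings(embeddings, n_new_tokens):
    """Append n_new_tokens rows, each set to the mean embedding."""
    dim = len(embeddings[0])
    mean = [sum(row[d] for row in embeddings) / len(embeddings)
            for d in range(dim)]
    return embeddings + [list(mean) for _ in range(n_new_tokens)]

# Toy example: a 3-token vocabulary with 2-dimensional embeddings,
# extended with 2 new target-language tokens.
emb = [[1.0, 0.0], [0.0, 1.0], [2.0, 3.0]]
extended = extend_embeddings(emb, 2)
```

In practice this would operate on the model's actual embedding matrix (and the tokenizer would be retrained or extended accordingly), but the initialisation idea is the same.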

By varying the compute and data scale, we will analyze whether the models exhibit emergent capabilities that allow them to be easily adapted to many tasks. The results of the project will help foster NLP applications in these languages and close the existing gap between minority languages and English.
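To give a rough sense of what "varying the compute and data scale" involves, the sketch below uses the standard back-of-the-envelope approximation C ≈ 6·N·D training FLOPs (N parameters, D tokens) together with the roughly compute-optimal ratio of ~20 tokens per parameter suggested by the Chinchilla scaling study. The model sizes are illustrative assumptions, not the project's planned configurations.

```python
# Back-of-the-envelope training-compute estimates, assuming the
# common approximation C ~= 6 * N * D FLOPs and a compute-optimal
# token budget of roughly 20 tokens per parameter.

def training_flops(n_params, n_tokens):
    """Approximate total training compute in FLOPs."""
    return 6 * n_params * n_tokens

def compute_optimal_tokens(n_params):
    """Roughly compute-optimal number of training tokens."""
    return 20 * n_params

# Illustrative model sizes only.
for n in (1e9, 7e9, 70e9):
    d = compute_optimal_tokens(n)
    print(f"{n / 1e9:.0f}B params -> {d / 1e9:.0f}B tokens, "
          f"{training_flops(n, d):.2e} FLOPs")
```

Even under this crude estimate, the compute grows quadratically with model size at the compute-optimal point, which is why experiments at this scale require an allocation on a system like Leonardo Booster.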