The European High Performance Computing Joint Undertaking (EuroHPC JU)

Towards Linguistic Diversity: Optimizing Data Distribution for Multilingual NLP Inclusivity

Awarded Resources (in node hours): 200000
System Partition: Marenostrum 5 ACCC
Allocation Period: March 2025 - February 2026

The proposed project aims to advance multilingual Natural Language Processing (NLP) by optimising the distribution of pre-training data, thereby improving cross-lingual transfer capabilities and promoting the inclusion of mid- and low-resource languages. 

The project addresses critical gaps in current NLP research, in particular the dominance of high-resource languages and the neglect of mid- and low-resource ones. By strategically allocating resources and identifying interactions between languages, the project aims to promote linguistic diversity and inclusivity in the field. Crucially, it also investigates the scalability of data distribution methods, analysing how they perform as model and data size grow.

Using High Performance Computing (HPC), this research will efficiently analyse large datasets and model complexity, providing insights into the optimal data distribution for multilingual Large Language Models (LLMs). Expected outcomes include improved cross-lingual transfer capabilities, more efficient resource allocation, and a principled, experimentally grounded approach to multilingual NLP research.
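
To make "data distribution" concrete, the sketch below shows one widely used baseline for setting per-language sampling rates in multilingual pre-training: exponent-smoothed sampling, where each language is sampled with probability proportional to its corpus size raised to a power alpha. The abstract does not say which methods the project studies, so this is an illustrative assumption only; the function name, the alpha value, and the token counts are all hypothetical.

```python
def smoothed_sampling_probs(token_counts, alpha=0.5):
    """Per-language sampling probabilities with p_i proportional to n_i ** alpha.

    alpha = 1.0 reproduces the raw corpus proportions; alpha = 0.0 samples all
    languages uniformly. Values in between upsample smaller languages.
    """
    if not token_counts:
        raise ValueError("token_counts must not be empty")
    if any(n <= 0 for n in token_counts.values()):
        raise ValueError("every language needs a positive token count")

    # Smooth the raw counts, then normalise so the probabilities sum to 1.
    weights = {lang: n ** alpha for lang, n in token_counts.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}


# Hypothetical token counts for a high-, mid-, and low-resource language.
counts = {"en": 1_000_000, "ca": 10_000, "mt": 100}

proportional = smoothed_sampling_probs(counts, alpha=1.0)
smoothed = smoothed_sampling_probs(counts, alpha=0.5)
uniform = smoothed_sampling_probs(counts, alpha=0.0)

for lang in counts:
    print(f"{lang}: raw={proportional[lang]:.4f} "
          f"smoothed={smoothed[lang]:.4f} uniform={uniform[lang]:.4f}")
```

Lowering alpha shifts sampling mass from the high-resource language towards the mid- and low-resource ones, at the cost of repeating their smaller corpora more often. How such trade-offs behave as model and data size grow is the kind of question the project describes.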