AI Technology: Generative Language Modeling
The development of capable multilingual Large Language Models (LLMs) typically involves resource-intensive stages of pretraining, instruction tuning, and alignment. High-quality instruction and preference datasets are essential for effective alignment, yet their creation demands substantial human labor for each target language, posing a significant barrier to the inclusivity and democratization of AI, especially for languages beyond English. While some models, such as Llama-3, offer open weights, their instruction and alignment data remain proprietary, further exacerbating this challenge.
Existing open-source instruction and preference datasets predominantly cater to English, requiring costly and time-consuming translation and localization efforts for broader applicability. This project will explore a novel, scalable, and cost-effective approach to instruction tuning and alignment of existing LLMs for new languages. By leveraging readily available raw text in the target languages alongside existing English instruction and preference data, the project's methodological approach circumvents the need for expensive, language-specific dataset creation. Specifically, the project will investigate the efficacy of a joint pretraining and alignment strategy, in which the LLM is simultaneously trained on raw target-language data and English instruction data, aiming to dramatically reduce the resource barrier to multilingual LLM development.
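To make the joint objective concrete, the following is a minimal sketch assuming a standard PyTorch/Hugging Face setup. The base model name, the mixing weight mix, the prompt-masking scheme, and the toy data are illustrative assumptions rather than the project's actual recipe, and the preference-alignment loss is omitted for brevity; the sketch only shows how one optimizer step can combine a causal language-modeling loss on raw target-language text with a supervised instruction-tuning loss on English data.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical base model; the proposal does not fix one.
model_name = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def lm_batch(texts):
    """Raw target-language text -> standard causal LM loss on every token."""
    enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    labels = enc["input_ids"].clone()
    labels[enc["attention_mask"] == 0] = -100  # ignore padding in the loss
    enc["labels"] = labels
    return enc

def instruction_batch(pairs):
    """English (prompt, response) pairs -> loss on response tokens only."""
    ids, labels = [], []
    for prompt, response in pairs:
        p = tokenizer(prompt, add_special_tokens=False)["input_ids"]
        r = tokenizer(response + tokenizer.eos_token, add_special_tokens=False)["input_ids"]
        ids.append(torch.tensor(p + r))
        labels.append(torch.tensor([-100] * len(p) + r))  # mask the prompt
    pad = torch.nn.utils.rnn.pad_sequence
    return {
        "input_ids": pad(ids, batch_first=True, padding_value=tokenizer.pad_token_id),
        "attention_mask": pad([torch.ones_like(i) for i in ids], batch_first=True, padding_value=0),
        "labels": pad(labels, batch_first=True, padding_value=-100),
    }

# Toy stand-ins for the two data streams described above.
target_lang_texts = ["Euskara Europako hizkuntza zaharrenetako bat da."]
english_instructions = [("Name one official language of Spain.", "Basque.")]

mix = 0.5  # assumed mixing weight between the two objectives
for step in range(1000):  # illustrative step count
    loss_lm = model(**lm_batch(target_lang_texts)).loss              # continued pretraining
    loss_it = model(**instruction_batch(english_instructions)).loss  # English instruction tuning
    loss = mix * loss_lm + (1 - mix) * loss_it                       # joint objective
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

In practice the two streams would be drawn from full data loaders and the mixing weight tuned per language; the key point illustrated is that no target-language instruction or preference data is required, only raw text.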
German Rigau, University of the Basque Country (UPV/EHU), Spain