The European High Performance Computing Joint Undertaking (EuroHPC JU)

Generation of instruction datasets for Iberian languages

Awarded Resources (in node hours): 32000
System Partition: MareNostrum 5 ACC
Allocation Period: May 2025 - May 2026

AI Technology: Natural Language Processing; Generative Language Modeling; Deep Learning.

This proposal addresses the scarcity of data for domain-specific fine-tuning of Large Language Models (LLMs) in languages other than English.

The approach involves an end-to-end pipeline where state-of-the-art models generate diverse, task-specific datasets designed to enhance specific model capabilities.

Each dataset will be carefully validated through bias filtering, similarity checks, and quality assessments by an LLM judge. The effectiveness of the generated data will be evaluated through downstream task performance of fine-tuned models. 
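
The validation stage described above can be illustrated with a minimal sketch. The proposal does not specify how any of the three checks is implemented, so everything here is an assumption made for illustration: the bias filter is reduced to a word blocklist, the similarity check to Jaccard overlap of word trigrams, and the LLM judge to a caller-supplied scoring function. The names `shingles`, `jaccard`, `validate`, `max_similarity`, and `min_score` are hypothetical.

```python
from typing import Callable, Iterable


def shingles(text: str, n: int = 3) -> set:
    """Return the set of lowercased word n-grams in `text`."""
    words = text.lower().split()
    if len(words) < n:
        # Short texts yield a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def validate(
    samples: Iterable[str],
    blocklist: set,
    judge: Callable[[str], float],
    max_similarity: float = 0.8,  # assumed threshold, not from the proposal
    min_score: float = 0.5,       # assumed threshold, not from the proposal
) -> list:
    """Keep samples that pass all three (simplified) checks, in input order."""
    kept = []
    kept_shingles = []
    for sample in samples:
        # 1. Bias filter (stand-in): drop samples containing a blocked word.
        if set(sample.lower().split()) & blocklist:
            continue
        # 2. Similarity check: drop near-duplicates of already kept samples.
        current = shingles(sample)
        if any(jaccard(current, prev) >= max_similarity for prev in kept_shingles):
            continue
        # 3. Quality check: `judge` stands in for an LLM judge returning a score.
        if judge(sample) < min_score:
            continue
        kept.append(sample)
        kept_shingles.append(current)
    return kept
```

In a real pipeline the `judge` callable would wrap a call to a judge model, and the pairwise comparison, which is quadratic in the number of kept samples, would likely be replaced by an approximate method such as MinHash.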

The resulting synthesized data will be released under open-source licenses, promoting the development of LLMs in underrepresented languages.