Generation of instruction datasets for Iberian languages

Awarded Resources (in node hours): 32000

System Partition: MareNostrum 5 ACC

Allocation Period: May 2025 - May 2026

AI Technology: Natural Language Processing; Generative Language Modeling; Deep Learning.

This proposal addresses the scarcity of data for domain-specific fine-tuning of Large Language Models (LLMs) in languages other than English.

The approach is an end-to-end pipeline in which state-of-the-art models generate diverse, task-specific datasets, each designed to enhance a specific model capability.

Each dataset will be carefully validated through bias filtering, similarity checks, and quality assessments by an LLM judge. The effectiveness of the generated data will be evaluated through downstream task performance of fine-tuned models.
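As a rough illustration of how the three validation steps could be chained, the following is a minimal sketch and not the project's actual pipeline. Everything in it is an assumption made for illustration: the function names (`word_shingles`, `jaccard`, `validate`), the term blocklist standing in for bias filtering, word 3-gram Jaccard overlap standing in for the similarity check, the thresholds, and the `judge` callable, which stands in for an LLM judge returning a quality score between 0 and 1.

```python
from __future__ import annotations

from typing import Callable, Iterable


def word_shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of lowercased word n-grams in `text`."""
    words = text.lower().split()
    if len(words) < n:
        # Short texts are represented by a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def validate(
    samples: Iterable[str],
    judge: Callable[[str], float],
    blocked_terms: frozenset[str] = frozenset(),
    max_similarity: float = 0.8,
    min_score: float = 0.5,
) -> list[str]:
    """Keep samples that pass all three (hypothetical) checks, in order."""
    kept: list[str] = []
    kept_shingles: list[set[tuple[str, ...]]] = []
    for sample in samples:
        # 1. Bias filter (assumed here to be a simple term blocklist).
        if set(sample.lower().split()) & blocked_terms:
            continue
        # 2. Similarity check against every sample already accepted.
        shingles = word_shingles(sample)
        if any(jaccard(shingles, other) >= max_similarity for other in kept_shingles):
            continue
        # 3. Quality assessment by the judge (a stand-in for an LLM call).
        if judge(sample) < min_score:
            continue
        kept.append(sample)
        kept_shingles.append(shingles)
    return kept
```

In this sketch the cheap checks run first, so the judge, which would be the most expensive step if backed by an LLM, is only called on samples that survive the other two filters.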

The resulting synthesized data will be released under open-source licenses, promoting the development of LLMs in underrepresented languages.


Principal Investigator, Research Team Institution & Country