AI Technology: Natural Language Processing; Generative Language Modeling; Deep Learning.
This proposal addresses the scarcity of data for domain-specific fine-tuning of Large Language Models (LLMs) in languages other than English.
The approach is an end-to-end pipeline in which state-of-the-art models generate diverse, task-specific datasets designed to enhance targeted model capabilities.
Each dataset will be carefully validated through bias filtering, similarity checks, and quality assessments by an LLM judge. The effectiveness of the generated data will be evaluated through downstream task performance of fine-tuned models.
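The validation stage described above could be sketched as a simple filtering loop. This is a minimal illustration, not the proposal's implementation: the similarity check is approximated here with Python's standard-library `SequenceMatcher`, and the LLM judge is a hypothetical callable returning a 0-1 quality score (a real pipeline would call a model API and likely use embedding-based deduplication).

```python
from difflib import SequenceMatcher

def too_similar(candidate, accepted, threshold=0.9):
    # Reject candidates that near-duplicate an already accepted sample.
    # Real pipelines would typically use embedding or n-gram similarity;
    # SequenceMatcher is a stand-in for illustration.
    return any(
        SequenceMatcher(None, candidate, kept).ratio() >= threshold
        for kept in accepted
    )

def filter_dataset(candidates, judge, min_score=0.7, threshold=0.9):
    """Keep generated samples that pass similarity and quality checks.

    `judge` is a placeholder for an LLM-as-judge call: any callable
    mapping a text to a quality score in [0, 1]. Bias filtering would
    slot in as an additional predicate in the same loop.
    """
    accepted = []
    for text in candidates:
        if too_similar(text, accepted, threshold):
            continue  # similarity check: drop near-duplicates
        if judge(text) < min_score:
            continue  # quality assessment: drop low-scoring samples
        accepted.append(text)
    return accepted

# Usage with a stub judge that accepts everything:
samples = [
    "The cat sat on the mat.",
    "The cat sat on the mat!",   # near-duplicate, filtered out
    "A completely different sentence.",
]
kept = filter_dataset(samples, judge=lambda t: 1.0)
```

The design point is that each filter is independent, so checks (bias, similarity, judge score) can be added or tuned without restructuring the pipeline.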
The resulting synthetic datasets will be released under open licenses, promoting the development of LLMs in underrepresented languages.
Marta Villegas, Barcelona Supercomputing Center, Spain