The development of multilingual foundation LLMs with strong generalisation and reasoning capabilities requires diverse, high-quality pre-training data across languages. While English-language resources are abundant, most European languages lack sufficient open pre-training data in both quantity and quality. Current collection efforts cannot fully address this scarcity, limiting the representation of many languages in multilingual models. Even well-resourced languages face gaps in the diversity and quality of available datasets, hampering the development of effective cross-lingual models. Without addressing these deficiencies in dataset composition, we risk producing underperforming models that lack the capabilities needed for effective downstream applications.
This focused project directly supports the broader EuroLLM and OpenEuroLLM initiatives by addressing a critical bottleneck: the availability of high-quality pre-training data. This need is distinct from the large-scale model training requested in our parallel Extreme Scale proposals. The approach uses generative models to enhance existing content, targeting improvements in language representation, domain coverage, and content diversity across EU languages and beyond.
Building on the methodology established by Nemotron-CC [1] for English, and extending it with novel components that address some of its weaknesses, the project follows a four-phase approach for which continued computing access will be crucial (a minimal code sketch of the first two phases follows the list):
1) quality estimation of available multilingual pre-training data, including the development of state-of-the-art quality estimation models;
2) experimentation with multilingual synthetic data creation;
3) evaluation of the efficacy of different methods for various languages, including running end-to-end ablation studies by training smaller LLMs;
4) large-scale generation of synthetic data for 40 languages: the 24 official EU languages, 9 languages of candidate member states, 3 languages co-official in member states, and others of strategic and economic interest (e.g. Norwegian and Icelandic).
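To make the first two phases concrete, the sketch below scores each document with a quality classifier and routes it to a keep / rephrase / drop bucket, in the spirit of Nemotron-CC's quality-conditioned rewriting. It is an illustration under stated assumptions: the classifier checkpoint name is a hypothetical stand-in for the phase-1 models, the thresholds are arbitrary, and the rephrasing prompt is illustrative rather than the project's actual prompt.

```python
import torch
from transformers import pipeline

# Hypothetical stand-in for the quality-estimation models to be developed in
# phase 1; this checkpoint name does not refer to a released artifact.
scorer = pipeline("text-classification",
                  model="example-org/multilingual-quality-classifier",
                  truncation=True)

# One openly available multilingual generator; any strong instruction-tuned
# model could be substituted here.
generator = pipeline("text-generation",
                     model="utter-project/EuroLLM-9B-Instruct",
                     torch_dtype=torch.bfloat16, device_map="auto")

# Illustrative Nemotron-CC-style rephrasing prompt, not the project's own.
REPHRASE_PROMPT = (
    "Rewrite the following {lang} text as a clear, well-structured passage "
    "in {lang}, preserving all factual content:\n\n{doc}"
)

def process(doc: str, lang: str) -> str | None:
    """Keep high-quality documents, rephrase mid-quality ones, drop the rest."""
    score = scorer(doc)[0]["score"]   # assumed calibrated to [0, 1]
    if score >= 0.8:                  # arbitrary illustrative threshold
        return doc                    # high quality: keep as-is
    if score >= 0.3:                  # arbitrary illustrative threshold
        out = generator(REPHRASE_PROMPT.format(lang=lang, doc=doc),
                        max_new_tokens=1024, do_sample=True, temperature=0.7,
                        return_full_text=False)  # return only the rewrite
        return out[0]["generated_text"]
    return None                       # low quality: exclude from pre-training
```

In the production phase, such routing would run over sharded corpora on HPC nodes, but the bucketing logic itself would stay this simple.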
The project will produce a synthetic multilingual dataset by prompting existing strong generative models to generate texts in the languages, text types, quantities and qualities needed to pre-train strong, open, multilingual LLMs for 40 languages. The researchers will assess the dataset through ablation studies, training models of various sizes and evaluating their performance on multilingual benchmarks to provide quantitative evidence of effectiveness; a rough sketch of such an ablation run follows below. Making this dataset openly available will improve access to quality pre-training resources for all European languages.
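As a rough illustration of an end-to-end ablation run (phase 3), the sketch below trains a small decoder-only model from scratch on one candidate data mixture. It is a sketch under stated assumptions: the ~124M-parameter configuration, the mixture_candidate.jsonl file name, and the hyperparameters are placeholders, and the GPT-2 tokenizer is a simplification for a multilingual setting.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          GPT2Config, GPT2LMHeadModel, Trainer,
                          TrainingArguments)

tok = AutoTokenizer.from_pretrained("gpt2")  # simplification: a real run
tok.pad_token = tok.eos_token                # would use a multilingual tokenizer

# Small decoder-only model (~124M parameters), cheap enough for ablations.
config = GPT2Config(n_layer=12, n_head=12, n_embd=768, vocab_size=tok.vocab_size)
model = GPT2LMHeadModel(config)

# Placeholder file name for one candidate pre-training mixture.
data = load_dataset("json", data_files="mixture_candidate.jsonl")["train"]
data = data.map(lambda ex: tok(ex["text"], truncation=True, max_length=1024),
                remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ablation-run",
                           per_device_train_batch_size=8,
                           max_steps=10_000,       # placeholder token budget
                           learning_rate=3e-4,
                           bf16=True),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
# The resulting checkpoint would then be scored on multilingual benchmarks
# (for example with EleutherAI's lm-evaluation-harness) and compared against
# runs trained on alternative mixtures.
```

Comparing several such runs that are identical except for the data mixture isolates the contribution of the synthetic data to downstream performance.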
The project combines the efforts of the two major LLM initiatives for open and transparent AI in Europe, EuroLLM and OpenEuroLLM, led by strong companies and research groups from across Europe. Researchers are drawn from experienced engineers and scientists with expertise in foundation model training, large-scale training datasets, and leveraging high-performance computing infrastructure. Team members will also include the data engineering leads of each of the participating SMEs, research institutions and HPC expert partners in this proposal.
Gema Ramirez Sanchez, Prompsit Language Engineering, Spain