The European High Performance Computing Joint Undertaking (EuroHPC JU)

Leveraging high quality internal data of the European Institutions at scale to build an EU institutional large language model (LLM)

Awarded Resources (in node hours): 50,000
System Partition: Leonardo Booster
Allocation Period: July 2024 - July 2025

AI Technology: Generative Language Modeling

 

The European Commission's DG Translation (DGT) proposes to create a new state-of-the-art large language model (LLM) with better language capabilities across all EU official languages and in the EU-specific domain.

DGT proposes to do this at scale by using data from Euramis, the EU Institutions’ large internal database of high-quality multilingual institutional text, to continue the pre-training of a state-of-the-art large language model, namely Llama 3 70B.

While proprietary and third-party LLMs offer powerful capabilities, they often fall short in terms of language diversity, data quality, copyright safety, transparency, and freedom from bias. 

The data used to train these models can be of inconsistent quality, which affects the model's performance and reliability. In particular, most LLMs support only a limited number of languages and severely underperform on low-resource languages, which can be a significant drawback for organisations that need multilingual support. This limitation is particularly relevant for European AI projects that require a broad range of EU languages.

To address these concerns, the proposed project would continue the pre-training of the state-of-the-art Llama 3 70B LLM at scale with data from the inter-institutional Euramis database maintained by DGT. The database contains over 100 billion tokens from professional translations, produced within the EU Institutions, of legislative, administrative and policy-related documents drafted there. This makes the data of very high quality compared with data crawled and processed automatically from sources of variable quality. It is safe in terms of copyright, and it does not contain disinformation or other non-factual or unreliable information. It also includes significant representation of low-resource languages, which can address the challenge of language coverage in existing LLMs.

The proposed project would produce a new state-of-the-art LLM with improved language capabilities across all EU official languages and in the EU-specific domain. The new model would be used to power existing AI-based services through the Digital Europe platform run by DGT, and would be made available under the European Commission’s open-sourcing policy for use in other projects in Europe. A model with better EU language capabilities and a better grasp of EU knowledge and style will be better tailored to use cases that are particularly relevant to EU public administrations, SMEs, civil society, and academia.

DGT is at the forefront of introducing AI-driven solutions within the European public sector, offering cutting-edge AI services under the Digital Europe programme. Key initiatives include eTranslation for neural machine translation, eSummary for automated multilingual summarisation, and eBriefing for creating specific draft documents using generative AI. 

These innovations are accessible to government bodies, educational institutions, non-profit organisations, and small and medium-sized enterprises (SMEs) across EU Member States. In the past, our team successfully completed two supercomputing projects through EuroHPC using MeluXina-GPU (projects EHPC-DEV-2022D09-008 and EHPC-DEV-2023D09-008).

That work involved continued pre-training of a smaller open-source LLM (Llama 2 13B), limited to the Slovenian and Croatian languages. Additionally, in a new ongoing project (EHPC-DEV-2024D05-041), the team plans to optimise the training software for the Leonardo Booster hardware, which would feed into the efficient training of Llama 3 70B on the Euramis database at scale under this proposal.
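
To give a rough sense of how the figures quoted on this page relate to one another (a 70B-parameter model, over 100 billion tokens, 50,000 node hours), the sketch below applies the widely used approximation that training costs about 6 FLOPs per parameter per token. It is an illustration only, not the project's own estimate. The hardware values are assumptions that do not come from this page: 4 GPUs per node, a peak of 312 TFLOP/s per GPU (the published dense BF16 figure for a standard NVIDIA A100, which may differ for the Leonardo Booster variant), and a hypothetical sustained utilisation of 40%. The function name `estimate_node_hours` is likewise made up for this example.

```python
def estimate_node_hours(
    params: float,
    tokens: float,
    gpus_per_node: int,
    peak_flops_per_gpu: float,
    utilisation: float,
) -> float:
    """Rough node-hour estimate for one pass over a training corpus.

    Uses the common approximation: total FLOPs ~= 6 * parameters * tokens.
    All hardware arguments are assumptions supplied by the caller.
    """
    if not 0.0 < utilisation <= 1.0:
        raise ValueError("utilisation must be in (0, 1]")
    total_flops = 6.0 * params * tokens
    # Sustained throughput of one node, in FLOP/s.
    sustained_flops_per_node = gpus_per_node * peak_flops_per_gpu * utilisation
    node_seconds = total_flops / sustained_flops_per_node
    return node_seconds / 3600.0


if __name__ == "__main__":
    hours = estimate_node_hours(
        params=70e9,               # nominal size of Llama 3 70B
        tokens=100e9,              # lower bound: "over 100 billion tokens"
        gpus_per_node=4,           # assumption, not stated on this page
        peak_flops_per_gpu=312e12, # assumption: standard A100 dense BF16 peak
        utilisation=0.4,           # assumption: hypothetical sustained efficiency
    )
    print(f"Estimated node hours for one pass: {hours:,.0f}")
    print(f"Share of a 50,000 node-hour award: {hours / 50_000:.0%}")
```

The result is sensitive to the utilisation figure: halving it doubles the estimate. Any real budget would also need headroom for evaluation, checkpointing, restarts and experiments.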