EuroLLM

Awarded Resources (in node hours): 420,000

System Partition: MareNostrum 5 ACC

Allocation Period: 1 May 2024 - 30 April 2025

Large language models (LLMs) have driven enormous progress in natural language processing and AI, as witnessed by models like OpenAI's ChatGPT. Recently, semi-open LLMs have become available (BLOOM, LLaMA, MPT, Pythia), but they are limited mostly to English and a few high-resource languages, excluding many European languages. For example, BLOOM covers 46 languages, but German is not among them.

The most powerful models are owned by large corporations with piecemeal commitment to open science, and other modalities (e.g. speech) lag behind text. Running these models in-house is either impossible or extremely expensive: a serious obstacle for research and innovation in Europe.

The goal of EuroLLM is to train LLMs reproducibly on open data, covering speech and text in all European languages. Efficient versions will be released, enabling European researchers and SMEs to create new research and products. We aim to release models in 4 sizes (7B, 30B, 65B, and 200B parameters), as well as a distilled 7B model.

The largest of these models would surpass the largest model in the GPT-3 family (175B parameters). Compared to BLOOM, our models will be trained on open multilingual data in 80 languages (including all official European languages), will leverage parallel data, and will be multimodal (text and speech); a distilled version will also be provided. Our team comprises 7 research centres and 3 SMEs, currently collaborating on two Horizon Europe projects and one ERC-funded project.