This open, European model sets the standard for multilingual AI innovation and aims to act as a driver for enterprise, experimentation and exchange.
In this interview, EuroHPC JU chats with Research Scientist Nuno Guerreiro and Fractional Communications Director Farah Pasha about the model's development and their plans for the future.
The EuroLLM team includes: Dr. José Pombal, Dr. Duarte Alves, Dr. Nuno Guerreiro, Prof. Andre Martins, Dr. Pedro Martins, Dr. João Alves, and Dr. Patrick Fernandes.
Tell us about your team.
The EuroLLM team brings together some of the brightest minds in AI, including Unbabel, Instituto Superior Técnico (Lisbon), the University of Edinburgh, Instituto de Telecomunicações, Université Paris-Saclay, Aveni, Sorbonne University, Naver Labs, and the University of Amsterdam. United by a shared vision, the project team is committed to advancing multilingual AI technologies that empower Europe’s digital future. With a focus on strengthening Europe’s digital sovereignty, the team is developing solutions that reflect the EU’s commitment to AI for innovation, offering anyone the opportunity to use this homegrown EU LLM and build upon it. The project is living proof that amazing things can happen when Europe comes together to push the boundaries of innovation.
What was the motivation behind creating an LLM for the 24 official languages of the EU?
One of the key challenges we faced was the overwhelming dominance of English in training data, which often sidelines the richness and diversity of other languages. Most LLMs, like ChatGPT or Meta’s Llama, are English-centric and trained largely on English-language datasets. As a result, they tend to be much better at representing English-speaking populations. It was easy to see a gap in the LLM landscape: a lack of solid multilingual solutions, and especially of models able to represent the full set of 24 European languages. This challenge became both a concern and our driving force in developing a truly multilingual LLM.
What were the main challenges you experienced creating an LLM for these 24 European languages?
Creating a ‘balanced model’ was a priority and a challenge for us. We had to consider every step, including how data is passed to the model. Existing LLMs usually penalise languages with less representative data because their tokenisers split text in those languages into far more pieces than the same text in English. For example, it is estimated that Greek text uses 5 to 6 times more tokens (the word and sub-word pieces a model actually consumes) than equivalent English text, so the cost of processing it is also 5 to 6 times higher. This creates an unfair cost burden for less mainstream languages, so we made sure our LLM doesn’t carry that penalty forward.
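To make that token-count gap concrete, here is a minimal sketch using the Hugging Face transformers library. It counts how many tokens an English-heavy tokeniser (GPT-2's, used purely as an example; EuroLLM trains its own multilingual tokeniser) needs for roughly the same sentence in English and in Greek. The sentences and the resulting ratio are illustrative.

```python
# Illustrative sketch: comparing how many tokens an English-centric tokeniser
# needs for roughly the same sentence in English and in Greek.
# Assumes the Hugging Face `transformers` library; GPT-2's tokeniser is used
# here only as an example of an English-heavy vocabulary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

english = "The weather is nice today."
greek = "Ο καιρός είναι καλός σήμερα."  # roughly the same sentence in Greek

n_en = len(tokenizer.encode(english))
n_el = len(tokenizer.encode(greek))

print(f"English tokens: {n_en}")
print(f"Greek tokens:   {n_el}")
print(f"Ratio: {n_el / n_en:.1f}x")  # typically several times higher for Greek
```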
To overcome the lack of language diversity in LLMs, we put a lot of care into filtering data for the other European languages. We had to strike a balance, curating a proportion of data that maintained model quality while ensuring fair representation for all languages, including under-represented ones like Maltese and Irish. Our model utilises a dataset of trillions of tokens sourced from these different languages.
We set out to create a system that performs effectively across multiple languages, which meant carefully balancing the mix of training data. We had to decide how much focus to give each language and what role English would play in the training process. We did not want a model that was overwhelmingly English-centric, like most existing open alternatives (which are typically trained on 80%-90% English-language data). We developed scaling laws to predict the performance of different blends and settled on 50% English data. We are confident that with this mix the model performs well in the other languages without compromising its performance in English.
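As an illustration of this kind of mixture tuning (a hypothetical sketch, not the team's actual methodology), one could fit a simple scaling-law-style curve to results from small pilot runs and use it to predict how a candidate blend would perform before committing large amounts of compute. All numbers below are invented for the example.

```python
# Hypothetical sketch: fit a power-law-style curve to pilot-run results to
# compare training-data mixtures. Every number here is made up for illustration.
import numpy as np
from scipy.optimize import curve_fit

# Fraction of English in the training mix vs. average validation loss on
# non-English languages at a fixed compute budget (invented pilot results).
english_fraction = np.array([0.3, 0.5, 0.7, 0.9])
non_english_loss = np.array([2.10, 2.15, 2.28, 2.55])

def power_law(x, a, b, c):
    # Loss on non-English languages grows as their share (1 - x) shrinks.
    return a * (1.0 - x) ** (-b) + c

params, _ = curve_fit(power_law, english_fraction, non_english_loss, p0=(0.1, 0.5, 2.0))

# Predict non-English loss for a candidate mixture before training at full scale.
candidate = 0.5
print(f"Predicted non-English loss at {candidate:.0%} English: "
      f"{power_law(candidate, *params):.3f}")
```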
Who did you envision as your audience/users?
We wanted the model to be used by everyone, so we kept it as general as possible. We also didn’t want to constrain the model by training it to be particularly good at a limited number of tasks. EuroLLM works across several use cases like translation apps, chatbots, or even virtual assistants. For example, a business could use it to build a customer support bot that pulls information straight from its product catalogue and answers instantly, in any European language.
We made our model open source so that it can be adapted as needed and others can build on top of it. That was especially important to us because it is a multilingual tool, and most models are closed and do not represent European languages well enough.
We are really proud that our LLM is currently trending on Hugging Face, a platform where the machine learning community collaborates on models, datasets and applications. EuroLLM 9B amassed more than 60,000 downloads in the weeks after it was launched, which is impressive in such a crowded space, particularly one dominated by American- and Chinese-made models. We want to keep that momentum going for our European model!
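For readers who want to try the model themselves, below is a minimal sketch of loading EuroLLM from the Hugging Face hub with the transformers library; the model id and prompt are illustrative (the 9B checkpoint is published under the utter-project organisation, but check the hub page for the exact name, licence and hardware requirements).

```python
# Minimal sketch: download EuroLLM 9B from the Hugging Face hub and generate
# text in one of the 24 official EU languages. Model id is illustrative; see
# the hub page for the current checkpoint names and hardware requirements.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "utter-project/EuroLLM-9B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Any of the 24 official EU languages works; Portuguese here as an example.
prompt = "A capital de Portugal é"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```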
Speaking of European-made models, how important was the support of EuroHPC JU to the development of your project?
Our model was built from scratch on extensive training data on MareNostrum 5 at the Barcelona Supercomputing Center, leveraging advanced European HPC infrastructure for large-scale training. Europe is the only continent in the world with a large public network of supercomputers, managed by the EuroHPC Joint Undertaking (EuroHPC JU). As soon as EuroHPC JU granted us access to the supercomputer, we were ready to roll up our sleeves and get to work. We developed the small model right away, and in less than six months the second model was ready. Now we are using the remaining time on the supercomputer to build a third model. The EuroHPC JU Extreme Scale Access support was a major accelerant in getting the project to where it is; it would not have been possible without it.
Your translation tool Widn combines accuracy and personalisation, making the content it produces more authentic. Can you explain a little bit more about your aims and design approach for Widn?
Most existing tools are monotonic translators: they convert the text you feed them more or less word for word. We wanted Widn to produce something less literal and to prioritise capturing nuance, cultural difference and the ‘human’ side of language, while still ensuring a high level of accuracy. Widn allows users to customise their content so it can be aligned with a company or organisation’s branding, style or messaging.
Our solution is built to focus on meaning. We didn’t want the essence of what was being said to get lost in translation. Instead of defaulting to a “model answer”, Widn aims for something far closer to the complexity and nuance of real human communication.
What are the future plans of EuroLLM?
Right now, we are advancing AI for translation at pace, using our remaining computational hours on MareNostrum 5. We are focusing on vision and speech, but we face similar barriers there: much of the available vision and speech data is paired only with English text. Our aim is to build a multilingual, multimodal speech and vision model for all of the European languages, so we can offer a best-in-class solution to customers and businesses across Europe. We want to continue to expand our text-only capabilities, but adding images and speech will produce a much more holistic model and experience for users. We are especially proud to be contributing to the AI ambitions of the European Union by building the best technologies possible and becoming leaders in AI research and development.
The future is bright, and we’re just getting started!