AI Technology: Generative Language Modeling; Natural Language Processing.
The national libraries of Norway and Sweden collect and preserve nearly everything published in their respective languages. Both organizations have used these collections to train and release open-access AI models that have seen widespread use, with millions of combined downloads.
Through contacts with academia and the public sector, we have periodically surveyed the needs of the organizations that use our models. In these dialogues, the organizations have requested updated versions of smaller encoder models that can handle longer context lengths, as well as generative language models that perform more reliably on Swedish, Norwegian, and Danish.
In this project we thus propose to pretrain new long-context BERT models for Swedish, Norwegian, and Danish. We further plan to release sentence-embedding versions of each of these models.
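To make the encoder plan concrete, the sketch below shows one way such a model could be configured and used to produce sentence embeddings with the Hugging Face transformers library. This is a minimal illustration, not the project's training code: the 2048-token context window, the model dimensions, and the reuse of the existing KBLab Swedish BERT tokenizer as a stand-in vocabulary are all assumptions made for the example.

```python
import torch
from transformers import AutoTokenizer, BertConfig, BertModel

# Assumed stand-in vocabulary: the tokenizer of an existing KBLab checkpoint.
tokenizer = AutoTokenizer.from_pretrained("KBLab/bert-base-swedish-cased")

# A BERT configuration with an extended context window (2048 tokens here,
# versus the conventional 512); dimensions are illustrative base-size values.
config = BertConfig(
    vocab_size=tokenizer.vocab_size,
    max_position_embeddings=2048,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
)
model = BertModel(config)  # randomly initialized; to be pretrained from scratch

def sentence_embeddings(texts):
    """Mean-pool the last hidden states over non-padding tokens."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=2048, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state         # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # (B, T, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # (B, H)

print(sentence_embeddings(["Nationalbiblioteket samlar allt."]).shape)
```

Mean pooling is the common recipe for deriving sentence embeddings from an encoder; the released embedding models may instead use a different pooling scheme or contrastive fine-tuning.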
Pretraining larger generative language models requires text at scales that are infeasible to collect for any individual Scandinavian language. We believe that pooling text corpora across these largely mutually intelligible languages will benefit model performance for all Scandinavian languages.
As part of this proposal, we therefore plan to perform continued pretraining of a ~7B-parameter GPT-/Llama-style model on a quality-filtered dataset mixture of English, code, and the Scandinavian languages.
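As a rough illustration of the mixture idea, the sketch below filters toy corpora with a simple quality heuristic and interleaves them with assumed sampling weights, using the Hugging Face datasets library. It is not the project's actual data recipe: the corpus contents, mixture weights, and filter threshold are all placeholder assumptions.

```python
from datasets import Dataset, interleave_datasets

def toy_corpus(tag, n=1000):
    """Placeholder in-memory corpus standing in for a real collection."""
    return Dataset.from_dict({"text": [f"{tag} document {i} ..." for i in range(n)]})

corpora = {
    "en": toy_corpus("english"),
    "code": toy_corpus("code"),
    "sv": toy_corpus("swedish"),
    "no": toy_corpus("norwegian"),
    "da": toy_corpus("danish"),
}

def quality_filter(example):
    """Toy quality heuristic: keep non-trivial, mostly alphabetic text."""
    text = example["text"]
    return len(text) >= 10 and sum(c.isalpha() for c in text) / len(text) > 0.5

filtered = {name: ds.filter(quality_filter) for name, ds in corpora.items()}

# Assumed sampling weights (summing to 1); a real mixture would be tuned
# empirically to balance English/code transfer against Scandinavian coverage.
weights = {"en": 0.30, "code": 0.20, "sv": 0.20, "no": 0.20, "da": 0.10}

mixture = interleave_datasets(
    [filtered[name] for name in weights],
    probabilities=[weights[name] for name in weights],
    seed=42,
    stopping_strategy="all_exhausted",  # repeat shorter corpora until all are seen
)
print(mixture[0])
```

The resulting mixture can be tokenized and fed to any causal-LM training loop; the key design choice is the sampling probabilities, which control how much English and code data the Scandinavian text is diluted with during continued pretraining.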
Robin Kurtz, National Library of Sweden - Sweden