AI Technology: Generative Language Modeling | Natural Language Processing | Deep Learning
This project concerns the pre-training of a foundational multilingual large language model with billions of parameters that excels at Danish. The model is primarily intended as a foundation for educational applications in teaching at all levels, from primary and secondary school through to university.
The code base has been refined and tested as part of the development project EUHPC_D07_063, which is currently being completed and in which training has been scaled to run successfully on up to 64 nodes of the Leonardo Booster module, corresponding to 256 GPUs. The collection of high-quality Danish text corpora for the training data has surpassed 100B tokens. The Danish data is supplemented by Common Corpus, an open, GDPR-compliant multilingual dataset of 2T tokens.
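As a rough illustration of how such a corpus mix could be combined at training time, the sketch below interleaves the Danish data with Common Corpus using the Hugging Face datasets library. The dataset identifiers, the "text" field, and the sampling probabilities are illustrative assumptions, not the project's actual configuration.

    # Minimal sketch: weighted interleaving of a Danish corpus with the
    # multilingual Common Corpus for pre-training. Identifiers, field
    # names, and mixing weights are illustrative assumptions.
    from datasets import load_dataset, interleave_datasets

    danish = load_dataset("danish-corpus", split="train", streaming=True)         # hypothetical id, ~100B tokens
    common = load_dataset("PleIAs/common_corpus", split="train", streaming=True)  # Common Corpus, ~2T tokens

    # Upsample the Danish data well above its raw share (~5% of the mix)
    # so the resulting model excels at Danish despite Common Corpus being
    # roughly 20x larger.
    mixed = interleave_datasets(
        [danish, common],
        probabilities=[0.3, 0.7],           # illustrative mixing weights
        seed=42,
        stopping_strategy="all_exhausted",  # restart the smaller dataset until both are fully seen
    )

    # Inspect a few interleaved examples (assuming a "text" column).
    for example in mixed.take(3):
        print(example["text"][:80])

The mixing weights are the key design choice here: sampling Danish far above its natural proportion is a common way to bias a multilingual pre-training run toward a lower-resource target language.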
Peter Schneider-Kamp, University of Southern Denmark, Denmark