
Danish Large Language Model for Teaching: Pretraining on Open Data

Awarded Resources: 50,000 node hours
System Partition: Leonardo BOOSTER
Allocation Period: January 2025 - January 2026

AI Technology: Generative Language Modeling | Natural Language Processing | Deep Learning

This project concerns the pre-training of a foundational multilingual large language model with billions of parameters that excels at Danish. The model is primarily intended as a foundation model for educational applications in teaching at all levels, from primary and secondary school through university.

The code base has been refined and tested as part of the development project EUHPC_D07_063, which is nearing completion and has scaled the training to run successfully on up to 64 nodes of the Leonardo Booster partition, corresponding to 256 GPUs. The collection of high-quality Danish text corpora for the training data has surpassed 100B tokens. The Danish data is supplemented by Common Corpus, an open, GDPR-compliant multilingual dataset of 2T tokens.
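For context, each Leonardo Booster node hosts four GPUs, so 64 nodes yield the 256 GPUs mentioned above. The sketch below is a minimal illustration, not the project's actual code base, of how a multi-node pre-training job typically spans those ranks with PyTorch's DistributedDataParallel; the placeholder model, the `torchrun` launcher, and the NCCL backend are assumptions chosen for the example.

```python
# Minimal multi-node training skeleton (illustrative only; not the
# project's code base). Launched e.g. with:
#   torchrun --nnodes=64 --nproc-per-node=4 train.py
# torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main() -> None:
    # NCCL handles GPU-to-GPU communication within and across nodes.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model: the actual LLM architecture is not specified
    # in the project description.
    model = torch.nn.Linear(4096, 4096).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # Training loop would go here: each of the 256 ranks consumes its
    # own shard of the corpus, and DDP all-reduces gradients across
    # all nodes after each backward pass.

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```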