AI Technology: Generative Language Modeling | Natural Language Processing | Deep Learning
This project concerns the pre-training of a foundational multilingual large language model with billions of parameters that excels at Danish. The model is primarily intended as a foundation for educational applications in teaching at all levels, from primary and secondary school through to university.
The code base has been refined and tested as part of the development project EUHPC_D07_063, which is currently being completed and in which training has been scaled to run successfully on up to 64 nodes of the Leonardo Booster module, corresponding to 256 GPUs. The collection of high-quality Danish text corpora for the training data has surpassed 100B tokens. The Danish data is supplemented by Common Corpus, an open, GDPR-compliant multilingual dataset of 2T tokens.
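As a rough illustration of how such a corpus mix could be combined at training time, the sketch below interleaves the Danish data with Common Corpus using the Hugging Face datasets library. The dataset identifiers, the "text" field, and the sampling probabilities are illustrative assumptions, not the project's actual configuration.

    # Minimal sketch: weighted interleaving of a Danish corpus with the
    # multilingual Common Corpus for pre-training. Identifiers, field
    # names, and mixing weights are illustrative assumptions.
    from datasets import load_dataset, interleave_datasets

    danish = load_dataset("danish-corpus", split="train", streaming=True)         # hypothetical id, ~100B tokens
    common = load_dataset("PleIAs/common_corpus", split="train", streaming=True)  # Common Corpus, ~2T tokens

    # Upsample the Danish data well above its raw share (~5% of the mix)
    # so the resulting model excels at Danish despite Common Corpus being
    # roughly 20x larger.
    mixed = interleave_datasets(
        [danish, common],
        probabilities=[0.3, 0.7],           # illustrative mixing weights
        seed=42,
        stopping_strategy="all_exhausted",  # restart the smaller dataset until both are fully seen
    )

    # Inspect a few interleaved examples (assuming a "text" column).
    for example in mixed.take(3):
        print(example["text"][:80])

The mixing weights are the key design choice here: sampling Danish far above its natural proportion is a common way to bias a multilingual pre-training run toward a lower-resource target language.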
Peter Schneider-Kamp, University of Southern Denmark, Denmark