With the introduction of the transformer neural network (Vaswani et al. 2017) and subsequent transformer language models (LMs) such as GPT (Radford et al. 2018) and BERT (Devlin et al. 2018), the pretraining-finetuning paradigm has become the new standard.
This approach leverages large amounts of unannotated text data in a self-supervised pretraining step, producing generalist models that can then be finetuned on specific tasks.
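The self-supervised step can be illustrated with the common next-token-prediction objective, under which unannotated text alone yields training pairs. The following is a minimal sketch; the whitespace tokenizer is a stand-in for a real subword tokenizer and is not part of any specific model described here.

```python
# Sketch of the self-supervised pretraining objective (next-token prediction):
# raw text is turned into (context, target) pairs without human annotation.

def make_lm_pairs(tokens):
    """Shift the sequence by one: each token is predicted from the token before it."""
    return list(zip(tokens[:-1], tokens[1:]))

text = "language models predict the next token"
tokens = text.split()  # whitespace tokenization, for illustration only
pairs = make_lm_pairs(tokens)
# pairs[0] is ("language", "models"): a supervised example derived from raw text.
```

Finetuning then reuses the pretrained weights but optimizes a task-specific loss on annotated examples.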
The largest of these models consist of several hundred billion parameters, requiring the model itself to be split across multiple GPUs, and up to several thousand GPUs to train it in a reasonable amount of time. Due to the extreme size of these models and their ability to absorb massive datasets, they have been shown to be adept at learning new tasks from only a few training examples, or even none at all, needing only a prompt describing the task (Brown et al. 2020).
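In the prompting regime described by Brown et al. (2020), the task is specified entirely in the input text rather than through gradient updates. A hedged sketch of how such a few-shot prompt might be assembled; the task description and examples below are illustrative, not taken from the paper:

```python
# Few-shot prompting sketch: the task description and worked examples are
# concatenated into the model's input; the model completes the final line.

def build_prompt(description, examples, query):
    """Assemble a prompt from a task description, worked examples, and a query."""
    lines = [description]
    for inp, out in examples:
        lines.append(f"Input: {inp}\nOutput: {out}")
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)

prompt = build_prompt(
    "Translate English to Swedish.",
    [("hello", "hej"), ("thank you", "tack")],
    "good morning",
)
```

With an empty example list, the same format reduces to a zero-shot prompt containing only the task description and the query.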
Due to the high financial cost of training these models, most of them have been trained by large companies, and on English data only. By developing a competitively sized model for Swedish, we hope to enable commercial and non-commercial use of this technology, while also enabling researchers to study what these models learn in languages other than English.