Natural language processing plays an increasingly important role in our digital lives, but it works best for high-resource languages because of the amount of data needed to train models. This project will create new language processing models that can handle any language rendered as an image, based on the insight that written languages share visual similarities, resulting in high-quality models for thousands of languages.
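The core rendering idea can be illustrated with a minimal sketch: text in any script is rasterized to an image, which is then cut into fixed-size patches that play the role tokens play in a conventional language model. The function names, image size, and patch size below are illustrative assumptions, not the project's actual pipeline.

```python
# Illustrative sketch only: names and sizes are assumptions for exposition,
# not the project's actual code.
from PIL import Image, ImageDraw, ImageFont
import numpy as np

def render_text(text: str, height: int = 16, width: int = 512) -> np.ndarray:
    """Rasterize a string to a grayscale image of shape (height, width)."""
    img = Image.new("L", (width, height), color=255)  # white canvas
    # The default bitmap font covers only basic glyphs; a real system would
    # need full font fallback to render every script.
    ImageDraw.Draw(img).text((0, 0), text, fill=0, font=ImageFont.load_default())
    return np.asarray(img, dtype=np.float32) / 255.0

def to_patches(pixels: np.ndarray, patch: int = 16) -> np.ndarray:
    """Cut the image into flat (patch x patch) pieces, analogous to tokens."""
    h, w = pixels.shape
    h2, w2 = (h // patch) * patch, (w // patch) * patch  # drop ragged edges
    p = pixels[:h2, :w2].reshape(h2 // patch, patch, w2 // patch, patch)
    return p.transpose(0, 2, 1, 3).reshape(-1, patch * patch)

patches = to_patches(render_text("the same model sees any script as pixels"))
print(patches.shape)  # (32, 256): 32 "visual tokens" of 16x16 pixels each
```

Because the model's input is pixels rather than entries in a fixed vocabulary, the same architecture applies to any writing system without a language-specific tokenizer.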
High Performance Computing will play a crucial role in training the different variants of these models: each variant is a large-scale deep neural network trained on trillions of tokens and therefore requires substantial GPU resources.
Given the highly experimental nature of this project, many model architectures will need to be explored. The major expected outcomes are: (i) knowledge of which model architectures are most suitable for large-scale pixel-based language models, and (ii) a collection of high-quality language models that can be readily applied to data in any language.
University of Copenhagen, Denmark