The European High Performance Computing Joint Undertaking (EuroHPC JU)

Language Modelling from Pixels

Awarded Resources (in node hours)
System Partition: Karolina GPU
Allocation Period: 7 July 2023 - 6 July 2024

Natural language processing plays an increasingly important part in our digital lives, but it works best for high-resource languages because of the large amount of data needed to train models. This project will create new language processing models that can handle any language rendered as an image, based on the insight that written languages share visual similarities, resulting in high-quality models for thousands of languages.
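As a rough illustration of what "language rendered in an image" could mean as model input, the sketch below cuts an already-rendered grayscale text strip into fixed-width patches. The patch-based input, the function name `split_into_patches`, the white padding value, and the toy pixel data are all assumptions made for illustration; the project description does not specify how images are fed to the models.

```python
from typing import List

# Grayscale image as rows of pixel intensities (0 = black, 255 = white).
Image = List[List[int]]


def split_into_patches(image: Image, patch_width: int) -> List[Image]:
    """Cut a rendered text strip into fixed-width patches, left to right.

    Hypothetical helper: the last patch is padded with white (255) when the
    strip width is not a multiple of patch_width.
    """
    if patch_width <= 0:
        raise ValueError("patch_width must be positive")
    if not image or not image[0]:
        return []
    width = len(image[0])
    if any(len(row) != width for row in image):
        raise ValueError("all rows must have the same width")

    patches: List[Image] = []
    for start in range(0, width, patch_width):
        patch = []
        for row in image:
            chunk = row[start:start + patch_width]
            # Pad the final, possibly shorter, chunk with white pixels.
            chunk = chunk + [255] * (patch_width - len(chunk))
            patch.append(chunk)
        patches.append(patch)
    return patches


# Toy 2x5 "rendered" strip; real input would come from a text renderer.
strip = [
    [0, 255, 0, 255, 0],
    [255, 0, 255, 0, 255],
]
patches = split_into_patches(strip, patch_width=2)
```

Because the patches depend only on pixels, the same splitting step would apply to text in any script, which is the property the project description relies on.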

High-performance computing will play a crucial role in training the different variants of these models. Each variant is a large-scale deep neural network that needs substantial GPU resources to train on trillions of tokens.

Because the project is highly experimental, many model architectures will likely need to be explored. The major expected outcomes are: (i) knowledge of which model architectures are most suitable for large-scale pixel-based language models, and (ii) a collection of high-quality language models that can be readily applied to data in any language.