Training of pre-trained streaming speech recognition models

39,700

Awarded Resources (in node hours)

LUMI-G

System Partition

13 November 2023 - 12 November 2024

Allocation Period

This project is directly connected to the European Commission CNECT/LUX/2022/OP/0030 Lot 2 (Digital Europe programme - creation of open source European language speech recognition solutions), 12/2022-12/2024. It shall deliver several "EU-based" large pre-trained models which will be publicly available for research and commercial use by public bodies and SMEs.

In contrast to existing pre-trained models, this project pursues two new directions, each responding to a limitation of current models:

1) The models currently in use worldwide are not fully traceable (not all training data is disclosed), or the licensing of their training data does not comply with the upcoming EU AI and privacy regulations. Data privacy is one of the EU's main future targets, so having an "EU-based" large pre-trained model is important.
2) Current pre-trained models are not streaming models. Running them in streaming scenarios is suboptimal and energy-inefficient.

Point 1) is addressed by training solely on open public data that complies with EU regulations (mainly from the European Parliament and other public sources). We expect to use up to 400k hours of speech for training. Point 2) is addressed, for example, by masking in the attention mechanism.
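The text names attention masking only as an example and does not say which scheme the project uses. As an illustration, the sketch below shows one common variant, chunk-wise masking, in which each frame may attend to its own chunk and a limited number of preceding chunks, but never to later ones. The function names, the chunk size and the left-context setting are all hypothetical choices made for this example, not details taken from the project.

```python
import math


def chunked_attention_mask(num_frames, chunk_size, left_chunks=-1):
    """Return mask[i][j] == True where query frame i may attend to key frame j.

    Frames are grouped into chunks of `chunk_size`. A frame sees every frame in
    its own chunk plus `left_chunks` preceding chunks (-1 means all of them),
    and never sees a later chunk, so look-ahead is bounded by one chunk.
    (Illustrative assumption: the project may use a different masking scheme.)
    """
    mask = []
    for i in range(num_frames):
        chunk = i // chunk_size
        first = 0 if left_chunks < 0 else max(0, (chunk - left_chunks) * chunk_size)
        end = min(num_frames, (chunk + 1) * chunk_size)  # exclusive bound
        mask.append([first <= j < end for j in range(num_frames)])
    return mask


def masked_softmax(scores, mask_row):
    """Softmax over one row of attention scores, ignoring masked-out keys."""
    allowed = [s for s, keep in zip(scores, mask_row) if keep]
    peak = max(allowed)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) if keep else 0.0 for s, keep in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]


# Six frames, chunks of two frames, one chunk of left context.
mask = chunked_attention_mask(num_frames=6, chunk_size=2, left_chunks=1)
for row in mask:
    print("".join("x" if keep else "." for keep in row))
```

Because no frame depends on audio beyond the end of its own chunk, a model trained with such a mask can emit output chunk by chunk as audio arrives instead of waiting for the whole utterance. Limiting the left context additionally bounds the amount of computation per chunk.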

The time frame is 14 months (model delivery is in November 2024).