The European High Performance Computing Joint Undertaking (EuroHPC JU)

chaRNABert: A character level RNA foundation model

Awarded Resources: 50,000 node hours
System Partition: Leonardo Booster
Allocation Period: 3 June 2024 - 2 June 2025

AI Technology: Natural Language Processing, Deep Learning


The project seeks to develop an advanced RNA language model using cutting-edge artificial intelligence to decode the sequences of both coding and non-coding RNA. Built on a BERT-like, encoder-only transformer architecture, the model aims to deepen our understanding of RNA's complex structures and functions.
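A character-level, BERT-like encoder is typically pretrained with masked language modeling: individual nucleotides are hidden and the model learns to recover them from context. The sketch below illustrates only that masking step, under assumptions not stated in the project description: a hypothetical single-character RNA vocabulary (`A`, `C`, `G`, `U` plus special tokens) and a standard 15% masking rate. It is an illustration of the general technique, not the project's actual pipeline.

```python
import random

# Hypothetical character-level vocabulary for RNA (an assumption, not from the source).
VOCAB = {"[PAD]": 0, "[MASK]": 1, "A": 2, "C": 3, "G": 4, "U": 5}

def mask_sequence(seq, mask_prob=0.15, rng=None):
    """BERT-style masking: each nucleotide is independently selected with
    probability mask_prob and replaced by the [MASK] token id.
    Returns (input_ids, labels); labels are -100 at unmasked positions so a
    cross-entropy loss would ignore them."""
    rng = rng or random.Random(0)
    input_ids, labels = [], []
    for ch in seq:
        tid = VOCAB[ch]
        if rng.random() < mask_prob:
            input_ids.append(VOCAB["[MASK]"])
            labels.append(tid)       # the model must predict the original base
        else:
            input_ids.append(tid)
            labels.append(-100)      # position ignored by the loss
    return input_ids, labels

ids, labels = mask_sequence("AUGGCUACGUAA")
```

An encoder-only transformer trained on such pairs learns contextual embeddings for every nucleotide position, which is what makes character-level modeling attractive for RNA, where no natural "word" boundary exists.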

This effort marks a significant leap forward in computational biology by filling existing gaps in RNA modeling and establishing new standards in the field. 

The project's method employs innovative techniques such as Gradient-based Subword Tokenization (GBST) and Recursive InterNetwork Group (RING) attention mechanisms. GBST dynamically tokenizes RNA sequences, enabling the model to identify and learn effectively from the most informative segments.
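The core idea behind gradient-based subword tokenization (in the Charformer style) can be sketched numerically: for each character position, candidate "subword" embeddings are mean-pooled over several block sizes, scored, and mixed with a softmax, so the choice of segmentation stays differentiable and is learned jointly with the model. The scoring weights below are random stand-ins for learned parameters; block sizes and dimensions are illustrative assumptions, not values from the project.

```python
import numpy as np

def gbst(char_emb, block_sizes=(1, 2, 4), rng=np.random.default_rng(0)):
    """Simplified GBST sketch: mix mean-pooled candidate blocks per position
    with a softmax over block sizes, keeping tokenization differentiable."""
    L, d = char_emb.shape
    w = rng.normal(size=d)                  # stand-in for learned scoring weights
    cands = []
    for b in block_sizes:
        # Mean-pool non-overlapping blocks of size b, then upsample back to length L.
        pad = (-L) % b
        x = np.pad(char_emb, ((0, pad), (0, 0)))
        pooled = x.reshape(-1, b, d).mean(axis=1)        # (ceil(L/b), d)
        cands.append(np.repeat(pooled, b, axis=0)[:L])   # back to (L, d)
    cands = np.stack(cands, axis=1)                      # (L, n_blocks, d)
    scores = cands @ w                                   # (L, n_blocks)
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    return (probs[..., None] * cands).sum(axis=1)        # (L, d)

latent = gbst(np.random.default_rng(1).normal(size=(12, 8)))
```

Because the mixing weights come from a softmax rather than a hard segmentation, gradients flow back through the block scores, which is what lets the model discover informative RNA segments during training.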

Concurrently, RING attention allows the model to detect intricate patterns and relationships within RNA sequences, significantly enhancing its predictive accuracy and ability to generalize across diverse RNA families and structures.
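Ring-style attention is usually motivated by long sequences: queries are split into chunks, and key/value chunks are processed one at a time, as if passed around a ring of devices, with partial results combined by a streaming (online) softmax so the full attention matrix is never materialized. The single-process sketch below simulates that combination rule; the chunk counts and dimensions are illustrative assumptions, and real ring attention would exchange the chunks between accelerators.

```python
import numpy as np

def ring_attention(q, k, v, n_chunks=4):
    """Single-process sketch of ring attention with a streaming softmax."""
    L, d = q.shape
    qs, ks, vs = (np.array_split(x, n_chunks) for x in (q, k, v))
    out = []
    for qi in qs:
        m = np.full(qi.shape[0], -np.inf)   # running row max
        s = np.zeros(qi.shape[0])           # running softmax denominator
        acc = np.zeros_like(qi)             # running weighted value sum
        for kj, vj in zip(ks, vs):          # one "ring step" per K/V chunk
            logits = qi @ kj.T / np.sqrt(d)
            m_new = np.maximum(m, logits.max(axis=1))
            scale = np.exp(m - m_new)       # rescale previous partial results
            p = np.exp(logits - m_new[:, None])
            s = s * scale + p.sum(axis=1)
            acc = acc * scale[:, None] + p @ vj
            m = m_new
        out.append(acc / s[:, None])
    return np.vstack(out)
```

The streaming update gives results identical to ordinary softmax attention while only ever holding one query chunk and one key/value chunk in memory, which is the property that lets such models scale to long RNA sequences.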

This project is designed to create a comprehensive RNA language model that supports a wide range of RNA types, providing a foundational tool for both computational and experimental biologists to explore and answer complex biological questions.