PRISM proposes a novel approach to training large-scale protein language models by decoupling token embeddings from the transformer body and specialising tokenisation per biological dataset. Unlike current methods, which impose a shared vocabulary across curated and metagenomic data, PRISM uses Gradient-Based Subword Tokenisation (GBST) to learn optimal subword structures dynamically within each corpus, such as UniProt, MAGnify, JGI, and GEMS. This architecture mitigates vocabulary dilution, negative interference between datasets, and model oversquashing, enabling efficient, federated pretraining of a single transformer body while retaining per-silo embedding specialisation.
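The per-silo front-end plus shared-body design can be illustrated with a short PyTorch sketch. This is a simplified, hypothetical illustration rather than the proposed implementation: GBST is reduced to soft block-size selection over mean-pooled character blocks (in the spirit of Charformer), and the names and dimensions (`SimpleGBST`, `PRISMLikeModel`, the silo keys, `d_model`, etc.) are assumptions introduced here for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGBST(nn.Module):
    """Simplified gradient-based subword tokenisation (Charformer-style):
    characters are embedded, mean-pooled into candidate blocks of several
    sizes, and a learned score softly selects a block size per position."""
    def __init__(self, vocab_size, d_model, block_sizes=(1, 2, 3, 4)):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, d_model)
        self.block_sizes = block_sizes
        self.scorer = nn.Linear(d_model, 1)

    def forward(self, char_ids):                            # (B, L) integer ids
        x = self.char_emb(char_ids)                         # (B, L, D)
        candidates = []
        for b in self.block_sizes:
            # mean-pool non-overlapping blocks, then broadcast back to length L
            pooled = F.avg_pool1d(x.transpose(1, 2), b, stride=b, ceil_mode=True)
            upsampled = pooled.repeat_interleave(b, dim=2)[:, :, : x.size(1)]
            candidates.append(upsampled.transpose(1, 2))    # (B, L, D)
        cand = torch.stack(candidates, dim=2)               # (B, L, K, D)
        weights = self.scorer(cand).softmax(dim=2)          # (B, L, K, 1)
        return (weights * cand).sum(dim=2)                  # (B, L, D)

class PRISMLikeModel(nn.Module):
    """One GBST front-end per data silo feeding a single shared transformer
    body; silo names follow the abstract, all dimensions are illustrative."""
    def __init__(self, silos=("uniprot", "magnify", "jgi", "gems"),
                 vocab_size=32, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        self.frontends = nn.ModuleDict(
            {s: SimpleGBST(vocab_size, d_model) for s in silos})
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.body = nn.TransformerEncoder(layer, n_layers)

    def forward(self, char_ids, silo):
        # contextual embeddings for one silo's batch, shape (B, L, D)
        return self.body(self.frontends[silo](char_ids))
```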
PRISM’s innovation extends to a hybrid pretraining loss that combines span corruption, which enhances structural motif learning, with cross-silo contrastive learning, which encourages dataset-invariant representations. Following decentralised pretraining, PRISM reconstructs a unified global tokeniser and embedding matrix, enabling seamless inference and fine-tuning on unseen protein sequences from any biological origin.
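A minimal sketch of such a hybrid objective is given below, under assumptions not stated in the abstract: span corruption is reduced to cross-entropy over corrupted positions (rather than a full T5-style sequence-to-sequence formulation), the contrastive pairs are taken to be the same sequence viewed through two silo front-ends, and `alpha` and `temperature` are illustrative hyperparameters.

```python
import torch
import torch.nn.functional as F

def span_corruption_loss(logits, targets, corrupted_mask):
    """Cross-entropy restricted to positions inside corrupted spans; a
    simplified stand-in for full T5-style span-corruption pretraining."""
    return F.cross_entropy(logits[corrupted_mask], targets[corrupted_mask])

def cross_silo_contrastive_loss(z_a, z_b, temperature=0.1):
    """InfoNCE between pooled representations of the same sequence seen through
    two different silo front-ends; other pairs in the batch act as negatives."""
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature                   # (B, B) similarities
    labels = torch.arange(z_a.size(0), device=z_a.device)  # diagonal = positives
    return F.cross_entropy(logits, labels)

def hybrid_loss(logits, targets, corrupted_mask, z_a, z_b, alpha=0.5):
    """Weighted sum of the two objectives; alpha is an illustrative weight."""
    return (span_corruption_loss(logits, targets, corrupted_mask)
            + alpha * cross_silo_contrastive_loss(z_a, z_b))
```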
This project will leverage the compute capabilities of PRACE HPC infrastructure to orchestrate communication-efficient training across thousands of GPUs, minimising synchronisation overheads and memory bottlenecks. The resulting model will enable high-fidelity protein representation across environmental, clinical, and synthetic biology domains, setting a new standard in protein language modelling. PRISM directly addresses major scientific challenges in functional annotation, structure prediction, and variant effect inference from previously inaccessible metagenomic and environmental sequence data, with broad implications for biotechnology, healthcare, and sustainable bioengineering.
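One way to read "communication-efficient training" given the architecture above is that only the shared transformer body needs to be synchronised across ranks, while each silo-specific front-end stays local to the ranks holding that silo's data. The sketch below assumes an initialised `torch.distributed` process group and the `model.body` attribute from the earlier illustration; it shows a plain gradient all-reduce and is not the project's actual communication scheme.

```python
import torch.distributed as dist

def sync_shared_body_only(model, world_size):
    """Average gradients of the shared transformer body across ranks while
    leaving each silo-specific GBST front-end local to the rank that owns it.
    Minimal sketch: a production run would bucket tensors and overlap
    communication with the backward pass (e.g. via gradient hooks)."""
    for param in model.body.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```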
Alexis Molina Martinez de Los Reyes, Nostrum Biodiscovery, Spain