AI Technology: Natural Language Processing
This project introduces MatMulFreePLM, a protein language model that overcomes the performance plateau of existing models like ESM by expanding and diversifying training datasets while optimizing computational efficiency.
Traditional approaches to enhancing PLMs by scaling model parameters have yielded marginal gains, in part because of reliance on clustered datasets such as UniRef50. By incorporating a broader spectrum of protein sequences, this project enriches the model's ability to capture protein patterns. To tackle the computational challenges posed by larger datasets and the quadratic complexity of transformer attention, the project adapts the Performer architecture and replaces explicit matrix multiplications with ternary-weight operations, reducing computational overhead by 40–50%. This enables training on datasets ranging from 522 million to 2.3 billion sequences without a prohibitive increase in computational cost.
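The core idea behind eliminating explicit matrix multiplication is that once weights are constrained to {-1, 0, +1}, a linear layer reduces to selective additions and subtractions of inputs plus a single scaling. The sketch below is a minimal, hypothetical NumPy illustration of this general technique; the function names, the mean-absolute-value scaling, and the rounding scheme are assumptions for illustration, not the project's actual implementation.

```python
import numpy as np

def ternary_quantize(W):
    """Snap a real-valued weight matrix to {-1, 0, +1} with one shared scale.

    Illustrative scheme: scale by the mean absolute weight, then round
    and clip. Real ternary PLMs may use different scaling/thresholds.
    """
    scale = np.abs(W).mean()
    Wt = np.clip(np.round(W / scale), -1, 1).astype(np.int8)
    return Wt, scale

def ternary_linear(x, Wt, scale):
    """Matmul-free linear layer: each output element is just the sum of
    inputs where the ternary weight is +1, minus the sum where it is -1,
    rescaled once at the end. No multiply-accumulate over weights."""
    out = np.zeros(Wt.shape[0])
    for i in range(Wt.shape[0]):
        pos = x[Wt[i] == 1].sum()   # inputs selected with weight +1
        neg = x[Wt[i] == -1].sum()  # inputs selected with weight -1
        out[i] = scale * (pos - neg)
    return out
```

The additions-only inner loop is where the hardware savings come from: on accelerators, dense floating-point multiply-accumulates are replaced by cheaper accumulation, which is the source of the overhead reduction the project reports.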
Preliminary experiments show improved learning efficiency and faster convergence, indicating that focusing on data expansion and architectural optimization, rather than model scaling alone, can significantly advance PLM capabilities. This advancement holds substantial potential for computational biology, drug discovery, and protein engineering by enabling more accurate protein analysis and accelerating the development of novel therapeutics.
Victor Guallar, Barcelona Supercomputing Center, Spain