
Sustainable Language Modeling through Quantization-Aware Continual Pre-training and Instruction Tuning with Ternary Weights

Awarded Resources: 50,000 node hours
System Partition: Leonardo BOOSTER
Allocation Period: January 2025 - January 2026

AI Technology: Generative Language Modeling & Natural Language Processing.

Large language models (LLMs) require immense resources for training and inference. 

Quantization, a technique that reduces the precision of model parameters, offers a promising solution for improving LLM efficiency and sustainability. 

While post-training quantization methods typically achieve 4-8 bits per parameter, recent research suggests that training LLMs with 1.58 bits per parameter (three possible weight values: -1, 0, and 1) from scratch can maintain model accuracy while greatly reducing memory requirements and energy consumption at inference time. However, training 1.58-bit models from scratch requires even more resources than training standard models. 
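To make the ternary scheme concrete, the following is a minimal sketch in PyTorch of mapping full-precision weights to the three values -1, 0, and 1 using an absmean-style scale, as popularized in recent 1.58-bit work; the function name `ternary_quantize` is illustrative and not part of the project's codebase.

```python
import torch

def ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    """Quantize a weight tensor to the three values {-1, 0, +1}.

    Scales by the mean absolute weight (absmean), then rounds and clips
    to the ternary grid. Returns the quantized weights and the scale
    needed to dequantize (w_q * scale approximates w).
    """
    scale = w.abs().mean().clamp(min=eps)      # per-tensor absmean scale
    w_q = (w / scale).round().clamp(-1, 1)     # ternary values in {-1, 0, 1}
    return w_q, scale

# Example: quantizing a random weight matrix
w = torch.randn(4, 4)
w_q, scale = ternary_quantize(w)
print(w_q)                              # entries are -1, 0, or 1
print((w_q * scale - w).abs().mean())   # average quantization error
```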

Therefore, the project will investigate quantization-aware continual pre-training as a more efficient alternative to training 1.58-bit LLMs from scratch. In other words, the project will develop methods for seamlessly transitioning standard 16/32-bit LLMs to 1.58-bit precision through continual pre-training, backed by a systematic comparison under fair conditions. 
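One common way to realize such quantization-aware training is to keep full-precision latent weights and quantize them on the fly in the forward pass, while gradients bypass the rounding via a straight-through estimator. The sketch below illustrates the idea in PyTorch; the class name `TernaryLinear` and all details are hypothetical and not the project's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TernaryLinear(nn.Module):
    """Linear layer with ternary weights and a straight-through estimator.

    Keeps full-precision "latent" weights for the optimizer; the forward
    pass uses their ternary quantization, while gradients pass through
    to the latent weights unchanged (STE).
    """

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scale = self.weight.abs().mean().clamp(min=1e-5)
        w_q = (self.weight / scale).round().clamp(-1, 1) * scale
        # Straight-through estimator: use w_q in the forward pass,
        # but let gradients flow as if w_q were self.weight.
        w_ste = self.weight + (w_q - self.weight).detach()
        return F.linear(x, w_ste)

# Continual pre-training sketch: layers like this would replace the linear
# layers of an existing 16/32-bit model, which is then trained further.
layer = TernaryLinear(8, 4)
out = layer(torch.randn(2, 8))
out.sum().backward()   # gradients reach the latent full-precision weights
```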

The team will further conduct a seminal investigation of 1.58-bit instruction tuning of language models, scrutinizing its feasibility and its interactions with 1.58-bit pre-training. The research team has expertise in efficient and scalable natural language processing as well as in continual machine learning, which is key for continual pre-training of language models. 

The project will use the team's existing custom codebases, which build on the OLMo project and the HuggingFace software stack and have demonstrated scalability on NVIDIA GPU clusters. 

Successful completion of this project will contribute to the development of more sustainable and accessible LLMs by substantially reducing the computational and energy requirements of serving them: inference throughput can improve by roughly a factor of 10, while memory requirements are reduced by a factor of 8 and energy consumption by a factor of 30. 

The project's findings on quantization-aware continual pre-training and instruction fine-tuning will be disseminated through high-impact publications.