
Cross-Facility Federated Learning

Awarded Resources: 32,000 node hours
System Partition: MareNostrum5 ACC
Allocation Period: November 2024 - November 2025

AI Technology: Generative Language Modeling

Large language models (LLMs) are today's crucial, state-of-the-art workload for high-performance computing (HPC) systems, owing to their enormous potential impact on the societies and economies of the world's most advanced nations. Big Tech is rushing to acquire the most advanced hardware components needed to build private, special-purpose computing clusters with a single aim: researching and pre-training the next most successful (and most remunerative) LLM. This rush is rapidly scaling by orders of magnitude along multiple, correlated dimensions: model size (now aiming at trillion-parameter models), the FLOPs required for training (now around 1e+24), data centre size (hundreds of thousands of cutting-edge GPUs), and the capital expenditure needed to sustain all of this (in early 2024, Microsoft and OpenAI announced a plan for a $100 billion data centre project).

It thus seems that, lacking the resources to compete, academia and SMEs will be relegated to a marginal role in LLM research and development. This project advocates a new way of using publicly available computing power in general, and of pre-training LLMs in particular, to bridge this computational divide: cross-facility federated learning (xFFL). xFFL leverages federated learning (FL) as an enabling technique to exploit the joint computing power of geographically distributed HPC systems, carrying out large-scale computations (e.g., LLM pre-training) without incurring the constraints of any single HPC infrastructure. This cross-facility approach is modelled as a workflow and can be extended beyond the LLM use case. Specifically, the project aims to prove the capabilities of the xFFL approach through a large-scale pre-training of two state-of-the-art LLMs, LLaMA and Mixtral, spanning multiple Top500 HPC systems. Many smaller-scale experiments have already been completed, proving the maturity of the proposed software stack and selected tools (i.e., purpose-built Singularity containers and the StreamFlow workflow management system).
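For readers unfamiliar with FL, the core idea is that each facility trains a local replica of the model and only the resulting parameters travel between sites, to be combined into a new global model. The Python sketch below illustrates one weighted federated-averaging (FedAvg) round; the parameter names, facility labels, and sample-count weighting are illustrative assumptions, not the project's actual implementation (which pre-trains full-scale LLMs via StreamFlow-orchestrated workflows).

    from typing import Dict, List

    def fedavg(replicas: List[Dict[str, float]], samples: List[int]) -> Dict[str, float]:
        """Combine model replicas trained at different facilities into one
        global model, weighting each replica by the number of local training
        samples it has seen (a common FedAvg weighting; assumed here)."""
        total = sum(samples)
        global_model = {name: 0.0 for name in replicas[0]}
        for replica, n in zip(replicas, samples):
            for name, value in replica.items():
                global_model[name] += value * (n / total)
        return global_model

    # Toy round: two "facilities" return updated parameters after local
    # training; the aggregator produces the next global model.
    replica_a = {"w": 0.8, "b": 0.1}  # hypothetical replica from facility A
    replica_b = {"w": 0.4, "b": 0.3}  # hypothetical replica from facility B
    print(fedavg([replica_a, replica_b], samples=[3000, 1000]))
    # -> {'w': 0.7, 'b': 0.15}

In a cross-facility setting, each local training step in this loop would run as a workflow task on a different HPC system, with only the (comparatively small) parameter updates crossing site boundaries.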