Recent advances in language modeling, and AI systems more broadly, have largely been driven by scale (i.e., training larger models on larger datasets). Increased scale leads to increased computational cost, and today's leading models are typically trained on supercomputer clusters with large numbers of graphics processing units (GPUs). Standard language model training pipelines are additionally communication-intensive, necessitating high-speed interconnects between GPUs and between GPU nodes.
These requirements make it challenging or impossible for researchers to develop state-of-the-art models unless they have continual access to a high-capacity cluster. Federated learning provides a possible alternative paradigm, in which individual workers perform local updates and only periodically synchronize with one another. By dramatically decreasing the frequency of communication between workers, the total communication cost can be substantially lowered, enabling a form of decentralized training that alleviates the need for centralized compute.
Recent methods like DiLoCo have demonstrated that federated learning can be successfully applied to large language model (LLM) training, but public DiLoCo-based training runs have been relatively small-scale. This project will perform cross-datacenter training of a state-of-the-art multilingual LLM at an unprecedented scale (tens of billions of parameters trained on tens of trillions of tokens) using coordinated allocations across multiple supercomputers.
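For concreteness, one round of DiLoCo-style training looks roughly like the following minimal PyTorch sketch. It is illustrative only, written under simplifying assumptions: the `workers` objects, their `next_batch` method, and the Hugging-Face-style `.loss` interface are hypothetical placeholders, and in practice the inner loops run in parallel across datacenters rather than sequentially.

```python
# Sketch of one DiLoCo-style round: each worker runs many communication-free
# inner AdamW steps, workers average their parameter deltas ("pseudo-gradients"),
# and a persistent outer Nesterov-momentum optimizer applies the averaged delta
# to the shared model. Illustrative assumption, not the project's actual code.
import copy
import torch

def diloco_round(global_model, outer_opt, workers, inner_steps=500, inner_lr=1e-4):
    global_params = [p.detach().clone() for p in global_model.parameters()]
    deltas = [torch.zeros_like(p) for p in global_params]

    for worker in workers:  # in practice these run in parallel, one per cluster
        local_model = copy.deepcopy(global_model)
        inner_opt = torch.optim.AdamW(local_model.parameters(), lr=inner_lr)
        for _ in range(inner_steps):          # no cross-worker communication here
            batch = worker.next_batch()       # hypothetical local data source
            loss = local_model(**batch).loss  # hypothetical loss interface
            loss.backward()
            inner_opt.step()
            inner_opt.zero_grad()
        # accumulate this worker's pseudo-gradient: (global - local) parameters
        for d, gp, lp in zip(deltas, global_params, local_model.parameters()):
            d += (gp - lp.detach()) / len(workers)

    # outer update on the averaged pseudo-gradient; outer_opt is created once,
    # e.g. torch.optim.SGD(..., momentum=0.9, nesterov=True), and reused each round
    for p, d in zip(global_model.parameters(), deltas):
        p.grad = d
    outer_opt.step()
    outer_opt.zero_grad()
```

Because workers only exchange parameter deltas once every few hundred inner steps, the interconnect between sites needs to carry orders of magnitude less traffic than in standard fully synchronous data-parallel training.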
Prof Colin Raffel, Hugging Face