AI Technology: Natural Language Processing
Information Retrieval (IR) is crucial for search engines and knowledge discovery, yet current methods struggle with the trade-off between effectiveness and efficiency.
Late-interaction models such as ColBERT offer a middle ground, enabling fine-grained token-level interactions (sketched below) without excessive computational cost. However, existing implementations are outdated and not optimized for modern NLP workloads. This project proposes ModernColBERT, a next-generation retrieval model built on ModernBERT, a state-of-the-art encoder.
Using the Nomic Embed dataset, we aim to train a scalable, efficient, and high-performing ColBERT model, targeting top-tier performance on the Massive Text Embedding Benchmark (MTEB). ModernColBERT will drive advances in frugal, sustainable IR systems, benefiting both research and real-world applications.
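
For intuition, the following is a minimal sketch of the late-interaction (MaxSim) scoring that ColBERT-style models rely on. The tensor shapes, dimensions, and the `maxsim_score` helper are illustrative assumptions; in a real system the token embeddings would come from the ModernBERT encoder rather than random initialization.

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) relevance score.

    query_emb: (num_query_tokens, dim) L2-normalized token embeddings
    doc_emb:   (num_doc_tokens, dim)   L2-normalized token embeddings
    """
    # Token-level similarity matrix: (num_query_tokens, num_doc_tokens)
    sim = query_emb @ doc_emb.T
    # For each query token, keep its best-matching document token, then sum.
    return sim.max(dim=1).values.sum()

# Illustrative usage with random embeddings (hypothetical sizes).
q = torch.nn.functional.normalize(torch.randn(32, 128), dim=-1)
d = torch.nn.functional.normalize(torch.randn(180, 128), dim=-1)
print(maxsim_score(q, d))
```

Because documents can be encoded offline and only this cheap max-over-dot-products step runs at query time, late interaction retains token-level matching while avoiding full cross-attention between query and document.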
Florent Krzakala, EPFL, Switzerland