AI Technology: Natural Language Processing
Information Retrieval (IR) is crucial for search engines and knowledge discovery, yet current methods struggle with the trade-off between effectiveness and efficiency.
Late-interaction models such as ColBERT offer a middle ground, enabling fine-grained token-level interactions (sketched below) without excessive computational cost. However, existing implementations are outdated and not optimized for modern NLP workloads. This project proposes ModernColBERT, a next-generation retrieval model built on ModernBERT, a state-of-the-art encoder.
Using the Nomic Embed dataset, we aim to train a scalable, efficient, and high-performing ColBERT model, targeting top-tier performance on the Massive Text Embedding Benchmark (MTEB). ModernColBERT will drive advances in frugal, sustainable IR systems, benefiting both research and real-world applications.
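
For intuition, the following is a minimal sketch of the late-interaction (MaxSim) scoring that ColBERT-style models rely on. The tensor shapes, dimensions, and the `maxsim_score` helper are illustrative assumptions; in a real system the token embeddings would come from the ModernBERT encoder rather than random initialization.

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) relevance score.

    query_emb: (num_query_tokens, dim) L2-normalized token embeddings
    doc_emb:   (num_doc_tokens, dim)   L2-normalized token embeddings
    """
    # Token-level similarity matrix: (num_query_tokens, num_doc_tokens)
    sim = query_emb @ doc_emb.T
    # For each query token, keep its best-matching document token, then sum.
    return sim.max(dim=1).values.sum()

# Illustrative usage with random embeddings (hypothetical sizes).
q = torch.nn.functional.normalize(torch.randn(32, 128), dim=-1)
d = torch.nn.functional.normalize(torch.randn(180, 128), dim=-1)
print(maxsim_score(q, d))
```

Because documents can be encoded offline and only this cheap max-over-dot-products step runs at query time, late interaction retains token-level matching while avoiding full cross-attention between query and document.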
Florent Krzakala, EPFL, Switzerland