AI Technology: Deep Learning; Vision (image recognition, image generation, text recognition OCR, etc.); Natural Language Processing.
Document Understanding involves analyzing documents to extract and interpret textual content, complex structures, layouts, graphical elements, and handwritten information.
This capability is essential for applications such as information retrieval, business process automation, and document-based question answering. Traditional Large Language Models (LLMs) fall short in handling these diverse aspects of documents, so models designed specifically for Document Understanding are crucial.
These models, typically encoder-decoder transformers, process both the text produced by OCR systems and the visual information in document images, and use enhancements such as 2D embeddings to better capture layout.
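As a rough illustration of what a 2D layout embedding can look like, the sketch below buckets the bounding-box coordinates of an OCR token and adds one learned vector per coordinate to the token's text embedding. It is a minimal sketch under stated assumptions, not the project's implementation: the names `make_table`, `bucket`, and `embed_token`, the bucket count, and the choice to sum the vectors are all hypothetical, and the tables are random stand-ins for learned parameters.

```python
import random


def make_table(rows, dim, rng):
    """Random stand-in for a learned embedding table (rows x dim)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(rows)]


def bucket(coord, size, n_buckets):
    """Map a pixel coordinate on an axis of length `size` to a bucket index."""
    idx = int(coord / size * n_buckets)
    return min(max(idx, 0), n_buckets - 1)  # clamp to a valid row


def embed_token(token_vec, box, page_w, page_h, x_table, y_table):
    """Add 2D layout embeddings for box = (x0, y0, x1, y1) to a text embedding.

    Assumption: one table per axis, shared by both corners, and the four
    looked-up vectors are summed with the text vector.
    """
    x0, y0, x1, y1 = box
    n = len(x_table)
    parts = [
        x_table[bucket(x0, page_w, n)],
        y_table[bucket(y0, page_h, n)],
        x_table[bucket(x1, page_w, n)],
        y_table[bucket(y1, page_h, n)],
    ]
    out = list(token_vec)
    for part in parts:
        out = [a + b for a, b in zip(out, part)]
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    dim, n_buckets = 8, 100
    x_table = make_table(n_buckets, dim, rng)
    y_table = make_table(n_buckets, dim, rng)
    text_vec = [0.0] * dim  # placeholder text embedding
    vec = embed_token(text_vec, (120, 40, 260, 70), 1000, 1400, x_table, y_table)
    print(len(vec))
```

Two tokens with the same text but different positions on the page receive different input vectors, which is what lets the encoder take layout into account.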
The project proposes pretraining six Visual-T5 (VT5) models, which augment the T5 model with layout and image embeddings: three single-page models (VT5-base, VT5-large, VT5-xl) and three multi-page models (MP-VT5-base, MP-VT5-large, MP-VT5-xl). These models will be pretrained on the extensive OCR-IDL dataset, using Layout-Aware Text Denoising as the self-supervised objective.
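The abstract does not spell out the Layout-Aware Text Denoising objective, so the following is only a sketch of one plausible reading: T5-style span corruption in which each masked span is replaced by a sentinel token that keeps a coarse location, here the union of the masked tokens' bounding boxes. The function names, the union-box choice, and the sample data are assumptions made for illustration; the `<extra_id_N>` sentinel naming follows the convention of the Hugging Face T5 tokenizer.

```python
def union_box(boxes):
    """Smallest box (x0, y0, x1, y1) that encloses all given boxes."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def layout_aware_denoise(tokens, boxes, spans):
    """Build a (corrupted input, input boxes, target) triple.

    `spans` is a sorted list of non-overlapping, non-empty half-open
    (start, end) index pairs to mask. Assumption: each sentinel inherits
    the union box of the tokens it replaces, so the encoder still sees
    roughly where the hidden text was.
    """
    inp_tokens, inp_boxes, target = [], [], []
    pos = 0
    for sid, (start, end) in enumerate(spans):
        # Copy the unmasked tokens that precede this span.
        inp_tokens.extend(tokens[pos:start])
        inp_boxes.extend(boxes[pos:start])
        sentinel = f"<extra_id_{sid}>"
        inp_tokens.append(sentinel)
        inp_boxes.append(union_box(boxes[start:end]))
        # The decoder target lists each sentinel followed by the hidden tokens.
        target.append(sentinel)
        target.extend(tokens[start:end])
        pos = end
    inp_tokens.extend(tokens[pos:])
    inp_boxes.extend(boxes[pos:])
    return inp_tokens, inp_boxes, target


if __name__ == "__main__":
    # Hypothetical OCR output for one line of an invoice.
    tokens = ["Invoice", "No", "12345", "Total", "99"]
    boxes = [(10, 10, 80, 30), (90, 10, 110, 30), (120, 10, 180, 30),
             (10, 50, 60, 70), (70, 50, 90, 70)]
    print(layout_aware_denoise(tokens, boxes, [(1, 3)]))
```

Because the targets are derived from the OCR output itself, no manual labels are needed, which is what makes the objective self-supervised.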
Expected outcomes include state-of-the-art results in Document Understanding, open-source release of the models and code, and robust, scalable models for a variety of applications. The initiative aims to advance Document Understanding, foster innovation, and make document processing more efficient.
Dimosthenis Karatzas, Universidad Autonoma de Barcelona - Spain