AI Technology: Vision (image recognition, image generation, text recognition (OCR), etc.) | Natural Language Processing
The goal of this project is to design a framework that unifies multiple video grounding tasks. The aim is to develop a video-language model flexible enough to take different types of inputs and produce different types of outputs for various spatio-temporal grounding problems. The core idea of the proposed model is to use an LLM both to understand the requirements of each task and to generate predictions as a sequence, while a visual grounding model ingests language embeddings that guide the detection, in the video, of the entities described in the language.
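To make the two-branch design concrete, below is a minimal PyTorch sketch of one way such a model could be wired together. Everything here is an illustrative assumption rather than the project's actual architecture: the class name UnifiedVideoGrounder, the dimensions, and the small Transformer encoder standing in for a pretrained LLM are all hypothetical. The sketch only shows the stated idea: the language branch emits predictions as a token sequence, and its embeddings condition a grounding head that localizes the described entities per frame.

```python
# Minimal sketch (assumed names and dimensions, not the project's code):
# an LLM stand-in encodes the task prompt and decodes predictions as a
# token sequence, while a grounding head cross-attends video features to
# the language embeddings and regresses one box per frame.
import torch
import torch.nn as nn


class UnifiedVideoGrounder(nn.Module):
    def __init__(self, d_model=256, vocab_size=1000):
        super().__init__()
        # Stand-in for a pretrained LLM: a small Transformer encoder that
        # embeds the prompt and emits prediction tokens (e.g., time spans).
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)
        # Grounding head: language embeddings guide spatial localization.
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4,
                                                batch_first=True)
        self.box_head = nn.Linear(d_model, 4)  # (cx, cy, w, h) per frame

    def forward(self, video_feats, prompt_tokens):
        # video_feats: (B, T, d) frame features from a visual backbone
        # prompt_tokens: (B, L) tokenized task prompt / referring expression
        lang = self.llm(self.token_emb(prompt_tokens))        # (B, L, d)
        pred_tokens = self.lm_head(lang)                      # sequence output
        # Video frames attend to the language embeddings.
        fused, _ = self.cross_attn(video_feats, lang, lang)   # (B, T, d)
        boxes = self.box_head(fused).sigmoid()                # (B, T, 4)
        return pred_tokens, boxes


# Toy usage: 2 clips of 16 frames, prompts of 8 tokens.
model = UnifiedVideoGrounder()
tokens, boxes = model(torch.randn(2, 16, 256),
                      torch.randint(0, 1000, (2, 8)))
print(tokens.shape, boxes.shape)  # (2, 8, 1000), (2, 16, 4)
```

In this reading, varying the prompt is what lets one model cover different grounding tasks: the same grounding head serves referring expression comprehension or spatio-temporal grounding, with the task specified entirely in language.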
While existing models excel at semantic tasks like captioning, they perform poorly on spatio-temporal reasoning, which is critical for areas such as robotics and self-driving cars. To address this, the project proposes a model that unifies various video grounding tasks, such as spatio-temporal grounding and referring expression comprehension, offering a more comprehensive approach than current models that typically focus on isolated tasks.
The project can have significant impact across various disciplines. For example, it can help robots manipulate objects given natural-language instructions, or it can enhance self-driving cars with grounded conversation between the driver and the car for safer, more effective driving. Finally, it can enable more fine-grained video retrieval, returning spatio-temporal locations in the video that match user queries.
Josef Sivic, Czech Technical University in Prague, Czechia