AI Technology: Vision (image recognition, image generation, text recognition (OCR), etc.)
Video diffusion models aim to generate or enhance video content by learning from large datasets of real videos, producing sequences that are both realistic and temporally coherent from frame to frame.
These models are computationally demanding because they must process and synthesize high-dimensional data across both spatial and temporal dimensions, requiring substantial memory and compute to capture complex dynamics and keep long sequences consistent.
Nevertheless, they are increasingly important in research and industry, including film production, video game development, and virtual reality: lifelike video content enhances realism, immersion, and user experience, driving innovation in how people interact with digital media.
This research proposal outlines the development of a transformer-based methodology for generating long videos with diffusion models. First, we propose a causal encoder that compresses images and videos into a shared latent space, enabling cross-modality training and generation.
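To make the causal-encoder idea concrete, the minimal sketch below shows one way such an encoder could map both modalities into a shared latent space: 3D convolutions padded only toward the past on the time axis, so a single image is simply a one-frame video encoded by the same network. The module names, layer widths, and kernel sizes (CausalConv3d, CausalEncoder, etc.) are illustrative assumptions, not the project's actual model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """3D convolution padded only on the left of the time axis,
    so the output at frame t never depends on frames t+1, t+2, ..."""
    def __init__(self, in_ch, out_ch, kernel=(3, 3, 3), stride=(1, 2, 2)):
        super().__init__()
        kt, kh, kw = kernel
        # symmetric spatial padding, causal (past-only) temporal padding
        self.pad = (kw // 2, kw // 2, kh // 2, kh // 2, kt - 1, 0)
        self.conv = nn.Conv3d(in_ch, out_ch, kernel, stride=stride)

    def forward(self, x):                      # x: (B, C, T, H, W)
        return self.conv(F.pad(x, self.pad))

class CausalEncoder(nn.Module):
    """Maps videos (and images treated as T=1 videos) into one latent space."""
    def __init__(self, in_ch=3, latent_ch=8, width=64):
        super().__init__()
        self.net = nn.Sequential(
            CausalConv3d(in_ch, width),            # H, W -> H/2, W/2
            nn.SiLU(),
            CausalConv3d(width, 2 * width),        # H/2, W/2 -> H/4, W/4
            nn.SiLU(),
            CausalConv3d(2 * width, latent_ch, stride=(1, 1, 1)),
        )

    def forward(self, x):
        return self.net(x)

# An image is just a one-frame video, so both modalities share one encoder.
encoder = CausalEncoder()
video = torch.randn(1, 3, 16, 64, 64)              # (B, C, T, H, W)
image = torch.randn(1, 3, 1, 64, 64)
z_video, z_image = encoder(video), encoder(image)  # latents in a shared space
```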
To improve memory use and training efficiency, the team will design a window-attention architecture tailored for joint spatial and spatiotemporal generative modeling. The project will also explore a novel approach to training large-scale video diffusion models efficiently with masked modeling.
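As an illustration of why window attention saves memory, the sketch below restricts self-attention to non-overlapping spatiotemporal windows, so the quadratic cost is paid per window rather than over all T*H*W tokens at once. The window shape, head count, and class name are assumptions for the example, not the proposed architecture.

```python
import torch
import torch.nn as nn

class WindowAttention(nn.Module):
    """Self-attention restricted to non-overlapping spatiotemporal windows.
    Cost drops from O((T*H*W)^2) to O(num_windows * window_size^2)."""
    def __init__(self, dim, heads=4, window=(4, 8, 8)):   # (t, h, w) window
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                      # x: (B, T, H, W, C)
        B, T, H, W, C = x.shape
        wt, wh, ww = self.window
        # partition tokens into windows: (B * num_windows, wt*wh*ww, C)
        x = x.view(B, T // wt, wt, H // wh, wh, W // ww, ww, C)
        x = x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, wt * wh * ww, C)
        out, _ = self.attn(x, x, x)
        # undo the partition back to (B, T, H, W, C)
        out = out.view(B, T // wt, H // wh, W // ww, wt, wh, ww, C)
        out = out.permute(0, 1, 4, 2, 5, 3, 6, 7).reshape(B, T, H, W, C)
        return out

block = WindowAttention(dim=64)
tokens = torch.randn(2, 8, 16, 16, 64)         # latent video tokens
print(block(tokens).shape)                     # torch.Size([2, 8, 16, 16, 64])
```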
Although masked transformers have been studied extensively for visual representation learning, their potential for long-video generative learning remains largely untapped. This project aims to close that gap by introducing an effective masked modeling strategy tailored to large-scale, transformer-based video diffusion.
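A hedged sketch of what masked training for a diffusion backbone could look like: latent video tokens are noised, a random subset is dropped, and the network denoises only the visible tokens, cutting per-step compute roughly in proportion to the mask ratio. The noise schedule, mask ratio, and toy backbone below are illustrative assumptions, not the project's method.

```python
import math
import torch
import torch.nn as nn

def masked_diffusion_step(backbone, z0, mask_ratio=0.5):
    """One illustrative training step: noise latent tokens, keep a random
    subset, and compute the denoising loss only on the visible tokens."""
    B, N, C = z0.shape                          # (batch, tokens, channels)
    t = torch.rand(B, 1, 1)                     # noise level in (0, 1)
    alpha_bar = torch.cos(t * math.pi / 2) ** 2 # simple cosine schedule (assumed)
    noise = torch.randn_like(z0)
    zt = alpha_bar.sqrt() * z0 + (1 - alpha_bar).sqrt() * noise

    keep = int(N * (1 - mask_ratio))
    idx = torch.rand(B, N).argsort(dim=1)[:, :keep]    # random subset per sample
    gather = idx.unsqueeze(-1).expand(-1, -1, C)
    zt_vis, noise_vis = zt.gather(1, gather), noise.gather(1, gather)

    pred = backbone(zt_vis)                     # backbone only sees visible tokens
    return ((pred - noise_vis) ** 2).mean()     # epsilon-prediction loss

# toy stand-in for the windowed video transformer
backbone = nn.Sequential(nn.Linear(8, 256), nn.GELU(), nn.Linear(256, 8))
latents = torch.randn(4, 1024, 8)               # flattened spatiotemporal latents
loss = masked_diffusion_step(backbone, latents)
loss.backward()
```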
Volker Tresp, Ludwig Maximilian University of Munich (LMU) - Germany