AI Technology: Vision (image recognition, image generation, text recognition OCR, etc.); Deep Learning.
The recent introduction of Visual AutoRegressive modeling (VAR) offers a transformative approach to autoregressive learning on images, employing a novel coarse-to-fine strategy for "next-scale prediction" or "next-resolution prediction" that diverges from traditional raster-scan "next-token prediction."
This methodology has been shown to improve the learning speed and generalization of autoregressive transformers, enabling VAR to surpass diffusion transformers in image generation performance. The project aims to extend VAR by generalizing its class-conditional guidance mechanism to free-form textual prompts.
This extension will build on the foundational strengths of VAR (its reported improvements in metrics such as the Fréchet Inception Distance (FID) and the Inception Score (IS), alongside its reported 20x faster inference speed) to tailor and refine image generation according to versatile textual descriptions.
By incorporating textual prompts, our extended text-VAR model (t-VAR) will not only inherit VAR’s demonstrated capabilities in improving image quality, inference speed, data efficiency, and scalability but also add a layer of adaptability and specificity to the generation process.
This will potentially unlock more sophisticated and targeted applications in image generation, such as personalized content creation and dynamic media adaptation.
The team's approach aims to exploit the scaling laws demonstrated for VAR and to increase the model's utility in practical, user-defined scenarios, thereby broadening the horizon for visual generation and unified learning applications.
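
To make the contrast between raster-scan "next-token prediction" and coarse-to-fine "next-scale prediction" concrete, the following minimal sketch counts the sequential generation steps each strategy would need. The scale schedule (1, 2, 4, 8, 16) and the function names are illustrative assumptions, not the configuration used by VAR or by this project.

```python
# Illustrative sketch only: the scale schedule and helper names below are
# assumptions chosen for clarity, not the project's actual configuration.

def raster_scan_steps(side):
    """Next-token prediction: one sequential step per token of a side x side map."""
    return side * side

def next_scale_plan(scales):
    """Next-scale prediction: one sequential step per scale.

    All tokens of a given scale are predicted together, conditioned on the
    coarser scales generated so far.
    Returns (number of sequential steps, tokens predicted at each step).
    """
    tokens_per_step = [s * s for s in scales]
    return len(scales), tokens_per_step

scales = [1, 2, 4, 8, 16]   # hypothetical coarse-to-fine token-map side lengths
side = scales[-1]           # the finest token map is 16 x 16

ar_steps = raster_scan_steps(side)
vs_steps, tokens = next_scale_plan(scales)

print(f"raster-scan: {ar_steps} sequential steps")
print(f"next-scale:  {vs_steps} sequential steps, tokens per step = {tokens}")
```

Under these assumptions the coarse-to-fine strategy predicts more tokens in total but needs far fewer sequential steps, which is the intuition behind the faster inference described above.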
Dimosthenis Karatzas, Universitat Autonoma de Barcelona - Spain