AI Technology: Vision (image recognition, image generation, text recognition/OCR, etc.) | Deep Learning
Foundation models are large, pre-trained architectures that serve as versatile starting points for a wide range of specialised tasks. In computer vision, most existing foundation models rely on two-dimensional (2D) image data, enabling capabilities such as image recognition, visual question answering and image generation. However, these 2D-based models lack a deep understanding of three-dimensional (3D) spatial relationships, leading to hallucinations and inaccuracies that are particularly problematic in applications where precise spatial perception is required, such as autonomous driving, medical imaging and the physical sciences.
This project aims to develop a novel open-source 3D vision foundation model that natively understands 3D motion, structure, and appearance. The model consists of two key modules:
- an image decomposition module that extracts scene maps (e.g., albedo, depth, normals) from images
- a rendering module that synthesises the original image back from these maps, enabling unsupervised training
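The decompose-then-re-render loop described above can be sketched as follows. This is a hypothetical toy illustration only, not the project's actual model: it assumes a simple Lambertian shading renderer and uses fixed placeholder maps in place of the learned decomposition network, purely to show how reconstructing the input image yields an unsupervised training signal.

```python
import numpy as np

def decompose(image):
    """Stand-in for the image decomposition module. In the real model, a
    network would predict scene maps (albedo, depth, normals) from the image;
    here we return fixed placeholder maps of matching spatial size."""
    h, w = image.shape[:2]
    albedo = np.full((h, w, 3), 0.5)          # uniform grey albedo
    depth = np.ones((h, w))                    # flat depth plane
    normals = np.zeros((h, w, 3))
    normals[..., 2] = 1.0                      # all normals face the camera
    return albedo, depth, normals

def render(albedo, normals, light_dir=np.array([0.0, 0.0, 1.0])):
    """Stand-in rendering module: Lambertian shading, I = albedo * max(0, n.l).
    (Depth is unused in this toy shading model but would drive geometry in a
    full renderer.)"""
    shading = np.clip(normals @ light_dir, 0.0, None)
    return albedo * shading[..., None]

def reconstruction_loss(image):
    """Unsupervised objective: decompose the image into scene maps, re-render
    it, and penalise the pixel-wise difference from the input. No labels are
    needed, because the input image itself is the supervision signal."""
    albedo, depth, normals = decompose(image)
    recon = render(albedo, normals)
    return float(np.mean((recon - image) ** 2))

# Toy "observed" image that the placeholder maps happen to reconstruct exactly.
img = np.full((4, 4, 3), 0.5)
loss = reconstruction_loss(img)
```

In the actual project, `decompose` would be a trained network and `render` a differentiable renderer, so the reconstruction loss could be backpropagated to train the decomposition module end to end without ground-truth scene maps.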
The project will leverage a large-scale dataset of over 6 million multi-view frames from both synthetic and real-world sources to ensure the robustness and generalisability of the model. Ultimately, this model aims to enable more reliable understanding of 3D environments, which will be central to the dependable operation of many next-generation technologies in robotics, graphics and the physical sciences.
Ronald Clark, University of Oxford, United Kingdom