AI Technology: Vision (image recognition, image generation, text recognition/OCR, etc.) | Deep Learning
Foundation models are large, pre-trained architectures that serve as versatile starting points for a wide range of specialised tasks. In computer vision, most existing foundation models rely on two-dimensional (2D) image data, enabling capabilities such as image recognition, visual question answering and image generation. However, these 2D-based models lack a deep understanding of three-dimensional (3D) spatial relationships, leading to hallucinations and inaccuracies that are particularly problematic in applications where precise spatial perception is required, such as autonomous driving, medical imaging and the physical sciences.
This project aims to develop a novel open-source 3D vision foundation model that natively understands 3D motion, structure, and appearance. The model consists of two key modules:
- an image decomposition module that extracts scene maps (e.g., albedo, depth, normals) from images
- a rendering module that synthesises the original image back from these maps, enabling unsupervised training
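The decompose-then-re-render loop described above can be sketched as follows. This is a hypothetical toy illustration only, not the project's actual model: it assumes a simple Lambertian shading renderer and uses fixed placeholder maps in place of the learned decomposition network, purely to show how reconstructing the input image yields an unsupervised training signal.

```python
import numpy as np

def decompose(image):
    """Stand-in for the image decomposition module. In the real model, a
    network would predict scene maps (albedo, depth, normals) from the image;
    here we return fixed placeholder maps of matching spatial size."""
    h, w = image.shape[:2]
    albedo = np.full((h, w, 3), 0.5)          # uniform grey albedo
    depth = np.ones((h, w))                    # flat depth plane
    normals = np.zeros((h, w, 3))
    normals[..., 2] = 1.0                      # all normals face the camera
    return albedo, depth, normals

def render(albedo, normals, light_dir=np.array([0.0, 0.0, 1.0])):
    """Stand-in rendering module: Lambertian shading, I = albedo * max(0, n.l).
    (Depth is unused in this toy shading model but would drive geometry in a
    full renderer.)"""
    shading = np.clip(normals @ light_dir, 0.0, None)
    return albedo * shading[..., None]

def reconstruction_loss(image):
    """Unsupervised objective: decompose the image into scene maps, re-render
    it, and penalise the pixel-wise difference from the input. No labels are
    needed, because the input image itself is the supervision signal."""
    albedo, depth, normals = decompose(image)
    recon = render(albedo, normals)
    return float(np.mean((recon - image) ** 2))

# Toy "observed" image that the placeholder maps happen to reconstruct exactly.
img = np.full((4, 4, 3), 0.5)
loss = reconstruction_loss(img)
```

In the actual project, `decompose` would be a trained network and `render` a differentiable renderer, so the reconstruction loss could be backpropagated to train the decomposition module end to end without ground-truth scene maps.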
The project will leverage a large-scale dataset of over 6 million multi-view frames from both synthetic and real-world sources to ensure the robustness and generalisability of the model. Ultimately, this model aims to enable more reliable understanding of 3D environments, which will be central to the dependable operation of many next-generation technologies in robotics, graphics and the physical sciences.
Ronald Clark, University of Oxford, United Kingdom