The objective of this project is to develop a lidar foundation model that supports promptable 3D segmentation and detection and generalises strongly across diverse autonomous driving datasets.
Taking inspiration from the highly successful Segment Anything (SAM) foundation model for image segmentation, the proposed architecture comprises three key components: (i) a 3D-to-BEV embedding stem that addresses the sensitivity of lidar data to varying point sampling patterns by projecting the point cloud into a shared embedding space under a bird's-eye view (BEV) representation; (ii) a Vision Transformer (ViT) backbone pre-trained on large-scale image datasets; and (iii) promptable task-specific heads that, as in SAM, let users indicate the desired segmentation flexibly and intuitively through prompts such as foreground/background points, rough boxes, or free-form text. A minimal sketch of this design is given below.
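The following PyTorch sketch illustrates how these three components could fit together: a BEV stem that scatters points onto a grid and patchifies it into tokens, a transformer encoder standing in for the pre-trained ViT backbone, and a point-prompt-conditioned mask head. All module names, grid sizes, and dimensions are illustrative assumptions, not the actual project architecture.

```python
# Minimal sketch of the proposed 3-component design (illustrative, not the real model).
import torch
import torch.nn as nn


class BEVEmbeddingStem(nn.Module):
    """Scatter lidar points into a BEV grid, then patchify into transformer tokens."""

    def __init__(self, grid=64, cell_feats=8, embed_dim=256, patch=4):
        super().__init__()
        self.grid, self.cell_feats = grid, cell_feats
        self.proj = nn.Conv2d(cell_feats, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, points):                        # points: (N, 3), x/y normalised to [0, 1)
        bev = torch.zeros(self.cell_feats, self.grid, self.grid)
        ix = (points[:, 0] * self.grid).long().clamp(0, self.grid - 1)
        iy = (points[:, 1] * self.grid).long().clamp(0, self.grid - 1)
        bev[0].index_put_((iy, ix), torch.ones(len(points)), accumulate=True)  # point density
        bev[1].index_put_((iy, ix), points[:, 2], accumulate=True)             # summed height
        tokens = self.proj(bev.unsqueeze(0))          # (1, D, grid/patch, grid/patch)
        return tokens.flatten(2).transpose(1, 2)      # (1, num_tokens, D)


class PromptableLidarSegmenter(nn.Module):
    """BEV stem -> ViT-style backbone -> prompt-conditioned mask head."""

    def __init__(self, embed_dim=256):
        super().__init__()
        self.stem = BEVEmbeddingStem(embed_dim=embed_dim)
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)  # stand-in for a pre-trained ViT
        self.prompt_encoder = nn.Linear(3, embed_dim)               # (x, y, fg/bg flag) point prompt
        self.mask_head = nn.Linear(embed_dim, 1)

    def forward(self, points, prompt):
        tokens = self.backbone(self.stem(points))                     # (1, T, D)
        query = self.prompt_encoder(prompt).view(1, 1, -1)            # (1, 1, D)
        attn = torch.softmax(query @ tokens.transpose(1, 2), dim=-1)  # (1, 1, T): prompt attends to BEV tokens
        fused = tokens + attn.transpose(1, 2) @ query                 # broadcast prompt embedding onto tokens
        return self.mask_head(fused)                                  # (1, T, 1): per-token mask logits


# Example: segment around a foreground point prompt at BEV location (0.5, 0.5).
model = PromptableLidarSegmenter()
pts = torch.rand(1000, 3)
logits = model(pts, torch.tensor([0.5, 0.5, 1.0]))
print(logits.shape)  # torch.Size([1, 256, 1])
```

In practice the backbone would load ViT weights pre-trained on images, and the mask head would decode at full BEV resolution; the cross-attention-style fusion above only indicates where the prompt enters the computation.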
The training process will involve two phases: pre-training, in which the architecture is distilled from SAM image features, and fine-tuning, in which the model is refined on human annotations and on pseudo-labels generated with SAM; both objectives are sketched below. The requested computational resources will be dedicated to these two training phases.
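The sketch below illustrates plausible loss functions for the two phases, assuming SAM image features and camera-lidar point-to-pixel correspondences are available; function names, the cosine distillation objective, and the pseudo-label weighting are illustrative assumptions rather than the project's committed design.

```python
# Hedged sketch of the two training phases (illustrative objectives only).
import torch
import torch.nn.functional as F


def distillation_loss(lidar_feats, sam_feats):
    """Phase 1: align per-point lidar embeddings with SAM image features
    sampled at the pixels each point projects to (cosine distance)."""
    return 1.0 - F.cosine_similarity(lidar_feats, sam_feats, dim=-1).mean()


def finetune_loss(mask_logits, labels, is_pseudo, pseudo_weight=0.5):
    """Phase 2: supervised segmentation on human annotations, with
    down-weighted pseudo-labels generated by prompting SAM on images."""
    per_point = F.binary_cross_entropy_with_logits(mask_logits, labels, reduction="none")
    weights = torch.where(is_pseudo,
                          torch.full_like(labels, pseudo_weight),
                          torch.ones_like(labels))
    return (weights * per_point).mean()


# Toy example with random tensors standing in for real features and labels.
lidar_feats = torch.randn(1000, 256)   # per-point features from the lidar model
sam_feats = torch.randn(1000, 256)     # SAM features at the projected pixels
print(distillation_loss(lidar_feats, sam_feats))

mask_logits = torch.randn(1000)
labels = torch.randint(0, 2, (1000,)).float()
is_pseudo = torch.rand(1000) > 0.7     # roughly 30% pseudo-labelled points
print(finetune_loss(mask_logits, labels, is_pseudo))
```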
Ultimately, the outcome of this project will be a versatile lidar model driving advancements in driver-assistance and self-driving systems, bolstering safety, reliability, and innovation within the autonomous driving industry.
Valeo, France.