AI Technology: Machine Learning, Vision (image recognition, image generation, text recognition OCR, etc.)
Object-centric representations have achieved impressive performance on synthetic datasets. A key premise of object-centric representations is that complex real-world scenes can be efficiently and meaningfully summarized in terms of objects for the decision-making process of robotic agents such as self-driving cars.
Leveraging the progress in self-supervised learning, we propose learning efficient object-centric representations from large data collections such as YouTube. The learned representation will then serve as input to a decision-making framework that learns driving behavior offline from self-driving datasets.
Our key motivation is to benefit from structured, compositional object-centric representations while modeling frequently occurring interactions between agents, such as negotiation scenarios at intersections.
Compared to commonly used representation spaces such as pixels or Bird’s Eye View (BEV), we can significantly improve efficiency by encoding the scene around the agent as a compact set of object-centric representations. Beyond everyday traffic scenarios, our most ambitious goal is to reason about less frequent but still safety-critical scenarios using object representations learned from generic video datasets such as YouTube.
Consider a scenario where a ball is rolling in front of the vehicle. A human driver can anticipate a kid running after it in a few seconds, but current driving solutions cannot even recognize the ball given the lack of diversity in regular driving datasets.
With object-centric representations learned from YouTube videos, we can not only recognize generic objects that are rare or even non-existent in driving datasets, but also incorporate the necessary precautions into the decision-making process.
Object-centric representations improve safety and achieve interpretability by decomposing the world into semantically meaningful objects.
In summary, we propose to:
- learn object-centric representations from large generic video datasets and transfer them to the driving task;
- relate agents to each other across time and predict the future for each;
- learn how to act based on future-aware object-centric representations.
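
The three steps above can be illustrated with a toy sketch built around the rolling-ball scenario. Everything in it is an assumption made for illustration and not the proposed method: objects are reduced to hypothetical 2D centre points in the ego vehicle's frame instead of learned representations, greedy nearest-neighbour matching stands in for a learned module that relates objects across time, constant-velocity extrapolation stands in for a learned future predictor, and a hand-written corridor rule stands in for a learned driving policy. The function names, `lane_half_width`, and `lookahead` are likewise hypothetical.

```python
import math


def match_objects(prev, curr):
    """Relate objects across two frames by greedy nearest-neighbour matching.

    prev, curr: lists of (x, y) object centres in metres, ego frame
    (x forward, y left). Returns a list of (prev_index, curr_index) pairs.
    This is a stand-in for a learned association module.
    """
    pairs, used = [], set()
    for i, p in enumerate(prev):
        candidates = [j for j in range(len(curr)) if j not in used]
        if not candidates:
            break
        j = min(candidates, key=lambda k: math.dist(p, curr[k]))
        used.add(j)
        pairs.append((i, j))
    return pairs


def predict(prev, curr, pairs, dt, horizon):
    """Predict each matched object's position `horizon` seconds ahead.

    Uses constant-velocity extrapolation from two frames `dt` seconds apart,
    a stand-in for a learned per-object future predictor.
    """
    future = {}
    for i, j in pairs:
        vx = (curr[j][0] - prev[i][0]) / dt
        vy = (curr[j][1] - prev[i][1]) / dt
        future[j] = (curr[j][0] + vx * horizon, curr[j][1] + vy * horizon)
    return future


def should_slow_down(future, lane_half_width=1.75, lookahead=30.0):
    """Placeholder policy: slow down if any object is predicted to enter
    the corridor directly ahead of the ego vehicle."""
    return any(
        0.0 <= x <= lookahead and abs(y) <= lane_half_width
        for x, y in future.values()
    )


# A ball rolling towards the road and a parked car off to the right.
prev_frame = [(15.0, 4.0), (20.0, -5.0)]
curr_frame = [(15.0, 3.0), (20.0, -5.0)]

pairs = match_objects(prev_frame, curr_frame)
future = predict(prev_frame, curr_frame, pairs, dt=0.5, horizon=1.5)
print(should_slow_down(future))
```

Even in this toy form, the decision step sees only a handful of per-object quantities instead of a full image or BEV grid, and the object that triggers the precaution can be identified by its index.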
Fatma Guney, Koç University - Türkiye