A new hybrid multimodal model for human interaction recognition

Awarded Resources (in node hours): 50,000

System Partition: Leonardo Booster

Allocation Period: 3 June 2024 - 2 June 2025

AI Technology: Deep Learning, Vision (image recognition, image generation, text recognition OCR, etc.) and Audio (speech recognition, speech synthesis, etc.)

The goal of the project is to combine the robustness and local feature extraction of CNNs with the global understanding provided by Transformer-based approaches, improving model accuracy while requiring fewer computational resources.

Moreover, since multimodal approaches tend to be more robust and more accurate than single-modality ones, the team takes advantage of the multiple modalities that can be extracted from video.

Finally, to improve the usability of the approach in real-life scenarios in which not all modalities are available at all times (a sensor can fail, or a modality cannot be extracted), the project team applies a missing-modality strategy so that the model can cope with missing modalities at test time.
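The summary does not say which missing-modality strategy the team uses. As an illustration only, the sketch below shows one common pattern: randomly hiding modalities during training (modality dropout) and, at test time, fusing only the modalities that are present. The function names `drop_modalities` and `fuse_available`, the modality names `rgb`, `audio`, and `pose`, and the averaging fusion are all assumptions made for this example, not details of the project.

```python
import random
from typing import Dict, List, Optional

Embedding = List[float]


def drop_modalities(
    features: Dict[str, Embedding],
    drop_prob: float,
    rng: random.Random,
) -> Dict[str, Optional[Embedding]]:
    """Hypothetical training-time modality dropout.

    Each modality is hidden (set to None) with probability drop_prob,
    but at least one modality is always kept.
    """
    names = sorted(features)
    if not names:
        raise ValueError("at least one modality is required")
    kept = [name for name in names if rng.random() >= drop_prob]
    if not kept:
        # Never drop everything: keep one modality chosen at random.
        kept = [rng.choice(names)]
    return {name: (features[name] if name in kept else None) for name in names}


def fuse_available(features: Dict[str, Optional[Embedding]]) -> Embedding:
    """Hypothetical fusion: average the embeddings of the present modalities."""
    present = [vec for vec in features.values() if vec is not None]
    if not present:
        raise ValueError("at least one modality must be available")
    dim = len(present[0])
    if any(len(vec) != dim for vec in present):
        raise ValueError("all embeddings must share the same dimension")
    return [sum(vec[i] for vec in present) / len(present) for i in range(dim)]


# Test-time example: the (assumed) pose modality could not be extracted.
sample = {"rgb": [1.0, 3.0], "audio": [3.0, 5.0], "pose": None}
fused = fuse_available(sample)  # averages rgb and audio only
```

Averaging is chosen here only because it is defined for any non-empty subset of modalities; a real model would more likely use a learned fusion module, with dropout during training exposing that module to incomplete inputs.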

Principal Investigator, Research Team Institution & Country