AI Technology: Deep Learning, Vision (image recognition, image generation, text recognition OCR, etc.) and Audio (speech recognition, speech synthesis, etc.)
The goal of the project is to combine the robustness and local feature extraction of CNNs with the global understanding provided by Transformer-based approaches, improving model accuracy while requiring fewer computational resources.
Moreover, since multimodal approaches tend to be more robust and more accurate than single-modality ones, the team takes advantage of the multiple modalities that can be extracted from video.
Finally, to improve the usability of the approach in real-life scenarios in which not all modalities are available all the time (a sensor can fail, or a modality cannot be extracted), the project team applies a missing-modality strategy so that the model can deal with missing modalities at test time.
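To illustrate what a missing-modality strategy can look like, here is a minimal sketch of one common technique, modality dropout: modalities are randomly hidden during training, and fusion uses only the modalities that are present. This is not necessarily the strategy used in the project. The modality names (`rgb`, `optical_flow`, `audio`), the function names, and the averaging fusion are all hypothetical choices made for the example.

```python
import random

def drop_modalities(features, drop_prob, rng):
    """Randomly mark modalities as missing (None), always keeping at least one.

    `features` maps a modality name to its feature vector (hypothetical layout).
    """
    names = list(features)
    kept = [n for n in names if rng.random() >= drop_prob]
    if not kept:
        # Never drop everything: keep one modality chosen at random.
        kept = [rng.choice(names)]
    return {n: (features[n] if n in kept else None) for n in names}

def fuse_available(features):
    """Average the feature vectors of the modalities that are present."""
    present = [v for v in features.values() if v is not None]
    if not present:
        raise ValueError("at least one modality is required")
    dim = len(present[0])
    return [sum(v[i] for v in present) / len(present) for i in range(dim)]

# Hypothetical per-modality features extracted from one video clip.
sample = {
    "rgb": [1.0, 2.0],
    "optical_flow": [3.0, 4.0],
    "audio": [5.0, 6.0],
}

# Training time: simulate failures so the model sees incomplete inputs.
rng = random.Random(0)
train_input = fuse_available(drop_modalities(sample, drop_prob=0.5, rng=rng))

# Test time: a modality is really missing (e.g. the audio sensor failed).
test_input = fuse_available({"rgb": [1.0, 2.0], "optical_flow": [3.0, 4.0], "audio": None})
```

Because the fused vector has the same size whatever subset of modalities is present, the downstream model does not need a separate code path for each combination of missing inputs.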
Nicolas Guil, University of Malaga - Spain