Visual Explainability and Robustness through Language

Speaker: Riccardo Volpi - Naver Labs Europe - France
  Tuesday, June 4, 2024 at 1:45 PM, Aula C (in person only)

Abstract: In recent years, the vision-and-language paradigm has revolutionized the way we learn and rely on computer vision models. A major drawback of learning visual representations has always been the lack of data. By coupling vision models with large, pre-trained language models, we can partially mitigate this issue by building on large amounts of previously learned information. In this talk, we will discuss how using language can i) broaden the comfort zone of vision models for tasks such as object detection and classification and ii) improve their interpretability. We will go through the basics of the vision-and-language paradigm, highlight some of its inherent limitations, and discuss some innovative solutions, for example making CLIP-like models robust to arbitrary vocabularies selected by the user.

Programme Director
Vittorio Murino

Publication date
April 18, 2024