Abstract: In recent years, the vision-and-language paradigm has revolutionized the way we learn and rely on computer vision models. A major drawback of learning visual representations has always been the lack of data: by coupling our vision models with large, pre-trained language models, we can partially mitigate this issue by building on large amounts of previously learned information. In this talk, we will discuss how using language can i) broaden the comfort zone of vision models for tasks such as object detection and classification and ii) improve their interpretability. We will go through the basics of the vision-and-language paradigm, highlight some of its inherent limitations, and discuss some innovative solutions, for example to make CLIP-like models robust to arbitrary vocabularies selected by the user.
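As background for the "arbitrary vocabularies" point, the sketch below shows the standard zero-shot recipe for CLIP-like models: the user supplies any set of text labels, and the model scores an image against them. This is a minimal illustration of the paradigm discussed in the talk, not the speaker's proposed method; the checkpoint name and image path are placeholders chosen for the example.

```python
# Minimal zero-shot classification with a CLIP-like model and a
# user-chosen vocabulary, using the Hugging Face transformers API.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# The vocabulary is arbitrary: any label set the user cares about.
vocabulary = ["a photo of a dog", "a photo of a cat", "a photo of a bicycle"]

image = Image.open("example.jpg")  # placeholder path
inputs = processor(text=vocabulary, images=image,
                   return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity logits, turned into probabilities over the vocabulary.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(vocabulary, probs[0].tolist())))
```

Because the label set is chosen at inference time, the quality of the prediction depends heavily on how the vocabulary is phrased and composed, which is one of the limitations the talk addresses.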