Interactive product retrieval is an emerging research topic with the objective of integrating user inputs from multiple modalities into a query for retrieval. In this presentation, we discuss different solutions to the problem of composing images with language-based or attribute-based modifications for product retrieval in the context of fashion. We present Joint Visual Semantic Matching (JVSM), a unified model that learns image-text compositional embeddings by jointly associating visual and textual modalities in a shared discriminative embedding space via compositional losses. We also propose Visio-linguistic Attention Learning (VAL), a composite transformer that can be seamlessly plugged into a CNN to selectively preserve and transform visual features conditioned on language semantics at different levels of granularity. Finally, we introduce the Attribute-Driven Disentangled Encoder (ADDE), a model that learns disentangled representations from attribute supervision, and tailor it to attribute manipulation, outfit retrieval, and conditional image retrieval.
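To make the core idea concrete, the following is a minimal sketch of composed image-text retrieval: an image embedding and a text-modification embedding are fused by a gated residual composition, and the composed query is matched against a gallery by cosine similarity. This is an illustrative toy, not the JVSM, VAL, or ADDE models from the talk; the gating form, dimensions, and random weights are all assumptions for demonstration.

```python
# Illustrative sketch only: gated image-text composition for retrieval.
# Shapes, the gating form, and all weights are assumed for demonstration;
# they do not reproduce the models described in the abstract.
import numpy as np

rng = np.random.default_rng(0)
D = 8  # embedding dimension (assumed)

def l2_normalize(x):
    # Unit-normalize along the last axis so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-9)

def compose(img_emb, txt_emb, W_gate, W_res):
    """Gated residual composition: preserve some image features,
    transform others according to the text modification."""
    joint = np.concatenate([img_emb, txt_emb], axis=-1)
    gate = 1.0 / (1.0 + np.exp(-joint @ W_gate))   # sigmoid gate in [0, 1]
    residual = joint @ W_res                        # text-conditioned transform
    return l2_normalize(gate * img_emb + residual)

# Toy gallery of 5 candidate product embeddings plus one composed query.
gallery = l2_normalize(rng.normal(size=(5, D)))
img_q = l2_normalize(rng.normal(size=D))   # reference product image
txt_q = l2_normalize(rng.normal(size=D))   # encoded modification text
W_gate = rng.normal(size=(2 * D, D))
W_res = rng.normal(size=(2 * D, D))

query = compose(img_q, txt_q, W_gate, W_res)
scores = gallery @ query        # cosine similarity (all vectors unit-norm)
best = int(np.argmax(scores))   # index of the retrieved product
print(best, scores.shape)
```

In a trained system the random weights would be learned with a compositional (e.g. triplet or contrastive) loss so that the composed query lands near the target product in the shared embedding space.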
BIO: Loris is a Principal Computer Vision Scientist at Amazon in Berlin, Germany. He prototypes and develops video understanding models for Amazon Video, novel shopping experiences based on vision and language, innovative solutions in the field of Fashion AI, and image-to-text models for improving the accessibility of images for blind and visually impaired users. He obtained his Ph.D. in Computer Science from the University of Verona (Italy) in 2012, supervised by Prof. Vittorio Murino and Prof. Marco Cristani. During his Ph.D., he spent six months at the University of British Columbia supervised by Prof. Nando de Freitas. Before his current position, he was a postdoctoral fellow at Dartmouth College working with Prof. Lorenzo Torresani and a postdoctoral fellow at the Italian Institute of Technology working with Prof. Vittorio Murino.