- Authors: Appiani, Andrea; Beyan, Cigdem
- Title: VAD-CLVA: Integrating CLIP with LLaVA for Voice Activity Detection
- Year: 2025
- Product type: Journal article
- ANVUR type: Journal article
- Language: English
- Referee: No
- Journal name: INFORMATION
- Journal ISSN: 2078-2489
- Volume: 16
- Issue: 3
- Page range: 1-20
- Keywords: active speaker; voice activity detection; social interactions; vision language models; generative multimodal models; panel discussions
- Brief description of contents: Voice activity detection (VAD) is the process of automatically determining whether a person is speaking and identifying the timing of their speech in audiovisual data. Traditionally, this task has been tackled by processing either audio signals or visual data, or by combining both modalities through fusion or joint learning. In our study, drawing inspiration from recent advancements in vision-language models, we introduce a novel approach leveraging Contrastive Language-Image Pretraining (CLIP) models. The CLIP visual encoder analyzes video segments focusing on the upper body of an individual, while the text encoder processes textual descriptions generated by a generative large multimodal model, namely the Large Language and Vision Assistant (LLaVA). Embeddings from these encoders are then fused through a deep neural network to perform VAD. Our experimental analysis across three VAD benchmarks shows that our method outperforms existing visual VAD approaches. Notably, it also surpasses several audio-visual methods despite its simplicity and without requiring pretraining on extensive audio-visual datasets.
- Product ID: 145557
- IRIS handle: 11562/1161480
- Last modified: 9 May 2025
- Bibliographic citation: Appiani, Andrea; Beyan, Cigdem, VAD-CLVA: Integrating CLIP with LLaVA for Voice Activity Detection, «INFORMATION», vol. 16, n. 3, 2025, pp. 1-20
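As an illustration of the pipeline outlined in the description above, the following is a minimal sketch of how CLIP image embeddings of an upper-body crop and CLIP text embeddings of a LLaVA-generated description could be fused by a small classification head for speaking / not-speaking prediction. It assumes the OpenAI `clip` package and PyTorch; the fusion head, its layer sizes, and the example description string are illustrative assumptions, not the authors' exact architecture or training setup.

```python
# Hypothetical sketch of the CLIP + LLaVA fusion idea described in the abstract.
# Assumes the OpenAI `clip` package (pip install git+https://github.com/openai/CLIP.git)
# and PyTorch; head sizes and the description text are illustrative placeholders.
import torch
import torch.nn as nn
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)

class VADFusionHead(nn.Module):
    """Fuses CLIP image and text embeddings and predicts speaking vs. not speaking."""
    def __init__(self, embed_dim=512, hidden_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2),  # two classes: speaking / not speaking
        )

    def forward(self, img_emb, txt_emb):
        fused = torch.cat([img_emb, txt_emb], dim=-1)
        return self.mlp(fused)

head = VADFusionHead().to(device)

# One upper-body crop from a video segment and a LLaVA-generated description of it
# (file name and description are made-up placeholders).
image = preprocess(Image.open("upper_body_frame.jpg")).unsqueeze(0).to(device)
description = clip.tokenize(
    ["A person leaning forward with an open mouth, gesturing with one hand."]
).to(device)

with torch.no_grad():
    img_emb = clip_model.encode_image(image).float()
    txt_emb = clip_model.encode_text(description).float()

logits = head(img_emb, txt_emb)  # in practice the head would be trained on VAD labels
prob_speaking = logits.softmax(dim=-1)[0, 1].item()
print(f"P(speaking) = {prob_speaking:.3f}")
```

In this sketch both modalities are embedded by the frozen CLIP encoders and only the lightweight fusion head would need training, which reflects the simplicity the abstract emphasizes relative to methods pretrained on large audio-visual corpora.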