Chen, H., Zhou, H., Du, J., Lee, C., Chen, J., Watanabe, S., et al. (2022). The First Multimodal Information Based Speech Processing (MISP) Challenge: Data, Tasks, Baselines and Results. In 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 9266-9270) [10.1109/ICASSP43922.2022.9746683].

The First Multimodal Information Based Speech Processing (MISP) Challenge: Data, Tasks, Baselines and Results

Siniscalchi, Sabato Marco (Supervision)
2022-01-01

Abstract

In this paper we discuss the rationale of the Multimodal Information based Speech Processing (MISP) Challenge and provide a detailed description of the recorded data, the two evaluation tasks, and the corresponding baselines, followed by a summary of the submitted systems and evaluation results. The MISP Challenge aims at tackling speech processing tasks in different scenarios by introducing information from an additional modality (e.g., video or text), which is expected to improve environmental and speaker robustness in realistic applications. In the first MISP Challenge, two benchmark datasets recorded in a real home TV room, together with two reproducible open-source baseline systems, have been released to promote research in audio-visual wake word spotting (AVWWS) and audio-visual speech recognition (AVSR). To our knowledge, MISP is the first open evaluation challenge to tackle real-world issues of AVWWS and AVSR in the home TV scenario.
2022
Sector ING-INF/05 - Information Processing Systems
978-1-6654-0540-9
Files in this item:
The_First_Multimodal_Information_Based_Speech_Processing_Misp_Challenge_Data_Tasks_Baselines_And_Results.pdf
Access: archive managers only
Type: Publisher's version
Size: 864.94 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10447/636618