Historically, MISP challenges have focused on audio-visual speech recognition (AVSR), where they have been particularly successful in complex acoustic scenarios. However, even the most sophisticated AVSR systems have been found to have performance limitations. Inspired by traditional robust speech recognition systems, where speech enhancement as a front-end can significantly improve accuracy, the MISP2023 challenge focused on audio-visual target speaker extraction (AVTSE). The primary goal of AVTSE is to enhance speech quality by exploiting the lip movements of the target speaker, thereby improving the final recognition performance. This paper provides a comprehensive overview of the challenge framework, describes the results, and summarizes the effective strategies employed by the contributions. In addition, we analyze the prevailing technical hurdles and provide recommendations for future directions to spur further progress in the AVTSE field.
Chen H., Wu S., Wang C., Du J., Lee C.-H., Siniscalchi S.M., et al. (2024). Summary on the Multimodal Information-Based Speech Processing (MISP) 2023 Challenge. In 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings (pp. 123-124). Institute of Electrical and Electronics Engineers Inc. [10.1109/ICASSPW62465.2024.10627330].
Summary on the Multimodal Information-Based Speech Processing (MISP) 2023 Challenge
Wu S.; Siniscalchi S. M.
2024-01-01
File | Size | Format | |
---|---|---|---|
Summary_on_the_Multimodal_Information-Based_Speech_Processing_MISP_2023_Challenge.pdf (publisher's version; access restricted to archive managers) | 810.41 kB | Adobe PDF | |