Institutional Research Archive of the Università degli Studi di Palermo

Summary on the Multimodal Information-Based Speech Processing (MISP) 2023 Challenge

Chen H.; Wu S.; Wang C.; Du J.; Lee C.-H.; Siniscalchi S. M.; Watanabe S.; Chen J.; Scharenborg O.; Wang Z.-Q.; Yin B.-C.; Pan J.

2024-01-01

Abstract

Historically, MISP challenges have focused on audio-visual speech recognition (AVSR), where they have been particularly successful in complex acoustic scenarios. However, even the most sophisticated AVSR systems have been found to have performance limitations. Inspired by traditional robust speech recognition systems, where speech enhancement as a front-end can significantly improve accuracy, the MISP2023 challenge focused on audio-visual target speaker extraction (AVTSE). The primary goal of AVTSE is to enhance speech quality by exploiting the lip movements of the target speaker, thereby improving the final recognition performance. This paper provides a comprehensive overview of the challenge framework, describes the results, and summarizes the effective strategies employed by the contributions. In addition, we analyze the prevailing technical hurdles and provide recommendations for future directions to spur further progress in the AVTSE field.

Date: 2024
ISBN of the monograph: 979-8-3503-7451-3
DOI of the contribution: https://dx.doi.org/10.1109/ICASSPW62465.2024.10627330
Publisher URL (open access where possible): https://ieeexplore.ieee.org/document/10627330
Citation: Chen H., Wu S., Wang C., Du J., Lee C.-H., Siniscalchi S.M., et al. (2024). Summary on the Multimodal Information-Based Speech Processing (MISP) 2023 Challenge. In 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings (pp. 123-124). Institute of Electrical and Electronics Engineers Inc. [10.1109/ICASSPW62465.2024.10627330].
Appears in types: 2.07 Contribution in conference proceedings published in a volume

Files in this item:

File	Access	Type	Size	Format
Summary_on_the_Multimodal_Information-Based_Speech_Processing_MISP_2023_Challenge.pdf	Archive managers only	Publisher's version	810.41 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10447/663741
