Institutional research archive of the Università degli Studi di Palermo

The Multimodal Information Based Speech Processing (MISP) 2023 Challenge: Audio-Visual Target Speaker Extraction

Wu, Shilong; Wang, Chenxi; Chen, Hang; Dai, Yusheng; Zhang, Chenyue; Wang, Ruoyu; Lan, Hongbo; Du, Jun; Lee, Chin-Hui; Chen, Jingdong; Siniscalchi, Sabato Marco; Scharenborg, Odette; Wang, Zhong-Qiu; Pan, Jia; Gao, Jianqing

2024-01-01

Abstract

Previous Multimodal Information based Speech Processing (MISP) challenges mainly focused on audio-visual speech recognition (AVSR), with commendable success. However, even the most advanced back-end recognition systems often hit performance limits in complex acoustic environments. This has prompted a shift in focus towards the Audio-Visual Target Speaker Extraction (AVTSE) task for the MISP 2023 challenge in the ICASSP 2024 Signal Processing Grand Challenges. Unlike existing audio-visual speech enhancement challenges, which primarily focus on simulated data, the MISP 2023 challenge explores how front-end speech processing, combined with visual cues, affects back-end tasks in real-world scenarios. This effort aims to set the first benchmark for the AVTSE task, offering insights into improving the accuracy of back-end speech recognition systems through AVTSE in challenging, real acoustic environments. This paper gives an overview of the task setting, dataset, and baseline system of the MISP 2023 challenge, together with an analysis of the difficulties participants may encounter. The experimental results highlight the demanding nature of this task, and we look forward to the innovative solutions participants will bring forward.

Date: 2024
ISBN of the monograph: 979-8-3503-4485-1
DOI of the contribution: https://dx.doi.org/10.1109/icassp48485.2024.10447462
Publisher URL (open access where possible): https://ieeexplore.ieee.org/document/10447462
Citation: Wu, S., Wang, C., Chen, H., Dai, Y., Zhang, C., Wang, R., et al. (2024). The Multimodal Information Based Speech Processing (MISP) 2023 Challenge: Audio-Visual Target Speaker Extraction. In IEEE ICASSP [10.1109/icassp48485.2024.10447462].
Appears in types: 2.07 Conference proceedings contribution published in a volume

Files in this item:

File	Size	Format
The_Multimodal_Information_Based_Speech_Processing_MISP_2023_Challenge_Audio-Visual_Target_Speaker_Extraction.pdf (main document; publisher's version; access restricted to archive managers)	1.21 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10447/638753
