Institutional Research Archive of the Università degli Studi di Palermo

The Multimodal Information Based Speech Processing (MISP) 2022 Challenge: Audio-Visual Diarization and Recognition

Wang, Zhe; Wu, Shilong; Chen, Hang; He, Mao-Kui; Du, Jun; Lee, Chin-Hui; Chen, Jingdong; Watanabe, Shinji; Siniscalchi, Sabato; Scharenborg, Odette; Liu, Diyuan; Yin, Baocai; Pan, Jia; Gao, Jianqing; Liu, Cong

2023-01-01

Abstract

The Multimodal Information Based Speech Processing (MISP) challenge aims to extend the application of signal processing technology in specific scenarios by promoting research into wake-up words, speaker diarization, speech recognition, and other technologies. The MISP2022 challenge has two tracks: 1) audio-visual speaker diarization (AVSD), which aims to solve "who spoke when" using both audio and visual data; 2) a novel audio-visual diarization and recognition (AVDR) task, which addresses "who spoke what when" using audio-visual speaker diarization results. Both tracks focus on the Chinese language and use far-field audio and video recorded in real home-TV scenarios: 2-6 people communicating with each other with TV noise in the background. This paper introduces the dataset, track settings, and baselines of the MISP2022 challenge. Our analyses of experiments and examples indicate that the AVDR baseline system performs well, and they point to potential difficulties in this challenge arising from, e.g., far-field video quality, background TV noise, and hard-to-distinguish speakers.

Date	2023
ISBN of the monograph	978-1-7281-6327-7
DOI of the contribution	https://dx.doi.org/10.1109/icassp49357.2023.10094836
Citation	Wang, Z., Wu, S., Chen, H., He, M., Du, J., Lee, C., et al. (2023). The Multimodal Information Based Speech Processing (MISP) 2022 Challenge: Audio-Visual Diarization and Recognition. In IEEE ICASSP [10.1109/icassp49357.2023.10094836].
Appears in types:	2.07 Contribution to conference proceedings published in a volume

Files in this item:

File	Access	Type	Size	Format
The_Multimodal_Information_Based_Speech_Processing_Misp_2022_Challenge_Audio-Visual_Diarization_And_Recognition.pdf	Archive managers only	Publisher's version	1.11 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10447/637520
