Chen H., Zhang C.Y., Wang Q., Du J., Siniscalchi S.M., Xiong S.F., et al. (2025). HPCNet: Hybrid Pixel and Contour Network for Audio-Visual Speech Enhancement with Low-Quality Video. IEEE Journal of Selected Topics in Signal Processing, 1-13. doi:10.1109/JSTSP.2025.3559763
HPCNet: Hybrid Pixel and Contour Network for Audio-Visual Speech Enhancement with Low-Quality Video
Siniscalchi S. M. (Supervision)
2025-01-01
Abstract
To advance audio-visual speech enhancement (AVSE) research in low-quality video settings, we introduce the multimodal information-based speech processing-low quality video (MISP-LQV) benchmark, which includes a 120-hour real-world Mandarin audio-visual dataset, two video degradation simulation methods, and benchmark results from several well-known AVSE models. We also propose a novel hybrid pixel and contour network (HPCNet), incorporating a lip reconstruction and distillation (LRD) module and a contour graph convolution (CGConv) layer. Specifically, the LRD module reconstructs high-quality lip frames from low-quality audio-visual data, utilizing knowledge distillation from a teacher model trained on high-quality data. The CGConv layer employs spatio-temporal and semantic-contextual graphs to capture complex relationships among lip landmark points. Extensive experiments on the MISP-LQV benchmark reveal the performance degradation caused by low-quality video across various AVSE models. Notably, including real/simulated low-quality videos in AVSE training enhances robustness to low-quality videos but degrades performance on high-quality videos. The proposed HPCNet demonstrates strong robustness against video quality degradation, which can be attributed to (1) the reconstructed lip frames closely aligning with high-quality frames and (2) the contour features exhibiting consistency across different video quality levels. The generalizability of HPCNet has also been validated through experiments on the 2nd COG-MHEAR AVSE Challenge dataset.
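The abstract only summarizes the LRD module's distillation setup; below is a minimal PyTorch sketch of the general idea, assuming paired low-/high-quality lip crops and a teacher encoder trained on high-quality video that runs without gradients. All module names, shapes, and the loss weighting (`alpha`) are illustrative assumptions, and the audio stream is omitted for brevity.

```python
# Sketch of the distillation idea behind the LRD module. Architectures and
# shapes are hypothetical; the paper's actual design is not given here.
import torch
import torch.nn as nn

class LipEncoder(nn.Module):
    """Toy frame encoder: (B, T, 1, 96, 96) grayscale lip crops -> features."""
    def __init__(self, dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, dim)

    def forward(self, frames):
        b, t = frames.shape[:2]
        x = self.conv(frames.flatten(0, 1)).flatten(1)   # (B*T, 64)
        return self.proj(x).view(b, t, -1)               # (B, T, dim)

class LipDecoder(nn.Module):
    """Toy decoder: features -> reconstructed 96x96 lip frames."""
    def __init__(self, dim=256):
        super().__init__()
        self.fc = nn.Linear(dim, 96 * 96)

    def forward(self, feat):                             # (B, T, dim)
        b, t = feat.shape[:2]
        return self.fc(feat).view(b, t, 1, 96, 96)

def lrd_loss(student, teacher, decoder, lq_frames, hq_frames, alpha=1.0):
    """Frame reconstruction + feature distillation (hypothetical weighting)."""
    with torch.no_grad():                 # teacher sees HQ video, no gradients
        t_feat = teacher(hq_frames)
    s_feat = student(lq_frames)           # student sees LQ video
    recon = decoder(s_feat)               # reconstructed HQ lip frames
    loss_rec = nn.functional.l1_loss(recon, hq_frames)
    loss_kd = nn.functional.mse_loss(s_feat, t_feat)
    return loss_rec + alpha * loss_kd

# Usage with random stand-in data:
student, teacher, decoder = LipEncoder(), LipEncoder(), LipDecoder()
lq = torch.rand(2, 5, 1, 96, 96)          # low-quality clip
hq = torch.rand(2, 5, 1, 96, 96)          # paired high-quality clip
lrd_loss(student, teacher, decoder, lq, hq).backward()
```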
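Similarly, a minimal sketch of graph convolution over lip landmark points, in the spirit of the CGConv layer. The learnable adjacency matrices and the simple summation of the two graph views are assumptions; the abstract states only that spatio-temporal and semantic-contextual graphs are employed.

```python
# Graph convolution over lip landmarks with two graph views. The adjacency
# construction and fusion are illustrative assumptions.
import torch
import torch.nn as nn

class ContourGraphConv(nn.Module):
    def __init__(self, in_dim=2, out_dim=64, num_points=20):
        super().__init__()
        # One learnable adjacency per graph view, row-normalized via softmax.
        self.adj_st = nn.Parameter(torch.eye(num_points))  # spatio-temporal
        self.adj_sc = nn.Parameter(torch.eye(num_points))  # semantic-contextual
        self.w = nn.Linear(in_dim, out_dim)

    def forward(self, pts):                   # pts: (B, T, N, 2) landmarks
        h = self.w(pts)                       # (B, T, N, out_dim)
        a_st = self.adj_st.softmax(dim=-1)
        a_sc = self.adj_sc.softmax(dim=-1)
        # Aggregate neighbor features under both graphs and sum the views.
        return torch.einsum('nm,btmd->btnd', a_st, h) \
             + torch.einsum('nm,btmd->btnd', a_sc, h)

# Usage: 20 landmark points over 5 frames for a batch of 2 clips.
out = ContourGraphConv()(torch.rand(2, 5, 20, 2))   # -> (2, 5, 20, 64)
```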
File | Type | Size | Format | Access
---|---|---|---|---
HPCNet_Hybrid_Pixel_and_Contour_Network_for_Audio-Visual_Speech_Enhancement_with_Low-Quality_Video.pdf | Post-print | 7.83 MB | Adobe PDF | Restricted to archive managers (copy available on request)
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.