HPCNet: Hybrid Pixel and Contour Network for Audio-Visual Speech Enhancement with Low-Quality Video

Siniscalchi S. M. (Supervision)
2025-01-01

Abstract

To advance audio-visual speech enhancement (AVSE) research in low-quality video settings, we introduce the multimodal information-based speech processing-low quality video (MISP-LQV) benchmark, which includes a 120-hour real-world Mandarin audio-visual dataset, two video degradation simulation methods, and benchmark results from several well-known AVSE models. We also propose a novel hybrid pixel and contour network (HPCNet), which incorporates a lip reconstruction and distillation (LRD) module and a contour graph convolution (CGConv) layer. Specifically, the LRD module reconstructs high-quality lip frames from low-quality audio-visual data, using knowledge distillation from a teacher model trained on high-quality data. The CGConv layer employs spatio-temporal and semantic-contextual graphs to capture complex relationships among lip landmark points. Extensive experiments on the MISP-LQV benchmark reveal the performance degradation caused by low-quality video across various AVSE models. Notably, including real or simulated low-quality videos in AVSE training improves robustness to low-quality video but degrades performance on high-quality video. The proposed HPCNet demonstrates strong robustness against video quality degradation, which can be attributed to (1) the reconstructed lip frames closely aligning with high-quality frames and (2) the contour features remaining consistent across different video quality levels. The generalizability of HPCNet has also been validated through experiments on the 2nd COG-MHEAR AVSE Challenge dataset.
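
The abstract does not specify how the CGConv layer is built, so the following is only a minimal sketch of a generic graph convolution over landmark points, using the common symmetric normalization D^{-1/2} (A + I) D^{-1/2}. It is not the authors' CGConv layer. The four-point ring graph, the feature sizes, and the function names `normalized_adjacency` and `graph_conv` are all assumptions made for illustration.

```python
import numpy as np


def normalized_adjacency(edges, num_nodes):
    """Build D^{-1/2} (A + I) D^{-1/2} from an undirected edge list."""
    adj = np.eye(num_nodes)  # identity adds a self-loop to every node
    for i, j in edges:
        adj[i, j] = 1.0
        adj[j, i] = 1.0
    deg = adj.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def graph_conv(features, adj_norm, weight):
    """One graph convolution step: ReLU(A_norm @ X @ W)."""
    return np.maximum(adj_norm @ features @ weight, 0.0)


# Hypothetical toy input: 4 landmark points connected in a ring,
# each described by its (x, y) coordinates.
num_landmarks = 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]

rng = np.random.default_rng(0)
landmarks = rng.normal(size=(num_landmarks, 2))  # node features X
weight = rng.normal(size=(2, 8))                 # projection W (random, untrained)

adj_norm = normalized_adjacency(edges, num_landmarks)
out = graph_conv(landmarks, adj_norm, weight)    # shape: (4, 8)
```

Each output row mixes a landmark's own features with those of its graph neighbors, which is the basic mechanism by which a graph layer can relate nearby contour points. A layer like the one described in the abstract would presumably use learned weights and richer graphs (spatio-temporal and semantic-contextual) than this single static ring.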
2025
Sector IINF-05/A - Information Processing Systems
Chen H., Zhang C.Y., Wang Q., Du J., Siniscalchi S.M., Xiong S.F., et al. (2025). HPCNet: Hybrid Pixel and Contour Network for Audio-Visual Speech Enhancement with Low-Quality Video. IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 1-13 [10.1109/JSTSP.2025.3559763].
Files in this record:
HPCNet_Hybrid_Pixel_and_Contour_Network_for_Audio-Visual_Speech_Enhancement_with_Low-Quality_Video.pdf

Access: archive managers only
Type: Post-print
Size: 7.83 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10447/678903