HDA-SELD: Hierarchical Cross-Modal Distillation with Multi-Level Data Augmentation for Low-Resource Audio-Visual Sound Event Localization and Detection

Siniscalchi, S. M.
2026-01-01

Abstract

This work presents HDA-SELD, a unified hierarchical distillation and augmentation framework for audio-visual (AV) sound event localization and detection (SELD), designed to address the challenge of data scarcity. The proposed framework integrates hierarchical cross-modal distillation (HCMD) to transfer knowledge from a robust audio-only SELD teacher to an AV student through both output responses and intermediate hidden representations. To enhance learning, we introduce a multi-level data augmentation strategy that mixes features randomly selected from multiple network layers, paired with loss functions tailored to the SELD task. By employing loss interpolation instead of direct label manipulation, the strategy ensures spatial consistency during augmentation. Extensive experiments on the DCASE 2023 and 2024 Challenge SELD datasets show that the proposed method significantly improves AV SELD performance, yielding relative gains of 21%-38% in the overall metric over the baselines. Notably, HDA-SELD achieves results comparable to or better than teacher models trained on much larger datasets, surpassing state-of-the-art methods on both the DCASE 2023 and 2024 Challenge SELD tasks.
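The loss-interpolation idea described in the abstract — mixing hidden features at a randomly chosen network depth while interpolating the two per-sample losses rather than the labels — can be sketched as follows. This is a minimal illustration in the spirit of manifold-mixup-style training, assuming simple feed-forward layers; the names `multilevel_mix_loss`, `layers`, and `loss_fn` are hypothetical and do not reflect the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def multilevel_mix_loss(layers, x1, y1, x2, y2, loss_fn, alpha=0.5):
    """Mix hidden features at a randomly selected depth and interpolate
    the losses (not the labels), so each spatial target stays intact.
    Hypothetical sketch: `layers` is a list of callables, `loss_fn` a
    per-sample loss such as MSE."""
    lam = rng.beta(alpha, alpha)            # mixing ratio
    k = rng.integers(0, len(layers) + 1)    # depth 0 = input-level mixup
    h1, h2 = x1, x2
    for f in layers[:k]:                    # forward both samples to depth k
        h1, h2 = f(h1), f(h2)
    h = lam * h1 + (1.0 - lam) * h2         # feature-level mix at depth k
    for f in layers[k:]:                    # forward the mixed feature once
        h = f(h)
    # Loss interpolation: each target is compared against the prediction
    # with its original (unmixed) label, weighted by the mixing ratio.
    return lam * loss_fn(h, y1) + (1.0 - lam) * loss_fn(h, y2)
```

Interpolating losses rather than mixing the localization labels avoids producing a blended direction-of-arrival target that corresponds to no real source position, which is the spatial-consistency concern the abstract raises.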
2026
Wang, Q., Jiang, Y., Chen, H., Siniscalchi, S.M., Du, J., Gao, J. (2026). HDA-SELD: Hierarchical Cross-Modal Distillation with Multi-Level Data Augmentation for Low-Resource Audio-Visual Sound Event Localization and Detection. IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, 34, 1915-1928 [10.1109/TASLPRO.2026.3677672].
Files in this record:

File: HDA-SELD_Hierarchical_Cross-Modal_Distillation_With_Multi-Level_Data_Augmentation_for_Low-Resource_Audio-Visual_Sound_Event_Localization_and_Detection.pdf
Type: Published version
Size: 2.22 MB
Format: Adobe PDF
Access: Archive administrators only (View/Open — Request a copy)

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10447/703825