HDA-SELD: Hierarchical Cross-Modal Distillation with Multi-Level Data Augmentation for Low-Resource Audio-Visual Sound Event Localization and Detection

Siniscalchi, S. M.
2026-01-01

Abstract

This work presents HDA-SELD, a unified hierarchical distillation and augmentation framework for audio-visual (AV) sound event localization and detection (SELD), designed to address the challenge of data scarcity. The proposed framework integrates hierarchical cross-modal distillation (HCMD) to transfer knowledge from a robust audio-only SELD teacher to an AV student through both output responses and intermediate hidden representations. To enhance learning, we introduce a multi-level data augmentation strategy that mixes features randomly selected from multiple network layers, paired with loss functions tailored to the SELD task. By employing loss interpolation instead of direct label manipulation, the strategy ensures spatial consistency during augmentation. Extensive experiments on the DCASE 2023 and 2024 Challenge SELD datasets show that the proposed method significantly improves AV SELD performance, yielding relative gains of 21%-38% in the overall metric over the baselines. Notably, HDA-SELD achieves results comparable to or better than teacher models trained on much larger datasets, surpassing state-of-the-art methods on both the DCASE 2023 and 2024 Challenge SELD tasks.
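The loss-interpolation idea described in the abstract — mixing hidden features at a randomly chosen network depth while interpolating the two per-sample losses rather than the labels — can be sketched as follows. This is a minimal illustration in the spirit of manifold-mixup-style training, assuming simple feed-forward layers; the names `multilevel_mix_loss`, `layers`, and `loss_fn` are hypothetical and do not reflect the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def multilevel_mix_loss(layers, x1, y1, x2, y2, loss_fn, alpha=0.5):
    """Mix hidden features at a randomly selected depth and interpolate
    the losses (not the labels), so each spatial target stays intact.
    Hypothetical sketch: `layers` is a list of callables, `loss_fn` a
    per-sample loss such as MSE."""
    lam = rng.beta(alpha, alpha)            # mixing ratio
    k = rng.integers(0, len(layers) + 1)    # depth 0 = input-level mixup
    h1, h2 = x1, x2
    for f in layers[:k]:                    # forward both samples to depth k
        h1, h2 = f(h1), f(h2)
    h = lam * h1 + (1.0 - lam) * h2         # feature-level mix at depth k
    for f in layers[k:]:                    # forward the mixed feature once
        h = f(h)
    # Loss interpolation: each target is compared against the prediction
    # with its original (unmixed) label, weighted by the mixing ratio.
    return lam * loss_fn(h, y1) + (1.0 - lam) * loss_fn(h, y2)
```

Interpolating losses rather than mixing the localization labels avoids producing a blended direction-of-arrival target that corresponds to no real source position, which is the spatial-consistency concern the abstract raises.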
2026
Wang, Q., Jiang, Y., Chen, H., Siniscalchi, S.M., Du, J., Gao, J. (2026). HDA-SELD: Hierarchical Cross-Modal Distillation with Multi-Level Data Augmentation for Low-Resource Audio-Visual Sound Event Localization and Detection. IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, 34, 1915-1928 [10.1109/TASLPRO.2026.3677672].
Files in this record:

File: HDA-SELD_Hierarchical_Cross-Modal_Distillation_With_Multi-Level_Data_Augmentation_for_Low-Resource_Audio-Visual_Sound_Event_Localization_and_Detection.pdf
Type: Published version
Size: 2.22 MB
Format: Adobe PDF
Access: Archive administrators only (View/Open — Request a copy)

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10447/703825