Institutional Research Archive of the Università degli Studi di Palermo

HMSA-Net: Hierarchical Modality-Speaker Adaptive Network for Audio-Visual Speaker Diarization

Chen H.; He M. K.; Du J.; Siniscalchi S. M. (Writing – Review & Editing); Liu L. J.; Wan G. S.; Lee C. H.

2026-01-01

Abstract

Traditional audio-visual speaker diarization (AVSD) approaches exhibit limited robustness to cross-modal heterogeneity and complex inter-speaker interactions, particularly under dynamic, unconstrained real-world conditions. To overcome these limitations, we propose a novel neural architecture, termed the hierarchical modality–speaker adaptive network (HMSA-Net), which integrates two sequentially structured modules and an ad hoc multi-stage training strategy. The first module, consistency-gated inter-modal attention (CGIMA), dynamically estimates cross-modal synchrony between audio and visual embeddings and adaptively regulates their mutual influence during feature fusion, thereby mitigating modality mismatch. The second module, dense inter-speaker attention (DISA), explicitly captures complex inter-speaker relationships by applying multi-head attention from a target speaker representation to a densely aggregated bank of non-target speaker embeddings, enabling fine-grained discrimination in overlapping speech conditions without enforcing a fixed upper bound on the number of speakers. To further enhance optimization stability, a multi-stage optimization (MSO) scheme is introduced, which consistently achieves lower convergence loss than end-to-end training. Extensive evaluations on standard AVSD benchmarks demonstrate that CGIMA effectively suppresses modality-specific noise while amplifying complementary cross-modal cues, resulting in more robust fused representations. Meanwhile, DISA improves frame-level speaker discrimination by modeling dense cross-speaker dependencies. As a result, HMSA-Net trained with MSO achieves state-of-the-art performance on the AMI, MISP2022, and AVA-AVD
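The two mechanisms named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`consistency_gate`, `gated_fusion`, `attend_to_non_targets`), the sigmoid-of-cosine gate, the additive fusion rule, and the single attention head (the abstract specifies multi-head attention) are all assumptions chosen for brevity. The sketch only shows the general idea of gating visual influence by audio-visual agreement and of attending from one target speaker to a bank of non-target embeddings whose size is not fixed.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def consistency_gate(audio, visual, sharpness=4.0):
    """Hypothetical gate: squash audio-visual agreement into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-sharpness * cosine(audio, visual)))


def gated_fusion(audio, visual, sharpness=4.0):
    """Assumed fusion rule: add the visual embedding scaled by the gate."""
    gate = consistency_gate(audio, visual, sharpness)
    return [a + gate * v for a, v in zip(audio, visual)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_to_non_targets(target, others):
    """Single-head scaled dot-product attention from the target speaker
    embedding (query) to a bank of non-target embeddings (keys and values).
    The bank may hold any number of speakers, including none."""
    dim = len(target)
    if not others:
        return [], [0.0] * dim
    scale = math.sqrt(dim)
    scores = [sum(t * o for t, o in zip(target, other)) / scale
              for other in others]
    weights = softmax(scores)
    context = [sum(w * other[i] for w, other in zip(weights, others))
               for i in range(dim)]
    return weights, context
```

In this sketch, well-synchronized audio and visual embeddings produce a gate near 1, so the visual stream contributes fully, while conflicting embeddings push the gate toward 0 and the fused vector falls back to the audio embedding. The attention function accepts a list of any length, which mirrors the abstract's point about not fixing an upper bound on the number of speakers.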

Date: 2026
Scientific-disciplinary sector of the contribution: Sector IINF-05/A - Sistemi di elaborazione delle informazioni (Information Processing Systems)
Journal title: IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING
DOI of the contribution: https://dx.doi.org/10.1109/TASLPRO.2026.3698952
Citation: Chen, H., He, M.K., Du, J., Siniscalchi, S.M., Liu, L.J., Wan, G.S., et al. (2026). HMSA-Net: Hierarchical Modality-Speaker Adaptive Network for Audio-Visual Speaker Diarization. IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, 1-13 [10.1109/TASLPRO.2026.3698952].
Appears in types: 1.01 Journal article

Files in this item:

File	Description	Type	Access	Size	Format
HMSA-Net_Hierarchical_Modality-Speaker_Adaptive_Network_for_Audio-Visual_Speaker_Diarization.pdf	Main Document	Post-print	Archive administrators only	2.95 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10447/709077
