Chen, H., He, M.K., Du, J., Siniscalchi, S.M., Liu, L.J., Wan, G.S., et al. (2026). HMSA-Net: Hierarchical Modality-Speaker Adaptive Network for Audio-Visual Speaker Diarization. IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, 1-13 [10.1109/TASLPRO.2026.3698952].
HMSA-Net: Hierarchical Modality-Speaker Adaptive Network for Audio-Visual Speaker Diarization
Siniscalchi S. M.: Writing – Review & Editing
2026-01-01
Abstract
Traditional audio-visual speaker diarization (AVSD) approaches exhibit limited robustness to cross-modal heterogeneity and complex inter-speaker interactions, particularly under dynamic, unconstrained real-world conditions. To overcome these limitations, we propose a novel neural architecture, termed the hierarchical modality–speaker adaptive network (HMSA-Net), which integrates two sequentially structured modules and an ad-hoc multi-stage training strategy. The first consistency-gated inter-modal attention (CGIMA) module dynamically estimates cross-modal synchrony between audio and visual embeddings and adaptively regulates their mutual influence during feature fusion, thereby mitigating modality mismatch. The second dense inter-speaker attention (DISA) module explicitly captures complex inter-speaker relationships by applying multi-head attention from a target speaker representation to a densely aggregated bank of non-target speaker embeddings, enabling fine-grained discrimination in overlapping speech conditions without enforcing a fixed upper bound on the number of speakers. To further enhance optimization stability, a multi-stage optimization (MSO) scheme is introduced, which consistently achieves lower convergence loss than end-to-end training. Extensive evaluations on standard AVSD benchmarks demonstrate that CGIMA effectively suppresses modality-specific noise while amplifying complementary cross-modal cues, resulting in more robust fused representations. Meanwhile, DISA improves frame-level speaker discrimination by modeling dense cross-speaker dependencies. As a result, HMSA-Net trained with MSO achieves state-of-the-art performance on the AMI, MISP2022, and AVA-AVD

| File | Description | Type | Size | Format | Access |
|---|---|---|---|---|---|
| HMSA-Net_Hierarchical_Modality-Speaker_Adaptive_Network_for_Audio-Visual_Speaker_Diarization.pdf | Main Document | Post-print | 2.95 MB | Adobe PDF | Archive managers only |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


