Institutional Research Archive of the Università degli Studi di Palermo

Perceptual voice quality assessment plays a vital role in diagnosing and monitoring voice disorders. Traditional methods, such as the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V) and the Grade, Roughness, Breathiness, Asthenia, and Strain (GRBAS) scales, rely on expert raters and are prone to inter-rater variability, emphasizing the need for objective solutions. This study introduces the Voice Quality Assessment Network (VOQANet), a deep learning framework that employs an attention mechanism and Speech Foundation Model (SFM) embeddings to extract high-level features. To further enhance performance, we propose VOQANet+, which integrates self-supervised SFM embeddings with low-level acoustic descriptors, namely jitter, shimmer, and harmonics-to-noise ratio (HNR). Unlike previous approaches that focus solely on vowel-based phonation (PVQD-A), our models are evaluated on both vowel-level and sentence-level speech (PVQD-S) to assess generalizability. Experimental results demonstrate that sentence-based inputs yield higher accuracy, particularly at the patient level. Overall, VOQANet consistently outperforms baseline models in terms of root mean squared error (RMSE) and Pearson correlation coefficient across CAPE-V and GRBAS dimensions, with VOQANet+ achieving even greater performance gains. Additionally, VOQANet+ maintains consistent performance under noisy conditions, suggesting enhanced robustness for real-world and telehealth applications. This work highlights the value of combining SFM embeddings with low-level features for accurate and robust pathological voice assessment.
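As an illustration of the quantities named in the abstract, the following is a minimal sketch rather than the authors' pipeline. It assumes the common "local" definitions of jitter and shimmer (mean absolute difference between consecutive glottal cycles, divided by the mean) and the standard formulas for RMSE and the Pearson correlation coefficient. The function names and the sample values are hypothetical, and the abstract does not state how the descriptors were actually extracted.

```python
import math


def local_perturbation(values):
    """Mean absolute difference between consecutive values, divided by their mean.

    Applied to cycle periods this gives local jitter; applied to cycle
    peak amplitudes it gives local shimmer (assumed definitions).
    """
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    mean_value = sum(values) / len(values)
    if mean_value == 0:
        raise ValueError("mean value must be non-zero")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / mean_value


def rmse(predicted, reference):
    """Root mean squared error between predicted and reference scores."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


def pearson(predicted, reference):
    """Pearson correlation coefficient between two score lists."""
    n = len(predicted)
    if n != len(reference) or n < 2:
        raise ValueError("inputs must have equal length of at least two")
    mean_p = sum(predicted) / n
    mean_r = sum(reference) / n
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    if var_p == 0 or var_r == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_r)


if __name__ == "__main__":
    # Hypothetical cycle periods (seconds) and peak amplitudes.
    periods = [0.0100, 0.0102, 0.0099, 0.0101]
    amplitudes = [0.80, 0.78, 0.82, 0.79]
    print("local jitter:", local_perturbation(periods))
    print("local shimmer:", local_perturbation(amplitudes))

    # Hypothetical model predictions versus mean rater scores.
    predicted = [12.0, 35.5, 60.0, 20.0]
    rated = [10.0, 40.0, 55.0, 25.0]
    print("RMSE:", rmse(predicted, rated))
    print("Pearson r:", pearson(predicted, rated))
```

Lower RMSE and higher Pearson correlation both indicate closer agreement between predicted and rater-assigned scores, which is how the abstract compares VOQANet and VOQANet+ against the baselines.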

Ariyanti, W., Chen, K.Y., Siniscalchi, S.M., Wang, H.M., Tsao, Y. (2025). Towards Robust Assessment of Pathological Voices via Combined Low-Level Descriptors and Foundation Model Representations. IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, 1-13 [10.1109/JBHI.2025.3644692].

Towards Robust Assessment of Pathological Voices via Combined Low-Level Descriptors and Foundation Model Representations

Ariyanti W.; Chen K. Y.; Siniscalchi S. M.; Wang H. M.; Tsao Y.

2025-12-15

Date: 15 December 2025
Scientific disciplinary sector of the contribution: Sector IINF-05/A - Information Processing Systems
Journal title: IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS
DOI of the contribution: https://dx.doi.org/10.1109/JBHI.2025.3644692
Publisher URL (open access where possible): https://ieeexplore.ieee.org/document/11300933
Appears in types: 1.01 Journal article

Files in this record:

File	Access	Type	Size	Format
Towards_Robust_Assessment_of_Pathological_Voices_via_Combined_Low-Level_Descriptors_and_Foundation_Model_Representations.pdf	Open access	Post-print	4.13 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10447/698568
