A Study on Zero-shot Non-intrusive Speech Assessment using Large Language Models

Zezario R. E.; Siniscalchi S. M.; Wang H.-M.; Tsao Y.
2025-01-01

Abstract

This work investigates two strategies for zero-shot non-intrusive speech assessment leveraging large language models. First, we explore the audio analysis capabilities of GPT-4o. Second, we propose GPT-Whisper, which uses Whisper as an audio-to-text module and evaluates the text’s naturalness via targeted prompt engineering. We evaluate the assessment metrics predicted by GPT-4o and GPT-Whisper, examining their correlation with human-based quality and intelligibility assessments and the character error rate (CER) of automatic speech recognition. Experimental results show that GPT-4o alone is less effective for audio analysis, while GPT-Whisper achieves higher prediction accuracy, has moderate correlation with speech quality and intelligibility, and has higher correlation with CER. Compared to SpeechLMScore and DNSMOS, GPT-Whisper excels in intelligibility metrics, but performs slightly worse than SpeechLMScore in quality estimation. Furthermore, GPT-Whisper outperforms supervised non-intrusive models MOS-SSL and MTI-Net in Spearman’s rank correlation for Whisper’s CER. These findings validate GPT-Whisper’s potential for zero-shot speech assessment without requiring additional training data.
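For illustration only, the following is a minimal sketch of the GPT-Whisper idea described in the abstract: Whisper serves as an audio-to-text module, and a large language model is then prompted to judge the naturalness of the transcribed text. It assumes the open-source openai-whisper package and the OpenAI Python SDK; the model sizes, prompt wording, and 1-to-5 scale are illustrative assumptions, not the authors' exact prompt engineering or scoring setup.

    import whisper
    from openai import OpenAI

    # Load a Whisper ASR model (model size is an assumption; the paper may use another).
    asr_model = whisper.load_model("base")
    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def gpt_whisper_score(audio_path: str) -> str:
        """Transcribe audio with Whisper, then ask an LLM to rate the text's naturalness."""
        # Step 1: audio-to-text with Whisper.
        transcript = asr_model.transcribe(audio_path)["text"]

        # Step 2: prompt the LLM to assess the naturalness of the transcribed text.
        # The prompt and 1-5 scale below are illustrative, not the paper's exact prompt.
        prompt = (
            "Rate the naturalness of the following transcribed speech on a scale "
            "from 1 (very unnatural) to 5 (completely natural). "
            "Reply with a single number.\n\n"
            f"Transcript: {transcript}"
        )
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content.strip()

    # Example usage (the audio file path is hypothetical):
    # print(gpt_whisper_score("sample.wav"))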
2025
ISBN: 9798350368741
Zezario, R.E., Siniscalchi, S.M., Wang, H.-M., Tsao, Y. (2025). A Study on Zero-shot Non-intrusive Speech Assessment using Large Language Models. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings (pp. 1-5). Institute of Electrical and Electronics Engineers Inc. [10.1109/ICASSP49660.2025.10889809].
Files in this item:
  File: A_Study_on_Zero-shot_Non-intrusive_Speech_Assessment_using_Large_Language_Models.pdf
  Access: repository managers only
  Type: publisher's version
  Size: 808.02 kB
  Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10447/694132
Citations
  • PMC: not available
  • Scopus: 0
  • Web of Science: not available