Institutional research archive of the Università degli Studi di Palermo

HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models

Chen C.; Hu Y.; Yang C.-H. H.; Siniscalchi S. M. (Supervision); Chen P.-Y.; Chng E. S.

2023-01-01

Abstract

Advancements in deep neural networks have allowed automatic speech recognition (ASR) systems to attain human parity on several publicly available clean speech datasets. However, even state-of-the-art ASR systems degrade under adverse conditions, because a well-trained acoustic model is sensitive to variations in the speech domain, e.g., background noise. Intuitively, humans address this issue by relying on their linguistic knowledge: the meaning of ambiguous spoken terms is usually inferred from contextual cues, thereby reducing the dependency on the auditory system. Inspired by this observation, we introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction, where N-best decoding hypotheses provide informative elements for predicting the true transcription. This approach is a paradigm shift from the traditional language model rescoring strategy, which can only select one candidate hypothesis as the output transcription. The proposed benchmark contains a novel dataset, “HyPoradise” (HP), encompassing more than 334,000 pairs of N-best hypotheses and corresponding accurate transcriptions across prevalent speech domains. Given this dataset, we examine three types of LLM-based error correction techniques with varying amounts of labeled hypotheses-transcription pairs, which yield a significant word error rate (WER) reduction. Experimental evidence demonstrates that the proposed technique achieves a breakthrough by surpassing the upper bound of traditional re-ranking-based methods. More surprisingly, an LLM with a reasonable prompt and its generative capability can even correct tokens that are missing from the N-best list. We make our results publicly accessible for reproducible pipelines with released pre-trained models, thus providing a new evaluation paradigm for ASR error correction with LLMs.
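The gap between re-ranking and generative correction mentioned in the abstract can be illustrated with a small sketch. This is not the authors' evaluation code: the function names (`word_errors`, `wer`, `oracle_pick`) and the toy sentences are assumptions made for illustration, and the "generated" transcription is a hard-coded stand-in for an LLM output. WER is computed in the usual way, as word-level edit distance divided by the number of reference words. The re-ranking upper bound is taken here to be the N-best oracle, i.e. the hypothesis in the list closest to the reference.

```python
from typing import List


def word_errors(reference: str, hypothesis: str) -> int:
    """Word-level edit distance: substitutions + insertions + deletions."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # match or substitution
        prev = curr
    return prev[-1]


def wer(references: List[str], hypotheses: List[str]) -> float:
    """Corpus-level WER: total word errors over total reference words."""
    errors = sum(word_errors(r, h) for r, h in zip(references, hypotheses))
    words = sum(len(r.split()) for r in references)
    return errors / words


def oracle_pick(reference: str, nbest: List[str]) -> str:
    """Best candidate that any selection-only (re-ranking) method could return."""
    return min(nbest, key=lambda h: word_errors(reference, h))


# Toy, made-up data: no single hypothesis is fully correct, but every
# correct word appears somewhere in the list.
reference = "the cat sat on the mat"
nbest = [
    "a cat sad on the mat",    # 1-best, 2 word errors
    "the cat sad on the mat",  # 1 word error
    "the cat sat on a mat",    # 1 word error
]
generated = "the cat sat on the mat"  # hypothetical LLM output, hard-coded

first_best_wer = wer([reference], [nbest[0]])
oracle_wer = wer([reference], [oracle_pick(reference, nbest)])
generative_wer = wer([reference], [generated])

print(f"1-best WER:        {first_best_wer:.3f}")
print(f"N-best oracle WER: {oracle_wer:.3f}")
print(f"Generative WER:    {generative_wer:.3f}")
```

In this toy case a re-ranker cannot go below the oracle WER because every candidate contains at least one error, whereas a model that generates a new transcription from the whole list is not bound by that limit.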

Date: 2023
Scientific disciplinary sector of the contribution: Sector ING-INF/05 - Information Processing Systems
Publisher URL (open access where possible): https://openreview.net/forum?id=cAjZ3tMye6
Citation: Chen C., Hu Y., Yang C.-H.H., Siniscalchi S.M., Chen P.-Y., Chng E.S. (2023). HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models. In Advances in Neural Information Processing Systems. Neural information processing systems foundation.
Appears in types: 2.07 Conference proceedings contribution published in a volume

Files in this item:

File	Size	Format
840_hyporadise_an_open_baseline_fo (1).pdf (open access; description: post-print; type: Post-print)	932.73 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10447/637519
