Institutional Research Archive of the Università degli Studi di Palermo

In recent years, deep neural networks (DNNs), that is, multilayer perceptrons (MLPs) with many hidden layers, have been successfully applied to several speech tasks, e.g., phoneme recognition, out-of-vocabulary word detection, and confidence measures. In this paper, we show that DNNs can be used to boost the classification accuracy of basic speech units, such as phonetic attributes (phonological features) and phonemes. This boosting leads to higher flexibility and has the potential to integrate both top-down and bottom-up knowledge into the Automatic Speech Attribute Transcription (ASAT) framework. ASAT is a new family of lattice-based speech recognition systems grounded on accurate detection of speech attributes. We compare DNNs and shallow MLPs within the ASAT framework on the classification of phonetic attributes and phonemes. Several DNN architectures, ranging from five to seven hidden layers with up to 2048 hidden units per hidden layer, are presented and evaluated. Experimental evidence on the speaker-independent Wall Street Journal corpus demonstrates that DNNs achieve significant improvements over shallow MLPs with a single hidden layer, producing frame-level attribute estimation accuracies greater than 90% for all 21 phonetic features tested. A similar improvement is observed on the phoneme classification task, where DNNs reach a frame-level accuracy of 86.6%. This improved phoneme prediction accuracy, when integrated into a standard large vocabulary continuous speech recognition (LVCSR) system through a word lattice rescoring framework, yields word recognition accuracy better than previously reported word lattice rescoring results.
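The contrast between a shallow MLP and a DNN, and the frame-level accuracy metric quoted in the abstract, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The input dimension, the sigmoid hidden units, the two-class softmax output (attribute present or absent), the random untrained weights, and all function names are assumptions made only for illustration. The demo uses 512 hidden units to stay light, whereas the abstract reports up to 2048.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    # Subtract the row maximum for numerical stability.
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def init_mlp(input_dim, n_hidden_layers, hidden_units, n_classes, rng):
    """Random (untrained) weights for a feed-forward classifier."""
    sizes = [input_dim] + [hidden_units] * n_hidden_layers + [n_classes]
    return [(rng.normal(0.0, 0.01, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, frames):
    """Map a batch of feature frames to class posteriors."""
    h = frames
    for W, b in params[:-1]:
        h = sigmoid(h @ W + b)      # hidden layers (assumed sigmoid)
    W, b = params[-1]
    return softmax(h @ W + b)       # output layer

def frame_accuracy(posteriors, labels):
    """Fraction of frames whose top-scoring class matches the label."""
    return float(np.mean(posteriors.argmax(axis=1) == labels))

rng = np.random.default_rng(0)
INPUT_DIM = 429    # hypothetical: 39 features x 11-frame context window
N_CLASSES = 2      # hypothetical: attribute present vs. absent
HIDDEN_UNITS = 512 # reduced for the demo; the abstract reports up to 2048

shallow = init_mlp(INPUT_DIM, 1, HIDDEN_UNITS, N_CLASSES, rng)  # one hidden layer
deep = init_mlp(INPUT_DIM, 5, HIDDEN_UNITS, N_CLASSES, rng)     # five hidden layers

frames = rng.normal(size=(8, INPUT_DIM))        # synthetic frames, not real speech
labels = rng.integers(0, N_CLASSES, size=8)     # synthetic labels
posteriors = forward(deep, frames)
accuracy = frame_accuracy(posteriors, labels)
print(posteriors.shape, accuracy)
```

With untrained weights and synthetic data the printed accuracy carries no meaning. The sketch only shows how depth enters the model (the number of stacked hidden layers) and how a frame-level accuracy is computed.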

SINISCALCHI, S.M., Yu D., Deng L., Lee C. H. (2013). Exploiting Deep Neural Networks for Detection-Based Speech Recognition. NEUROCOMPUTING, 106, 148-157 [10.1016/j.neucom.2012.11.008].

Exploiting Deep Neural Networks for Detection-Based Speech Recognition

SINISCALCHI, SABATO MARCO (first author; Investigation); Yu D.; Deng L.; Lee C. H.

2013-01-01



Date: 2013
Journal title: NEUROCOMPUTING
DOI of the contribution: https://dx.doi.org/10.1016/j.neucom.2012.11.008
Alternative URL (other than the publisher's): http://www.sciencedirect.com/science/article/pii/S0925231212008636
Appears in types: 1.01 Journal article

Files in this item:

File	Size	Format
Neurocomputing_.pdf (publisher's version; archive managers only)	817.72 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10447/649556
