We propose an integrated end-to-end automatic speech recognition (ASR) paradigm by joint learning of the front-end speech signal processing and back-end acoustic modeling. We believe that “only good signal processing can lead to top ASR performance” in challenging acoustic environments. This notion leads to a unified deep neural network (DNN) framework for distant speech processing that can achieve both high-quality enhanced speech and high-accuracy ASR simultaneously. Our goal is accomplished by two techniques, namely: (i) a reverberation-time-aware DNN based speech dereverberation architecture that can handle a wide range of reverberation times to enhance speech quality of reverberant and noisy speech, followed by (ii) DNN-based multi-condition training that takes both clean-condition and multi-condition speech into consideration, leveraging upon an exploitation of the data acquired and processed with multi-channel microphone arrays, to improve ASR performance. The final end-to-end system is established by a joint optimization of the speech enhancement and recognition DNNs.
Bo Wu, Kehuang Li, Fengpei Ge, Huang Zhen, Yang Minglei, Sabato Marco Siniscalchi, et al. (2017). An End-to-End Deep Learning Approach to Simultaneous Speech Dereverberation and Acoustic Modeling for Robust Speech Recognition. IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 11(8), 1289-1300 [10.1109/JSTSP.2017.2756439].
An End-to-End Deep Learning Approach to Simultaneous Speech Dereverberation and Acoustic Modeling for Robust Speech Recognition
Sabato Marco SiniscalchiSupervision
;
2017-12-01
Abstract
We propose an integrated end-to-end automatic speech recognition (ASR) paradigm by joint learning of the front-end speech signal processing and back-end acoustic modeling. We believe that “only good signal processing can lead to top ASR performance” in challenging acoustic environments. This notion leads to a unified deep neural network (DNN) framework for distant speech processing that can achieve both high-quality enhanced speech and high-accuracy ASR simultaneously. Our goal is accomplished by two techniques, namely: (i) a reverberation-time-aware DNN based speech dereverberation architecture that can handle a wide range of reverberation times to enhance speech quality of reverberant and noisy speech, followed by (ii) DNN-based multi-condition training that takes both clean-condition and multi-condition speech into consideration, leveraging upon an exploitation of the data acquired and processed with multi-channel microphone arrays, to improve ASR performance. The final end-to-end system is established by a joint optimization of the speech enhancement and recognition DNNs.File | Dimensione | Formato | |
---|---|---|---|
08051278.pdf
Solo gestori archvio
Tipologia:
Versione Editoriale
Dimensione
820.27 kB
Formato
Adobe PDF
|
820.27 kB | Adobe PDF | Visualizza/Apri Richiedi una copia |
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.