We propose an integrated end-to-end automatic speech recognition (ASR) paradigm by joint learning of the front-end speech signal processing and back-end acoustic modeling. We believe that “only good signal processing can lead to top ASR performance” in challenging acoustic environments. This notion leads to a unified deep neural network (DNN) framework for distant speech processing that can achieve both high-quality enhanced speech and high-accuracy ASR simultaneously. Our goal is accomplished by two techniques, namely: (i) a reverberation-time-aware DNN based speech dereverberation architecture that can handle a wide range of reverberation times to enhance speech quality of reverberant and noisy speech, followed by (ii) DNN-based multi-condition training that takes both clean-condition and multi-condition speech into consideration, leveraging upon an exploitation of the data acquired and processed with multi-channel microphone arrays, to improve ASR performance. The final end-to-end system is established by a joint optimization of the speech enhancement and recognition DNNs.

Bo Wu, Kehuang Li, Fengpei Ge, Huang Zhen, Yang Minglei, Sabato Marco Siniscalchi, et al. (2017). An End-to-End Deep Learning Approach to Simultaneous Speech Dereverberation and Acoustic Modeling for Robust Speech Recognition. IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 11(8), 1289-1300 [10.1109/JSTSP.2017.2756439].

An End-to-End Deep Learning Approach to Simultaneous Speech Dereverberation and Acoustic Modeling for Robust Speech Recognition

Sabato Marco Siniscalchi
Supervision
;
2017-12-01

Abstract

We propose an integrated end-to-end automatic speech recognition (ASR) paradigm by joint learning of the front-end speech signal processing and back-end acoustic modeling. We believe that “only good signal processing can lead to top ASR performance” in challenging acoustic environments. This notion leads to a unified deep neural network (DNN) framework for distant speech processing that can achieve both high-quality enhanced speech and high-accuracy ASR simultaneously. Our goal is accomplished by two techniques, namely: (i) a reverberation-time-aware DNN based speech dereverberation architecture that can handle a wide range of reverberation times to enhance speech quality of reverberant and noisy speech, followed by (ii) DNN-based multi-condition training that takes both clean-condition and multi-condition speech into consideration, leveraging upon an exploitation of the data acquired and processed with multi-channel microphone arrays, to improve ASR performance. The final end-to-end system is established by a joint optimization of the speech enhancement and recognition DNNs.
dic-2017
Settore ING-INF/05 - Sistemi Di Elaborazione Delle Informazioni
Bo Wu, Kehuang Li, Fengpei Ge, Huang Zhen, Yang Minglei, Sabato Marco Siniscalchi, et al. (2017). An End-to-End Deep Learning Approach to Simultaneous Speech Dereverberation and Acoustic Modeling for Robust Speech Recognition. IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 11(8), 1289-1300 [10.1109/JSTSP.2017.2756439].
File in questo prodotto:
File Dimensione Formato  
08051278.pdf

Solo gestori archvio

Tipologia: Versione Editoriale
Dimensione 820.27 kB
Formato Adobe PDF
820.27 kB Adobe PDF   Visualizza/Apri   Richiedi una copia

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/10447/636656
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 67
  • ???jsp.display-item.citation.isi??? 47
social impact