Joint training of multi-channel-condition dereverberation and acoustic modeling of microphone array speech for robust distant speech recognition

Siniscalchi S. M.
Member of the Collaboration Group
2017-01-01

Abstract

We propose a novel data utilization strategy, called multi-channel-condition learning, that leverages the complementary information captured in microphone array speech to jointly train dereverberation and acoustic deep neural network (DNN) models for robust distant speech recognition. Experimental results with a single automatic speech recognition (ASR) system on the REVERB2014 simulated evaluation data show that, in 1-channel testing, the baseline joint training scheme attains a word error rate (WER) of 7.47%, reduced from 8.72% for separate training. The proposed multi-channel-condition learning scheme was evaluated on different channel-data combinations and usages, revealing many interesting implications. Finally, by training on all 8-channel data and applying DNN-based language model rescoring, a state-of-the-art WER of 4.05% is achieved. We anticipate an even lower WER when combining more top ASR systems.
2017
Ge F., Li K., Wu B., Siniscalchi S.M., Yan Y., Lee C.-H. (2017). Joint training of multi-channel-condition dereverberation and acoustic modeling of microphone array speech for robust distant speech recognition. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH (pp. 3847-3851). 4 Rue des Fauvettes - Lous Tourils: International Speech Communication Association [10.21437/Interspeech.2017-579].
Files in this record:
File Size Format
ge17_interspeech.pdf

Archive administrators only

Description: The full text of the article is available at the following link: https://www.isca-archive.org/interspeech_2017/ge17_interspeech.html
Type: Publisher's version
Size: 1.25 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10447/649496
Citations
  • PMC: ND
  • Scopus: 2
  • Web of Science: 2