
Shahrebabaki, A.S., Siniscalchi, S.M., Svendsen, T. (2021). Raw Speech-to-Articulatory Inversion by Temporal Filtering and Decimation. In 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH (pp. 1184-1188) [10.21437/Interspeech.2021-1429].

Raw Speech-to-Articulatory Inversion by Temporal Filtering and Decimation

Siniscalchi, Sabato Marco
Second
Supervision
2021-01-01

Abstract

We propose a novel sequence-to-sequence acoustic-to-articulatory inversion (AAI) neural architecture that operates in the temporal waveform domain. In contrast to traditional AAI approaches that leverage hand-crafted short-time spectral features obtained from the windowed signal, such as LSFs or MFCCs, our solution directly processes the input speech signal in the time domain, avoiding any intermediate signal transformation, using a cascade of 1D convolutional filters in a deep model. Time-rate synchronization between the raw speech signal and the articulatory signal is obtained through a decimation step applied after each convolution. Decimation in time thus avoids the degradation observed in the conventional AAI procedure, caused by the need to frame the speech signal so that the feature sequence exactly matches the articulatory data rate. Experimental evidence on the “Haskins Production Rate Comparison” corpus demonstrates the effectiveness of the proposed solution, which outperforms a conventional state-of-the-art AAI system leveraging MFCCs by a 20% relative improvement in terms of Pearson correlation coefficient (PCC) under mismatched speaking-rate conditions. Finally, the proposed approach attains the same accuracy as the conventional AAI solution in the typical matched speaking-rate condition.
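The rate-matching idea described in the abstract can be sketched numerically: a cascade of 1D convolutions, each followed by decimation, reduces an audio-rate waveform to the articulatory frame rate without any explicit framing step. The sketch below is a minimal illustration of filter-then-decimate cascading, assuming a 16 kHz input, a 100 Hz target rate, toy moving-average taps, and decimation factors of 4, 4, and 10; the actual learned kernels and layer configuration of the paper's model are not reproduced here.

```python
import numpy as np

def conv_decimate(x, taps, factor):
    """Apply a 1D FIR filter (convolution), then keep every `factor`-th sample."""
    y = np.convolve(x, taps, mode="same")
    return y[::factor]

fs_in, fs_out = 16000, 100          # overall decimation: 16000 / 100 = 160 = 4 * 4 * 10
x = np.random.randn(fs_in)          # 1 second of "speech" at 16 kHz (random stand-in)
taps = np.ones(5) / 5               # toy moving-average filter (stand-in for learned kernels)

for factor in (4, 4, 10):           # cascade: 16000 -> 4000 -> 1000 -> 100 samples
    x = conv_decimate(x, taps, factor)

print(len(x))                       # 100 samples: one output per articulatory frame
```

Because decimation happens inside the cascade rather than via windowing, the output sequence length lands exactly on the articulatory data rate by construction, which is the synchronization property the abstract emphasizes.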
2021
Field ING-INF/05 - Information Processing Systems
Files in this product:

File: shahrebabaki21_interspeech.pdf (open access)
Type: Published version
Size: 571.43 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10447/636622
Citations
  • PMC: n/a
  • Scopus: 1
  • Web of Science: 1