Shahrebabaki, A. S., Siniscalchi, S. M., Svendsen, T. (2021). Raw Speech-to-Articulatory Inversion by Temporal Filtering and Decimation. In Proc. 22nd Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 1184–1188. doi: 10.21437/Interspeech.2021-1429.
Raw Speech-to-Articulatory Inversion by Temporal Filtering and Decimation
Siniscalchi, Sabato Marco (second author; contribution: Supervision)
2021-01-01
Abstract
We propose a novel sequence-to-sequence acoustic-to-articulatory inversion (AAI) neural architecture that operates in the temporal waveform domain. In contrast to traditional AAI approaches that leverage hand-crafted short-time spectral features obtained from the windowed signal, such as LSFs or MFCCs, our solution directly processes the input speech signal in the time domain, avoiding any intermediate signal transformation, using a cascade of 1D convolutional filters in a deep model. Time-rate synchronization between the raw speech signal and the articulatory signal is obtained through a decimation process applied at each convolution step. Decimation in time thus avoids the degradation observed in the conventional AAI procedure, which is caused by the need to frame the speech signal so that the resulting feature sequence exactly matches the articulatory data rate. Experimental evidence on the “Haskins Production Rate Comparison” corpus demonstrates the effectiveness of the proposed solution, which outperforms a conventional state-of-the-art AAI system leveraging MFCCs with a 20% relative improvement in terms of Pearson correlation coefficient (PCC) in mismatched speaking rate conditions. Finally, the proposed approach attains the same accuracy as the conventional AAI solution in the typical matched speaking rate condition.
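The abstract's core idea, replacing short-time framing with strided temporal convolutions whose cumulative decimation factor maps the audio sample rate onto the articulatory frame rate, can be illustrated with a minimal sketch. The PyTorch code below is an assumed reconstruction, not the authors' published configuration: the 16 kHz input rate, 100 Hz output rate, channel widths, kernel sizes, strides, and the `ConvDecimationAAI` name are all illustrative choices.

```python
# Minimal sketch (illustrative, not the authors' exact model): a cascade of
# strided 1D convolutions decimates raw speech from 16 kHz down to a 100 Hz
# articulatory rate, so no short-time framing is needed.
import torch
import torch.nn as nn

class ConvDecimationAAI(nn.Module):
    def __init__(self, n_articulatory_channels: int = 12):  # channel count is an assumption
        super().__init__()
        # Cumulative stride 4 * 4 * 10 = 160 maps 16000 Hz -> 100 Hz.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=16, stride=4, padding=6),
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=16, stride=4, padding=6),
            nn.ReLU(),
            nn.Conv1d(128, 256, kernel_size=20, stride=10, padding=5),
            nn.ReLU(),
        )
        # Linear read-out to the articulatory trajectories at 100 Hz.
        self.head = nn.Conv1d(256, n_articulatory_channels, kernel_size=1)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, 1, n_samples) raw speech at 16 kHz.
        return self.head(self.encoder(waveform))

model = ConvDecimationAAI()
x = torch.randn(2, 1, 16000)   # two 1-second utterances
y = model(x)                   # (2, 12, 100) articulatory frames
print(y.shape)
```

With a cumulative stride of 4 × 4 × 10 = 160, one second of audio (16 000 samples) yields exactly 100 output frames, so the predicted trajectories align with 100 Hz articulatory recordings without any windowing or interpolation of the acoustic features.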
File | Description | Access | Type | Size | Format
---|---|---|---|---|---
shahrebabaki21_interspeech.pdf | Full text available at: https://www.isca-archive.org/interspeech_2021/shahrebabaki21_interspeech.html | Restricted to archive managers (request a copy) | Publisher's version | 571.43 kB | Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.