Institutional Research Archive of the Università degli Studi di Palermo

Sequence-to-Sequence Articulatory Inversion Through Time Convolution of Sub-Band Frequency Signals

Shahrebabaki, Abdolreza Sabzi; Siniscalchi, Sabato Marco (second author; supervision); Salvi, Giampiero; Svendsen, Torbjørn

2020-01-01

Abstract

We propose a new sequence-to-sequence neural architecture for acoustic-to-articulatory inversion (AAI), in which spectral sub-bands are processed independently in time by 1-dimensional (1-D) convolutional filters of different sizes. The learned feature maps are then combined and processed by a recurrent block with bi-directional long short-term memory (BLSTM) gates, which preserves the smoothly varying nature of the articulatory trajectories. Our experimental evidence shows that, on a speaker-dependent AAI task, our model achieves a lower root mean squared error (RMSE) and a higher Pearson's correlation coefficient (PCC) than both a BLSTM model and an FC-BLSTM model whose first stages are fully connected layers, despite having fewer parameters. In particular, the average RMSE goes from 1.401 when feeding the filterbank features directly into the BLSTM, to 1.328 with the FC-BLSTM model, and to 1.216 with the proposed method. Similarly, the average PCC increases from 0.859 to 0.877 and 0.895, respectively. On a speaker-independent AAI task, we show that our convolutional features outperform the original filterbank features, and that they can be combined with phonetic features, which bring independent information to the solution of the problem. To the best of the authors' knowledge, we report the best results on the given task and data.
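The two evaluation metrics named in the abstract can be illustrated with a minimal sketch. It assumes each articulatory trajectory is a plain list of floats with one value per frame; the function names `rmse` and `pcc` and the toy data are hypothetical and are not taken from the paper. How the authors average the scores across articulators and utterances is not stated on this page.

```python
import math


def rmse(reference, predicted):
    """Root mean squared error between two equal-length trajectories."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("trajectories must be non-empty and equal in length")
    squared_errors = [(r - p) ** 2 for r, p in zip(reference, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def pcc(reference, predicted):
    """Pearson's correlation coefficient between two trajectories."""
    n = len(reference)
    if n != len(predicted) or n < 2:
        raise ValueError("trajectories must have equal length of at least 2")
    mean_r = sum(reference) / n
    mean_p = sum(predicted) / n
    covariance = sum((r - mean_r) * (p - mean_p)
                     for r, p in zip(reference, predicted))
    var_r = sum((r - mean_r) ** 2 for r in reference)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_r == 0 or var_p == 0:
        raise ValueError("PCC is undefined for a constant trajectory")
    return covariance / math.sqrt(var_r * var_p)


# Toy data (hypothetical): the prediction has the right shape
# but a constant offset of 0.5 from the reference.
reference = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]

print(rmse(reference, predicted))  # 0.5
print(pcc(reference, predicted))   # 1.0
```

The toy case shows why the two metrics are reported together: a constant offset leaves the PCC at 1.0 while the RMSE is non-zero, so a lower RMSE and a higher PCC capture different aspects of how well a predicted trajectory matches the measured one.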

Date: 2020
ISBN of the monograph: 978-1-7138-2069-7
DOI of the contribution: https://dx.doi.org/10.21437/Interspeech.2020-1140
Publisher URL (open access where possible): http://www.interspeech2020.org/uploadfile/pdf/Wed-2-10-2.pdf
Citation: Shahrebabaki, A.S., Siniscalchi, S.M., Salvi, G., Svendsen, T. (2020). Sequence-to-Sequence Articulatory Inversion Through Time Convolution of Sub-Band Frequency Signals. In 21st Annual Conference of the International Speech Communication Association (pp. 2882-2886) [10.21437/Interspeech.2020-1140].
Appears in the following categories: 2.07 Contribution to conference proceedings published in a volume

Files in this item:

File	Size	Format
1140.pdf (archive managers only). Description: The full text of the article is available at the following link: http://www.interspeech2020.org/uploadfile/pdf/Wed-2-10-2.pdf. Type: Publisher's version	785.18 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10447/636620
