Institutional Research Archive of the Università degli Studi di Palermo

I. Kukanov, T. Trong, V. Hautamaki, S. M. Siniscalchi, V. M. Salerno, K. A. Lee (2020). Maximal Figure-of-Merit Framework to Detect Multi-label Phonetic Features for Spoken Language Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28, 682-695 [10.1109/TASLP.2020.2964953].

Maximal Figure-of-Merit Framework to Detect Multi-label Phonetic Features for Spoken Language Recognition

I. Kukanov; T. Trong; V. Hautamaki; S. M. Siniscalchi (Writing – Original Draft Preparation); V. M. Salerno; K. A. Lee

2020-01-08

Abstract

Bottleneck features (BNFs) generated with a deep neural network (DNN) have been shown to significantly boost spoken language recognition accuracy over basic spectral features. However, BNFs are commonly extracted using language-dependent tied-context phone states as learning targets. Moreover, BNFs are less phonetically expressive than the output layer of a DNN, which is usually not used as a speech feature because its very high dimensionality hinders further post-processing. In this work, we put forth a novel deep learning framework to overcome these issues and evaluate it on the 2017 NIST Language Recognition Evaluation (LRE) challenge. We use manner and place of articulation as speech attributes, which lead to low-dimensional “universal” phonetic features that can be defined across all spoken languages. To model the asynchronous nature of the speech attributes while capturing their intrinsic relationships in a given speech segment, we introduce a new training scheme for deep architectures based on a Maximal Figure of Merit (MFoM) objective. MFoM brings non-differentiable metrics into backpropagation-based training, a problem that the proposed framework solves elegantly. The experimental evidence collected on the NIST LRE 2017 challenge demonstrates the effectiveness of our solution: the performance of spoken language recognition (SLR) systems based on spectral features is improved by more than 5% absolute Cavg. Finally, the F1 metric can be raised from 77.6% to 78.1% by combining the conventional baseline phonetic BNFs with the proposed articulatory attribute features.
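
To illustrate the general idea of training against a count-based metric such as F1, the sketch below replaces hard threshold decisions with sigmoids so that the metric becomes differentiable in the scores. This is a generic soft-F1 surrogate written for illustration only, not the MFoM formulation of the paper; the function names, the slope parameter `alpha`, and the toy scores and labels are all assumptions.

```python
import numpy as np

def sigmoid(z):
    """Logistic function used to smooth the hard decision."""
    return 1.0 / (1.0 + np.exp(-z))

def hard_f1(scores, labels, threshold=0.0):
    """Micro-averaged F1 from hard decisions (not differentiable in scores)."""
    pred = (scores > threshold).astype(float)
    tp = np.sum(pred * labels)
    fp = np.sum(pred * (1.0 - labels))
    fn = np.sum((1.0 - pred) * labels)
    return 2.0 * tp / (2.0 * tp + fp + fn)

def soft_f1(scores, labels, threshold=0.0, alpha=5.0):
    """Smoothed F1: each hard decision is replaced by a sigmoid of slope alpha.

    The soft counts are differentiable in `scores`, so 1 - soft_f1 could
    serve as a training loss. `alpha` is a hypothetical smoothing parameter.
    """
    p = sigmoid(alpha * (scores - threshold))
    tp = np.sum(p * labels)
    fp = np.sum(p * (1.0 - labels))
    fn = np.sum((1.0 - p) * labels)
    return 2.0 * tp / (2.0 * tp + fp + fn)

# Toy multi-label example (assumed data): 2 frames x 3 attribute detectors.
scores = np.array([[2.0, -1.0, 0.5],
                   [-3.0, 1.5, -0.5]])
labels = np.array([[1.0, 0.0, 1.0],
                   [0.0, 1.0, 1.0]])

print("hard F1:", hard_f1(scores, labels))
print("soft F1 (alpha=5):", soft_f1(scores, labels, alpha=5.0))
print("soft F1 (alpha=50):", soft_f1(scores, labels, alpha=50.0))
```

As `alpha` grows, the soft counts approach the hard counts and the surrogate approaches the hard F1; a smaller `alpha` gives smoother gradients at the cost of a looser approximation.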

Date: 8 January 2020
Journal title: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Contribution DOI: https://dx.doi.org/10.1109/TASLP.2020.2964953
Publisher URL (open access where possible): https://ieeexplore.ieee.org/document/8952610
Appears in types: 1.01 Journal article

Files in this item:

File	Size	Format
08952610.pdf (open access; type: publisher's version)	2.54 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10447/636463
