
Dymbe, S., Siniscalchi, S.M., Svendsen, T., Salvi, G. (2026). Using Cross-Attention for Conversational ASR over the Telephone. In TSD 2025 (pp. 394-405) [10.1007/978-3-032-02548-7_33].

Using Cross-Attention for Conversational ASR over the Telephone

Dymbe, S.; Siniscalchi, S.M.; Svendsen, T.; Salvi, G.
2026-01-01

Abstract

We present a neural architecture for speech recognition over the telephone. In telephone conversations, the two speakers are recorded on separate channels. Although this separation is mostly an advantage, it also removes important contextual information from the other speaker. Earlier approaches address this problem, but either 1) they do not precisely model the temporal relationship between the two channels, or 2) the model only has access to context in the form of text. We propose a Transformer model that uses cross-attention between the two channels of a telephone conversation, with positional encodings that provide the model with the exact temporal relationship between the channels. Our empirical results on the Fisher, CallHome, and Switchboard datasets show that our model outperforms the HuBERT baseline by a significant margin. We also provide an analysis of the cross-attention maps that sheds some light on the inner workings of the model.
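The mechanism named in the abstract can be illustrated with a minimal NumPy sketch: queries from one telephone channel attend over keys and values from the other channel, and both channels share sinusoidal positional encodings on a common time axis so the model sees their exact temporal relationship. This is an illustrative sketch under stated assumptions, not the authors' implementation; the function names and dimensions are hypothetical.

```python
import numpy as np

def sinusoidal_pe(n_frames, d_model):
    # Standard sinusoidal positional encoding over a shared time axis.
    # Both channels of a call are time-aligned, so the SAME encoding is
    # added to both, encoding their temporal relationship exactly.
    pos = np.arange(n_frames)[:, None]          # (n_frames, 1)
    i = np.arange(d_model)[None, :]             # (1, d_model)
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def cross_attention(x_a, x_b):
    # Single-head scaled dot-product cross-attention:
    # queries come from channel A, keys/values from channel B.
    d = x_a.shape[-1]
    scores = x_a @ x_b.T / np.sqrt(d)           # (T_a, T_b)
    # numerically stable row-wise softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x_b, weights               # context, attention map

# Two time-aligned channels of the same call, with shared positions.
rng = np.random.default_rng(0)
T, d = 6, 8
chan_a = rng.standard_normal((T, d)) + sinusoidal_pe(T, d)
chan_b = rng.standard_normal((T, d)) + sinusoidal_pe(T, d)
context, attn_map = cross_attention(chan_a, chan_b)
```

The returned `attn_map` is the kind of cross-attention map the paper analyzes: each row shows how strongly a frame of one channel attends to each frame of the other.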
Field IINF-05/A - Information processing systems
ISBN: 9783032025470, 9783032025487
File in this product:
978-3-032-02548-7_33 (1).pdf — Publisher's version, Adobe PDF, 5.97 MB (restricted access; archive administrators only)

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10447/689285