
Dymbe, S., Siniscalchi, S.M., Svendsen, T., Salvi, G. (2026). Using Cross-Attention for Conversational ASR over the Telephone. In TSD 2025 (pp. 394-405) [10.1007/978-3-032-02548-7_33].

Using Cross-Attention for Conversational ASR over the Telephone

Dymbe, S.; Siniscalchi, S.M.; Svendsen, T.; Salvi, G.
2026-01-01

Abstract

We present a neural architecture for speech recognition over the telephone. In telephone conversations, the two speakers are recorded on separate channels. Although this separation is mostly an advantage, it also removes important contextual information from the other speaker. Earlier approaches address this problem, but either 1) they do not precisely model the temporal relationship between the two channels, or 2) the model only has access to context in the form of text. We propose a Transformer model that uses cross-attention between the two channels of a telephone conversation, with positional encodings that provide the model with the exact temporal relationship between the channels. Our empirical results on the Fisher, CallHome, and Switchboard datasets show that our model outperforms the HuBERT baseline by a significant margin. We also provide an analysis of the cross-attention maps that sheds some light on the inner workings of the model.
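The mechanism named in the abstract can be illustrated with a minimal NumPy sketch: queries from one telephone channel attend over keys and values from the other channel, and both channels share sinusoidal positional encodings on a common time axis so the model sees their exact temporal relationship. This is an illustrative sketch under stated assumptions, not the authors' implementation; the function names and dimensions are hypothetical.

```python
import numpy as np

def sinusoidal_pe(n_frames, d_model):
    # Standard sinusoidal positional encoding over a shared time axis.
    # Both channels of a call are time-aligned, so the SAME encoding is
    # added to both, encoding their temporal relationship exactly.
    pos = np.arange(n_frames)[:, None]          # (n_frames, 1)
    i = np.arange(d_model)[None, :]             # (1, d_model)
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def cross_attention(x_a, x_b):
    # Single-head scaled dot-product cross-attention:
    # queries come from channel A, keys/values from channel B.
    d = x_a.shape[-1]
    scores = x_a @ x_b.T / np.sqrt(d)           # (T_a, T_b)
    # numerically stable row-wise softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x_b, weights               # context, attention map

# Two time-aligned channels of the same call, with shared positions.
rng = np.random.default_rng(0)
T, d = 6, 8
chan_a = rng.standard_normal((T, d)) + sinusoidal_pe(T, d)
chan_b = rng.standard_normal((T, d)) + sinusoidal_pe(T, d)
context, attn_map = cross_attention(chan_a, chan_b)
```

The returned `attn_map` is the kind of cross-attention map the paper analyzes: each row shows how strongly a frame of one channel attends to each frame of the other.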
Field IINF-05/A - Information processing systems
ISBN: 9783032025470, 9783032025487
File in this product:
978-3-032-02548-7_33 (1).pdf — Publisher's version, Adobe PDF, 5.97 MB (restricted access; archive administrators only)

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10447/689285