Dymbe, S., Siniscalchi, S.M., Svendsen, T., Salvi, G. (2026). Using Cross-Attention for Conversational ASR over the Telephone. In TSD 2025 (pp. 394–405). https://doi.org/10.1007/978-3-032-02548-7_33
Using Cross-Attention for Conversational ASR over the Telephone
Dymbe S.; Siniscalchi S. M.; Svendsen T.; Salvi G.
2026-01-01
Abstract
We present a neural architecture for speech recognition over the telephone. In telephone conversations, the speakers are already recorded on separate channels. Although this is mostly an advantage, the separation also removes important contextual information coming from the other speaker. Earlier approaches have been proposed to address this problem, but either 1) they do not precisely model the temporal relationship between the two channels, or 2) the model only has access to context in the form of text. We propose a Transformer model that uses cross-attention between the two channels of a telephone conversation, together with positional encodings that provide the model with the accurate temporal relationship between the two channels. Our empirical results on the Fisher, CallHome and Switchboard datasets show that our model outperforms the HuBERT baseline by a significant margin. We also provide an analysis of the cross-attention maps that sheds some light on the internal workings of the model.
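The abstract only outlines the architecture at a high level, and the paper itself is not reproduced here. Purely as an illustration of the general idea, not the authors' implementation, the PyTorch sketch below shows one cross-attention block in which one telephone channel attends to the other, with both channels sharing a positional encoding defined on the common conversation timeline. All class names, dimensions, and hyperparameters are hypothetical.

```python
# Hypothetical sketch (not the paper's code): cross-attention from one
# telephone channel to the other, with a shared sinusoidal positional
# encoding so both channels are indexed on the same conversation timeline.
import math
import torch
import torch.nn as nn


def sinusoidal_pe(num_frames: int, d_model: int) -> torch.Tensor:
    """Standard sinusoidal positional encoding over conversation time."""
    pos = torch.arange(num_frames).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(num_frames, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe


class ChannelCrossAttention(nn.Module):
    """One cross-attention block: channel A attends to channel B."""

    def __init__(self, d_model: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, chan_a: torch.Tensor, chan_b: torch.Tensor) -> torch.Tensor:
        # Both channels are synchronous recordings of the same conversation,
        # so they receive the same positional encoding; this is what gives the
        # model the temporal relationship between the two streams.
        pe = sinusoidal_pe(chan_a.size(1), chan_a.size(2)).to(chan_a.device)
        q = chan_a + pe
        kv = chan_b + pe
        ctx, _ = self.attn(query=q, key=kv, value=kv)
        return self.norm(chan_a + ctx)  # residual connection


# Toy usage: two channels of 100 frames with 256-dim features each.
block = ChannelCrossAttention()
a, b = torch.randn(1, 100, 256), torch.randn(1, 100, 256)
out = block(a, b)  # (1, 100, 256): channel A enriched with context from B
```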
| File | Type | Size | Format | Access |
|---|---|---|---|---|
| 978-3-032-02548-7_33 (1).pdf | Versione Editoriale (publisher's version) | 5.97 MB | Adobe PDF | Archive administrators only; a copy can be requested |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


