
Benchmarking Representations for Speech, Music, and Acoustic Events

Siniscalchi S. M.
2024-01-01

Abstract

Limited diversity in standardized benchmarks for evaluating audio representation learning (ARL) methods may hinder systematic comparison of current methods' capabilities. We present ARCH, a comprehensive benchmark for evaluating ARL methods on diverse audio classification domains, covering acoustic events, music, and speech. ARCH comprises 12 datasets that allow us to thoroughly assess pre-trained SSL models of different sizes. ARCH streamlines benchmarking of ARL techniques through its unified access to a wide range of domains and its ability to readily incorporate new datasets and models. To address the current lack of open-source, pre-trained models for non-speech audio, we also release new pre-trained models that demonstrate strong performance on non-speech datasets. We argue that the presented wide-ranging evaluation provides valuable insights into state-of-the-art ARL methods and helps pinpoint promising research directions.
Year: 2024
ISBN: 979-8-3503-7451-3
La Quatra M., Koudounas A., Vaiani L., Baralis E., Cagliero L., Garza P., et al. (2024). Benchmarking Representations for Speech, Music, and Acoustic Events. In 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings (pp. 505-509). Institute of Electrical and Electronics Engineers Inc. [10.1109/ICASSPW62465.2024.10625960].
Files in this item:
  • File: Benchmarking_Representations_for_Speech_Music_and_Acoustic_Events.pdf
  • Type: Publisher's version (Versione Editoriale)
  • Size: 844.26 kB
  • Format: Adobe PDF
  • Access: archive administrators only (copy available on request)

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10447/663738
Citations
  • PMC: ND
  • Scopus: 18
  • Web of Science: 7