Due to the increased availability of large datasets of biological sequences, tools for sequence comparison are now relying on efficient alignment-free approaches to a greater extent. Most alignment-free approaches require the computation of statistics when comparing sequences, even if such computations may not scale well in in internal memory when very large collections of long sequences are considered. In this paper, we present a new conceptual data structure, the colored longest common prefix array (cLCP), that allows to efficiently tackle several problems with an alignment-free approach. In fact, we show that such a data structure can be computed via sequential scans in semi-external memory. By using cLCP, we propose an efficient lightweight strategy to solve the multi-string Average Common Substring (ACS) problem, that consists in the pairwise comparison of a single string against a collection of m strings simultaneously, in order to obtain m ACS induced distances. Experimental results confirm the high practical efficiency of our approach.

Garofalo, F., Rosone, G., Sciortino, M., & Verzotto, D. (2018). The colored longest common prefix array computed via sequential scans. In T. Gagie, A. Moffat, G. Navarro, & E. Cuadros-Vargas (a cura di), String Processing and Information Retrieval, 25th International Symposium, SPIRE 2018, Lima, Peru, October 9-11, 2018, Proceedings (pp. 153-167). Springer Verlag [10.1007/978-3-030-00479-8_13].

The colored longest common prefix array computed via sequential scans

Rosone, Giovanna;Sciortino, Marinella;
2018

Abstract

Due to the increased availability of large datasets of biological sequences, tools for sequence comparison are now relying on efficient alignment-free approaches to a greater extent. Most alignment-free approaches require the computation of statistics when comparing sequences, even if such computations may not scale well in in internal memory when very large collections of long sequences are considered. In this paper, we present a new conceptual data structure, the colored longest common prefix array (cLCP), that allows to efficiently tackle several problems with an alignment-free approach. In fact, we show that such a data structure can be computed via sequential scans in semi-external memory. By using cLCP, we propose an efficient lightweight strategy to solve the multi-string Average Common Substring (ACS) problem, that consists in the pairwise comparison of a single string against a collection of m strings simultaneously, in order to obtain m ACS induced distances. Experimental results confirm the high practical efficiency of our approach.
978-3-030-00478-1
978-3-030-00479-8
https://link.springer.com/book/10.1007/978-3-030-00479-8
Garofalo, F., Rosone, G., Sciortino, M., & Verzotto, D. (2018). The colored longest common prefix array computed via sequential scans. In T. Gagie, A. Moffat, G. Navarro, & E. Cuadros-Vargas (a cura di), String Processing and Information Retrieval, 25th International Symposium, SPIRE 2018, Lima, Peru, October 9-11, 2018, Proceedings (pp. 153-167). Springer Verlag [10.1007/978-3-030-00479-8_13].
File in questo prodotto:
File Dimensione Formato  
Garofalo2018_Chapter_TheColoredLongestCommonPrefixA.pdf

Solo gestori archvio

Dimensione 395.79 kB
Formato Adobe PDF
395.79 kB Adobe PDF   Visualizza/Apri   Richiedi una copia
GRSV_Spire18.pdf

accesso aperto

Tipologia: Post-print
Dimensione 372.74 kB
Formato Adobe PDF
372.74 kB Adobe PDF Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: http://hdl.handle.net/10447/337234
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 3
  • ???jsp.display-item.citation.isi??? 3
social impact