Sampled string matching is an efficient approach to the string matching problem introduced in order to overcome the prohibitive space requirements of indexed matching, on the one hand, and drastically re- duce searching time for the online solutions, on the other hand. Known solutions to sampled string matching are able to speed up the searching up to 9 times while using less than 11% of the text size. However, they appear to work efficiently only in the case of natural language texts or, in general, when searching on input sequences over large alphabets. In this paper we extend sampled-string matching to the case of small al- phabets obtaining a new efficient solution which turns out to be feasible also for searching biological data like genome or protein sequences. Our solution extends a recent approach called Character Distance Sampling by using condensed characters in order to enlarge the size of the alpha- bet and speed up the searching process. From our experimental results it turns out that our solution significantly reduces the searching time when compared against previous sampled string matching algorithms. In particular on biological data it leads to reduce the space consumption up to 80% and to speed up the searching up to 96%, thus improving the standard online string matching algorithm up to 99.6% using less than 1% of the original text size.

Faro S., Marino F.P., Pavone A. (2021). Enhancing Characters Distance Text Sampling by Condensed Alphabets. In C. Sacerdoti Coen, I. Salvo (a cura di), Proceedings of the 22nd Italian Conference on Theoretical Computer Science (pp. 1-15).

Enhancing Characters Distance Text Sampling by Condensed Alphabets

Pavone A.
2021-01-01

Abstract

Sampled string matching is an efficient approach to the string matching problem introduced in order to overcome the prohibitive space requirements of indexed matching, on the one hand, and drastically re- duce searching time for the online solutions, on the other hand. Known solutions to sampled string matching are able to speed up the searching up to 9 times while using less than 11% of the text size. However, they appear to work efficiently only in the case of natural language texts or, in general, when searching on input sequences over large alphabets. In this paper we extend sampled-string matching to the case of small al- phabets obtaining a new efficient solution which turns out to be feasible also for searching biological data like genome or protein sequences. Our solution extends a recent approach called Character Distance Sampling by using condensed characters in order to enlarge the size of the alpha- bet and speed up the searching process. From our experimental results it turns out that our solution significantly reduces the searching time when compared against previous sampled string matching algorithms. In particular on biological data it leads to reduce the space consumption up to 80% and to speed up the searching up to 96%, thus improving the standard online string matching algorithm up to 99.6% using less than 1% of the original text size.
2021
Settore INF/01 - Informatica
Faro S., Marino F.P., Pavone A. (2021). Enhancing Characters Distance Text Sampling by Condensed Alphabets. In C. Sacerdoti Coen, I. Salvo (a cura di), Proceedings of the 22nd Italian Conference on Theoretical Computer Science (pp. 1-15).
File in questo prodotto:
File Dimensione Formato  
Pubblicazione 2021-ICTCS.pdf

accesso aperto

Tipologia: Versione Editoriale
Dimensione 150.83 kB
Formato Adobe PDF
150.83 kB Adobe PDF Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/10447/640632
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 2
  • ???jsp.display-item.citation.isi??? ND
social impact