In this paper we investigate similarity measures based on minimal absent words, introduced by Chairungsee and Crochemore in [1]. They make use of a length-weighted index on a sample set corresponding to the symmetric difference M(x)ΔM(y) of the minimal absent words M(x) and M(y) of two sequences x and y, respectively. We first propose a variant of this measure by choosing as a sample set a proper subset (x, y) of M(x)ΔM(y), which appears to be more appropriate for distinguishing x and y. From the algebraic point of view, we prove that (x, y) is the base of the ideal generated by M(x)ΔM(y). We then remark that such measures are able to recognize whether the sequences x and y share a common structure, but they are not able to detect the difference on the number of occurrences of such a structure in the two sequences. In order to take into account such a multiplicity, we introduce the notion of multifactor, and define a new measure that uses both absent words and multifactors. Surprisingly, we prove that this similarity measure coincides with a distance on sequences introduced by Ehrenfeucht and Haussler in [2], in the context of block-moves strategies. In this way, our result creates a non trivial bridge between similarity measures based on absent words and those based on the block-moves approach.

Castiglione G., Mantaci S., Restivo A. (2019). Some Investigations on Similarity Measures Based on Absent Words. FUNDAMENTA INFORMATICAE, 171(1-4), 97-112 [10.3233/FI-2020-1874].

Some Investigations on Similarity Measures Based on Absent Words

Castiglione G.
;
Mantaci S.;
2019-01-01

Abstract

In this paper we investigate similarity measures based on minimal absent words, introduced by Chairungsee and Crochemore in [1]. They make use of a length-weighted index on a sample set corresponding to the symmetric difference M(x)ΔM(y) of the minimal absent words M(x) and M(y) of two sequences x and y, respectively. We first propose a variant of this measure by choosing as a sample set a proper subset (x, y) of M(x)ΔM(y), which appears to be more appropriate for distinguishing x and y. From the algebraic point of view, we prove that (x, y) is the base of the ideal generated by M(x)ΔM(y). We then remark that such measures are able to recognize whether the sequences x and y share a common structure, but they are not able to detect the difference on the number of occurrences of such a structure in the two sequences. In order to take into account such a multiplicity, we introduce the notion of multifactor, and define a new measure that uses both absent words and multifactors. Surprisingly, we prove that this similarity measure coincides with a distance on sequences introduced by Ehrenfeucht and Haussler in [2], in the context of block-moves strategies. In this way, our result creates a non trivial bridge between similarity measures based on absent words and those based on the block-moves approach.
2019
Settore INF/01 - Informatica
Castiglione G., Mantaci S., Restivo A. (2019). Some Investigations on Similarity Measures Based on Absent Words. FUNDAMENTA INFORMATICAE, 171(1-4), 97-112 [10.3233/FI-2020-1874].
File in questo prodotto:
File Dimensione Formato  
Distances-2.pdf

accesso aperto

Descrizione: articolo
Tipologia: Post-print
Dimensione 215.2 kB
Formato Adobe PDF
215.2 kB Adobe PDF Visualizza/Apri
2620105.pdf

Solo gestori archvio

Tipologia: Versione Editoriale
Dimensione 5.08 MB
Formato Adobe PDF
5.08 MB Adobe PDF   Visualizza/Apri   Richiedi una copia

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/10447/429666
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 5
  • ???jsp.display-item.citation.isi??? 2
social impact