Archivio istituzionale della ricerca dell'Università degli Studi di Palermo

Shannon’s entropy is a definitive lower bound for statistical compression. Unfortunately, no such clear measure exists for the compressibility of repetitive strings. Thus, ad hoc measures are employed to estimate the repetitiveness of strings, e.g., the size z of the Lempel–Ziv parse or the number r of equal-letter runs of the Burrows-Wheeler transform. A more recent one is the size γ of a smallest string attractor. Let T be a string of length n. A string attractor of T is a set of positions of T capturing the occurrences of all the substrings of T. Unfortunately, Kempa and Prezza [STOC 2018] showed that computing γ is NP-hard. Kociumaka et al. [LATIN 2020] considered a new measure of compressibility that is based on the function S_T(k) counting the number of distinct substrings of length k of T, also known as the substring complexity of T. This new measure is defined as δ = sup{S_T(k)/k, k ≥ 1} and lower bounds all the relevant ad hoc measures previously considered. In particular, δ ≤ γ always holds and δ can be computed in 𝒪(n) time using Θ(n) working space. Kociumaka et al. showed that one can construct an 𝒪(δ log n/(δ))-sized representation of T supporting efficient direct access and efficient pattern matching queries on T. Given that for highly compressible strings, δ is significantly smaller than n, it is natural to pose the following question: Can we compute δ efficiently using sublinear working space? It is straightforward to show that in the comparison model, any algorithm computing δ using 𝒪(b) space requires Ω(n^{2-o(1)}/b) time through a reduction from the element distinctness problem [Yao, SIAM J. Comput. 1994]. We thus wanted to investigate whether we can indeed match this lower bound. We address this algorithmic challenge by showing the following bounds to compute δ: - 𝒪((n³log b)/b²) time using 𝒪(b) space, for any b ∈ [1,n], in the comparison model. - 𝒪̃(n²/b) time using 𝒪̃(b) space, for any b ∈ [√n,n], in the word RAM model. This gives an 𝒪̃(n^{1+ε})-time and 𝒪̃(n^{1-ε})-space algorithm to compute δ, for any 0 < ε ≤ 1/2. Let us remark that our algorithms compute S_T(k), for all k, within the same complexities.

Giulia Bernardini, G.F. (2023). Substring Complexity in Sublinear Space. In S. Iwata, N. Kakimura (a cura di), 34th International Symposium on Algorithms and Computation (ISAAC 2023). Schloss-Dagstuhl - Leibniz Zentrum für Informatik [10.4230/LIPIcs.ISAAC.2023.12].

Substring Complexity in Sublinear Space

Giulia Bernardini;Gabriele Fici;Paweł Gawrychowski;and Solon P. Pissis.

2023-11-28

Abstract

Shannon’s entropy is a definitive lower bound for statistical compression. Unfortunately, no such clear measure exists for the compressibility of repetitive strings. Thus, ad hoc measures are employed to estimate the repetitiveness of strings, e.g., the size z of the Lempel–Ziv parse or the number r of equal-letter runs of the Burrows-Wheeler transform. A more recent one is the size γ of a smallest string attractor. Let T be a string of length n. A string attractor of T is a set of positions of T capturing the occurrences of all the substrings of T. Unfortunately, Kempa and Prezza [STOC 2018] showed that computing γ is NP-hard. Kociumaka et al. [LATIN 2020] considered a new measure of compressibility that is based on the function S_T(k) counting the number of distinct substrings of length k of T, also known as the substring complexity of T. This new measure is defined as δ = sup{S_T(k)/k, k ≥ 1} and lower bounds all the relevant ad hoc measures previously considered. In particular, δ ≤ γ always holds and δ can be computed in 𝒪(n) time using Θ(n) working space. Kociumaka et al. showed that one can construct an 𝒪(δ log n/(δ))-sized representation of T supporting efficient direct access and efficient pattern matching queries on T. Given that for highly compressible strings, δ is significantly smaller than n, it is natural to pose the following question: Can we compute δ efficiently using sublinear working space? It is straightforward to show that in the comparison model, any algorithm computing δ using 𝒪(b) space requires Ω(n^{2-o(1)}/b) time through a reduction from the element distinctness problem [Yao, SIAM J. Comput. 1994]. We thus wanted to investigate whether we can indeed match this lower bound. We address this algorithmic challenge by showing the following bounds to compute δ: - 𝒪((n³log b)/b²) time using 𝒪(b) space, for any b ∈ [1,n], in the comparison model. - 𝒪̃(n²/b) time using 𝒪̃(b) space, for any b ∈ [√n,n], in the word RAM model. This gives an 𝒪̃(n^{1+ε})-time and 𝒪̃(n^{1-ε})-space algorithm to compute δ, for any 0 < ε ≤ 1/2. Let us remark that our algorithms compute S_T(k), for all k, within the same complexities.

Scheda breve

Scheda completa

Scheda completa (DC)

	Data
	
				28-nov-2023
			
	ISBN della monografia 
DATO PREVISTO SU LOGINMIUR
	
				978-3-95977-289-1
			
	DOI del contributo 
DATO PREVISTO SU LOGINMIUR
	
				https://dx.doi.org/10.4230/LIPIcs.ISAAC.2023.12
			
	URL dell'editore (Open access ove possibile)
	
				https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ISAAC.2023.12
			
	Citazione
	
				Giulia Bernardini, G.F. (2023). Substring Complexity in Sublinear Space. In S. Iwata, N. Kakimura (a cura di), 34th International Symposium on Algorithms and Computation (ISAAC 2023). Schloss-Dagstuhl - Leibniz Zentrum für Informatik [10.4230/LIPIcs.ISAAC.2023.12].
			
	Appare nelle tipologie:
	
				2.07 Contributo in atti di convegno pubblicato in volume

File in questo prodotto:

File	Dimensione	Formato
LIPIcs.ISAAC.2023.12.pdf accesso aperto Tipologia: Versione Editoriale Dimensione 990.9 kB Formato Adobe PDF Visualizza/Apri	990.9 kB	Adobe PDF	Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/10447/619240

Citazioni

ND

3

1

social impact