Archivio istituzionale della ricerca dell'Università degli Studi di Palermo

Recent progress in the field of \{DNA\} sequencing motivates us to consider the problem of computing the Burrows‚ÄìWheeler transform (BWT) of a collection of strings. A human genome sequencing experiment might yield a billion or more sequences, each 100 characters in length. Such a dataset can now be generated in just a few days on a single sequencing machine. Many algorithms and data structures for compression and indexing of text have the \{BWT\} at their heart, and it would be of great interest to explore their applications to sequence collections such as these. However, computing the \{BWT\} for 100 billion characters or more of data remains a computational challenge. In this work we address this obstacle by presenting a methodology for computing the \{BWT\} of a string collection in a lightweight fashion. A first implementation of our algorithm needs O ( m log m ) bits of memory to process m strings, while a second variant makes additional use of external memory to achieve \{RAM\} usage that is constant with respect to m and negligible in size for a small alphabet such as DNA. The algorithms work on any number of strings and any size. We evaluate our algorithms on collections of up to 1 billion strings and compare their performance to other approaches on smaller datasets. We take further steps toward making the \{BWT\} a practical tool for processing string collections on this scale. First, we give two algorithms for recovering the strings in a collection from its BWT. Second, we show that if sequences are added to or removed from the collection, then the \{BWT\} of the original collection can be efficiently updated to obtain the \{BWT\} of the revised collection.

Bauer, M., Cox, A., Rosone, G. (2013). Lightweight algorithms for constructing and inverting the BWT of string collections. THEORETICAL COMPUTER SCIENCE, 483(0), 134-148 [10.1016/j.tcs.2012.02.002].

Lightweight algorithms for constructing and inverting the BWT of string collections

Bauer, M;Cox, A;ROSONE, Giovanna

2013-01-01

Abstract

Recent progress in the field of \{DNA\} sequencing motivates us to consider the problem of computing the Burrows‚ÄìWheeler transform (BWT) of a collection of strings. A human genome sequencing experiment might yield a billion or more sequences, each 100 characters in length. Such a dataset can now be generated in just a few days on a single sequencing machine. Many algorithms and data structures for compression and indexing of text have the \{BWT\} at their heart, and it would be of great interest to explore their applications to sequence collections such as these. However, computing the \{BWT\} for 100 billion characters or more of data remains a computational challenge. In this work we address this obstacle by presenting a methodology for computing the \{BWT\} of a string collection in a lightweight fashion. A first implementation of our algorithm needs O ( m log m ) bits of memory to process m strings, while a second variant makes additional use of external memory to achieve \{RAM\} usage that is constant with respect to m and negligible in size for a small alphabet such as DNA. The algorithms work on any number of strings and any size. We evaluate our algorithms on collections of up to 1 billion strings and compare their performance to other approaches on smaller datasets. We take further steps toward making the \{BWT\} a practical tool for processing string collections on this scale. First, we give two algorithms for recovering the strings in a collection from its BWT. Second, we show that if sequences are added to or removed from the collection, then the \{BWT\} of the original collection can be efficiently updated to obtain the \{BWT\} of the revised collection.

Scheda breve

Scheda completa

Scheda completa (DC)

	Data
	
				2013
			
	Titolo del periodico 
DATO PREVISTO SU LOGINMIUR
	
				THEORETICAL COMPUTER SCIENCE
			
	DOI del contributo 
DATO PREVISTO SU LOGINMIUR
	
				https://dx.doi.org/10.1016/j.tcs.2012.02.002
			
	Citazione
	
				Bauer, M., Cox, A., Rosone, G. (2013). Lightweight algorithms for constructing and inverting the BWT of string collections. THEORETICAL COMPUTER SCIENCE, 483(0), 134-148 [10.1016/j.tcs.2012.02.002].
			
	Appare nelle tipologie:
	
				1.01 Articolo in rivista

File in questo prodotto:

File	Dimensione	Formato
BauerCoxRosone_TCS2013.pdf Solo gestori archvio Descrizione: Articolo principale Dimensione 316.32 kB Formato Adobe PDF Visualizza/Apri Richiedi una copia	316.32 kB	Adobe PDF	Visualizza/Apri Richiedi una copia

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/10447/74890

Citazioni

ND

131

72

social impact