Recent progress in the field of \{DNA\} sequencing motivates us to consider the problem of computing the Burrows–Wheeler transform (BWT) of a collection of strings. A human genome sequencing experiment might yield a billion or more sequences, each 100 characters in length. Such a dataset can now be generated in just a few days on a single sequencing machine. Many algorithms and data structures for compression and indexing of text have the \{BWT\} at their heart, and it would be of great interest to explore their applications to sequence collections such as these. However, computing the \{BWT\} for 100 billion characters or more of data remains a computational challenge. In this work we address this obstacle by presenting a methodology for computing the \{BWT\} of a string collection in a lightweight fashion. A first implementation of our algorithm needs O ( m log m ) bits of memory to process m strings, while a second variant makes additional use of external memory to achieve \{RAM\} usage that is constant with respect to m and negligible in size for a small alphabet such as DNA. The algorithms work on any number of strings and any size. We evaluate our algorithms on collections of up to 1 billion strings and compare their performance to other approaches on smaller datasets. We take further steps toward making the \{BWT\} a practical tool for processing string collections on this scale. First, we give two algorithms for recovering the strings in a collection from its BWT. Second, we show that if sequences are added to or removed from the collection, then the \{BWT\} of the original collection can be efficiently updated to obtain the \{BWT\} of the revised collection.

Bauer, M., Cox, A., Rosone, G. (2013). Lightweight algorithms for constructing and inverting the BWT of string collections. THEORETICAL COMPUTER SCIENCE, 483(0), 134-148 [10.1016/j.tcs.2012.02.002].

Lightweight algorithms for constructing and inverting the BWT of string collections

ROSONE, Giovanna
2013-01-01

Abstract

Recent progress in the field of \{DNA\} sequencing motivates us to consider the problem of computing the Burrows–Wheeler transform (BWT) of a collection of strings. A human genome sequencing experiment might yield a billion or more sequences, each 100 characters in length. Such a dataset can now be generated in just a few days on a single sequencing machine. Many algorithms and data structures for compression and indexing of text have the \{BWT\} at their heart, and it would be of great interest to explore their applications to sequence collections such as these. However, computing the \{BWT\} for 100 billion characters or more of data remains a computational challenge. In this work we address this obstacle by presenting a methodology for computing the \{BWT\} of a string collection in a lightweight fashion. A first implementation of our algorithm needs O ( m log m ) bits of memory to process m strings, while a second variant makes additional use of external memory to achieve \{RAM\} usage that is constant with respect to m and negligible in size for a small alphabet such as DNA. The algorithms work on any number of strings and any size. We evaluate our algorithms on collections of up to 1 billion strings and compare their performance to other approaches on smaller datasets. We take further steps toward making the \{BWT\} a practical tool for processing string collections on this scale. First, we give two algorithms for recovering the strings in a collection from its BWT. Second, we show that if sequences are added to or removed from the collection, then the \{BWT\} of the original collection can be efficiently updated to obtain the \{BWT\} of the revised collection.
2013
Bauer, M., Cox, A., Rosone, G. (2013). Lightweight algorithms for constructing and inverting the BWT of string collections. THEORETICAL COMPUTER SCIENCE, 483(0), 134-148 [10.1016/j.tcs.2012.02.002].
File in questo prodotto:
File Dimensione Formato  
BauerCoxRosone_TCS2013.pdf

Solo gestori archvio

Descrizione: Articolo principale
Dimensione 316.32 kB
Formato Adobe PDF
316.32 kB Adobe PDF   Visualizza/Apri   Richiedi una copia

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/10447/74890
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 124
  • ???jsp.display-item.citation.isi??? 54
social impact