Recent progress in the field of \{DNA\} sequencing motivates us to consider the problem of computing the Burrows–Wheeler transform (BWT) of a collection of strings. A human genome sequencing experiment might yield a billion or more sequences, each 100 characters in length. Such a dataset can now be generated in just a few days on a single sequencing machine. Many algorithms and data structures for compression and indexing of text have the \{BWT\} at their heart, and it would be of great interest to explore their applications to sequence collections such as these. However, computing the \{BWT\} for 100 billion characters or more of data remains a computational challenge. In this work we address this obstacle by presenting a methodology for computing the \{BWT\} of a string collection in a lightweight fashion. A first implementation of our algorithm needs O ( m log m ) bits of memory to process m strings, while a second variant makes additional use of external memory to achieve \{RAM\} usage that is constant with respect to m and negligible in size for a small alphabet such as DNA. The algorithms work on any number of strings and any size. We evaluate our algorithms on collections of up to 1 billion strings and compare their performance to other approaches on smaller datasets. We take further steps toward making the \{BWT\} a practical tool for processing string collections on this scale. First, we give two algorithms for recovering the strings in a collection from its BWT. Second, we show that if sequences are added to or removed from the collection, then the \{BWT\} of the original collection can be efficiently updated to obtain the \{BWT\} of the revised collection.
Bauer, M., Cox, A., Rosone, G. (2013). Lightweight algorithms for constructing and inverting the BWT of string collections. THEORETICAL COMPUTER SCIENCE, 483(0), 134-148 [10.1016/j.tcs.2012.02.002].
Lightweight algorithms for constructing and inverting the BWT of string collections
ROSONE, Giovanna
2013-01-01
Abstract
Recent progress in the field of \{DNA\} sequencing motivates us to consider the problem of computing the Burrows–Wheeler transform (BWT) of a collection of strings. A human genome sequencing experiment might yield a billion or more sequences, each 100 characters in length. Such a dataset can now be generated in just a few days on a single sequencing machine. Many algorithms and data structures for compression and indexing of text have the \{BWT\} at their heart, and it would be of great interest to explore their applications to sequence collections such as these. However, computing the \{BWT\} for 100 billion characters or more of data remains a computational challenge. In this work we address this obstacle by presenting a methodology for computing the \{BWT\} of a string collection in a lightweight fashion. A first implementation of our algorithm needs O ( m log m ) bits of memory to process m strings, while a second variant makes additional use of external memory to achieve \{RAM\} usage that is constant with respect to m and negligible in size for a small alphabet such as DNA. The algorithms work on any number of strings and any size. We evaluate our algorithms on collections of up to 1 billion strings and compare their performance to other approaches on smaller datasets. We take further steps toward making the \{BWT\} a practical tool for processing string collections on this scale. First, we give two algorithms for recovering the strings in a collection from its BWT. Second, we show that if sequences are added to or removed from the collection, then the \{BWT\} of the original collection can be efficiently updated to obtain the \{BWT\} of the revised collection.File | Dimensione | Formato | |
---|---|---|---|
BauerCoxRosone_TCS2013.pdf
Solo gestori archvio
Descrizione: Articolo principale
Dimensione
316.32 kB
Formato
Adobe PDF
|
316.32 kB | Adobe PDF | Visualizza/Apri Richiedi una copia |
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.