Galluzzo, Y., Giancarlo, R., Randazzo, M., Rombo, S.E. (2026). Burrows Wheeler Transform on a Large Scale: Algorithms Implemented in Apache Spark †. DATA, 11(3) [10.3390/data11030048].
Burrows Wheeler Transform on a Large Scale: Algorithms Implemented in Apache Spark †
Galluzzo Y.; Giancarlo R.; Randazzo M.; Rombo S. E.
2026-03-01
Abstract
With the rapid growth of Next Generation Sequencing (NGS) technologies, large amounts of "omics" data are collected daily and need to be processed. Indexing and compressing large sequence datasets are among the most important tasks in this context. Here, we propose a novel approach for computing the Burrows Wheeler transform that relies on Big Data technologies, namely Apache Spark and Hadoop. We implement three algorithms based on the MapReduce framework that distribute the index computation, and not only the input dataset, unlike previous approaches in the literature. Experimental results on real datasets show that the proposed approach is promising.

| File | Size | Format |
|---|---|---|
| data-11-00048.pdf (open access; type: publisher's version) | 809.77 kB | Adobe PDF |


