Archivio istituzionale della ricerca dell'Università degli Studi di Palermo

The advent of "next-generation" DNA sequencing (NGS) technologies has meant that collections of hundreds of millions of DNA sequences are now commonplace in bioinformatics. Knowing the longest common prefix array (LCP) of such a collection would facilitate the rapid computation of maximal exact matches, shortest unique substrings and shortest absent words. CPU-efficient algorithms for computing the LCP of a string have been described in the literature, but require the presence in RAM of large data structures. This prevents such methods from being feasible for NGS datasets. In this paper we propose the first lightweight method that simultaneously computes, via sequential scans, the LCP and BWT of very large collections of sequences. Computational results on collections as large as 800 million 100-mers demonstrate that our algorithm scales to the vast sequence collections encountered in human whole genome sequencing experiments.

Bauer, M.J., Cox, A.J., Rosone, G., Sciortino, M. (2012). Lightweight LCP Construction for Next-Generation Sequencing Datasets. In ALGORITHMS IN BIOINFORMATICS (pp. 326-337). Springer Berlin / Heidelberg [10.1007/978-3-642-33122-0_26].

Lightweight LCP Construction for Next-Generation Sequencing Datasets

Bauer, MJ;Cox, AJ;Rosone, G;SCIORTINO, Marinella

2012-01-01

Abstract

The advent of "next-generation" DNA sequencing (NGS) technologies has meant that collections of hundreds of millions of DNA sequences are now commonplace in bioinformatics. Knowing the longest common prefix array (LCP) of such a collection would facilitate the rapid computation of maximal exact matches, shortest unique substrings and shortest absent words. CPU-efficient algorithms for computing the LCP of a string have been described in the literature, but require the presence in RAM of large data structures. This prevents such methods from being feasible for NGS datasets. In this paper we propose the first lightweight method that simultaneously computes, via sequential scans, the LCP and BWT of very large collections of sequences. Computational results on collections as large as 800 million 100-mers demonstrate that our algorithm scales to the vast sequence collections encountered in human whole genome sequencing experiments.

Scheda breve

Scheda completa

Scheda completa (DC)

	Data
	
			2012
		
	Settore scientifico disciplinare del contributo
	
			Settore INF/01 - Informatica
		
	DOI del contributo 
DATO PREVISTO SU LOGINMIUR
	
			https://dx.doi.org/10.1007/978-3-642-33122-0_26
		
	URL alternativo rispetto a quello dell'editore 
DATO PREVISTO SU LOGINMIUR
	
			http://www.springerlink.com/content/1750555710m65834/
		
	Citazione
	
			Bauer, M.J., Cox, A.J., Rosone, G., Sciortino, M. (2012). Lightweight LCP Construction for Next-Generation Sequencing Datasets. In ALGORITHMS IN BIOINFORMATICS (pp. 326-337). Springer Berlin / Heidelberg [10.1007/978-3-642-33122-0_26].
		
	Appare nelle tipologie:
	
			2.01 Capitolo o Saggio

File in questo prodotto:

File	Dimensione	Formato
BauerCoxRosoneSciortino_LCP_WABI_2012.pdf Solo gestori archvio Descrizione: Articolo principale Dimensione 224.49 kB Formato Adobe PDF Visualizza/Apri Richiedi una copia	224.49 kB	Adobe PDF	Visualizza/Apri Richiedi una copia

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/10447/65665

Citazioni

ND

27

ND

social impact