Institutional research archive of the Università degli Studi di Palermo

Informational and linguistic analysis of large genomic sequence collections via efficient Hadoop cluster algorithms

Ferraro Petrillo, Umberto; Roscigno, Gianluca; Cattaneo, Giuseppe; Giancarlo, Raffaele

2018-01-01

Abstract

Motivation: Information-theoretic and compositional/linguistic analyses of genomes play a central role in bioinformatics, all the more so now that the associated methodologies are also proving valuable for epigenomic and metagenomic studies. The kernel of those methods is the collection of k-mer statistics, i.e. how many times each k-mer in {A, C, G, T}^k occurs in a DNA sequence. Although this problem is computationally simple and efficiently solvable on a conventional computer, the sheer amount of data now available in applications demands parallel and distributed computing. Indeed, algorithms of this type have been developed to collect k-mer statistics in the realm of genome assembly. However, they are so specialized to this domain that they do not extend easily to the computation of informational and linguistic indices concurrently on sets of genomes.

Results: Following the approach, well established in many disciplines and increasingly successful in bioinformatics, of using MapReduce and Hadoop to deal with 'Big Data' problems, we present KCH, the first set of MapReduce algorithms able to perform informational and linguistic analysis of large collections of genomic sequences concurrently on a Hadoop cluster. The benchmarking of KCH that we provide indicates that it is quite effective and versatile. It is also competitive with the parallel and distributed algorithms highly specialized for k-mer statistics collection in genome assembly. In conclusion, KCH is a much-needed addition to the growing number of algorithms and tools that use MapReduce for core bioinformatics applications.
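
As an illustration of the k-mer statistics described in the abstract, the following is a minimal single-machine sketch in Python, written in a map/reduce style. It is not the KCH implementation and does not use Hadoop; the function names (`map_kmers`, `reduce_counts`, `kmer_statistics`), the sequence identifiers, and the choice to skip windows containing symbols outside A, C, G, T are all assumptions made for this example.

```python
from collections import Counter
from itertools import chain


def map_kmers(seq_id, sequence, k):
    """Map step (hypothetical): emit ((seq_id, k-mer), 1) for each length-k window."""
    sequence = sequence.upper()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        # Assumption of this sketch: windows with ambiguous bases (e.g. N) are skipped.
        if set(kmer) <= set("ACGT"):
            yield (seq_id, kmer), 1


def reduce_counts(pairs):
    """Reduce step (hypothetical): sum the emitted counts per (seq_id, k-mer) key."""
    counts = Counter()
    for key, n in pairs:
        counts[key] += n
    return counts


def kmer_statistics(sequences, k):
    """Collect per-sequence k-mer counts for a dict mapping sequence id to sequence."""
    pairs = chain.from_iterable(
        map_kmers(seq_id, seq, k) for seq_id, seq in sequences.items()
    )
    return reduce_counts(pairs)


# Toy input: two short sequences, the second containing an ambiguous base.
counts = kmer_statistics({"seq1": "ACGTACGT", "seq2": "ACGNACG"}, k=3)
for (seq_id, kmer), n in sorted(counts.items()):
    print(seq_id, kmer, n)
```

Keying the counts by both sequence identifier and k-mer keeps the statistics of each sequence separate, which is one simple way to process several sequences in a single pass; a distributed version would partition the map and reduce steps across cluster nodes.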

Date: 2018
Journal title: BIOINFORMATICS
DOI: https://dx.doi.org/10.1093/bioinformatics/bty018
Publisher URL (open access where possible): https://academic.oup.com/bioinformatics/article/34/11/1826/4802227
Citation: Ferraro Petrillo, U., Roscigno, G., Cattaneo, G., Giancarlo, R. (2018). Informational and linguistic analysis of large genomic sequence collections via efficient Hadoop cluster algorithms. BIOINFORMATICS, 34(11), 1826-1833 [10.1093/bioinformatics/bty018].
Appears in types: 1.01 Journal article

Files in this item:

File	Size	Format
bty018.pdf (archive managers only; description: Principal Paper; type: publisher's version)	597.96 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10447/291365
