Institutional research archive of the Università degli Studi di Palermo

Rocco, L.D., Ferraro Petrillo, U., Giancarlo, R., Cattaneo, G. (2026). Distributed compressive genomics: Fundamental pattern matching primitives via spark. FUTURE GENERATION COMPUTER SYSTEMS, 176 [10.1016/j.future.2025.108169].

Distributed compressive genomics: Fundamental pattern matching primitives via spark

Rocco, Lorenzo Di; Ferraro Petrillo, Umberto; Giancarlo, Raffaele; Cattaneo, Giuseppe

2026-03-01

Abstract

Compressive genomics leverages compressed data representations to enhance the efficiency of bioinformatics tasks like sequence comparison and search. Surprisingly, the fundamental operation of pattern matching on large DNA sequence collections remains unexplored in the realm of genomic analysis. However, distributed systems like Spark offer the scalability necessary to process increasingly large genomic datasets efficiently. We present the first Spark-based implementation of the FM-Index and Compressed Boyer-Moore (CBM) algorithms, evaluating their performance and providing insights into their advantages for large-scale bioinformatics applications. A comprehensive experimental study demonstrates clear performance gains over uncompressed approaches. Furthermore, we introduce SparkGeco, a distributed compressive genomics software library designed to simplify the integration of FM-Index and CBM algorithms into DNA sequence analysis pipelines within Apache Spark, thus supporting the development of efficient and scalable genomic analysis workflows. This work provides a concrete step towards high-performance, data-centric eScience solutions in computational biology.

Date: March 2026
Scientific-disciplinary sector of the contribution: INFO-01/A - Informatica
Journal title: FUTURE GENERATION COMPUTER SYSTEMS
DOI of the contribution: https://dx.doi.org/10.1016/j.future.2025.108169
Publisher URL (open access where possible): https://www.sciencedirect.com/science/article/pii/S0167739X25004637?via=ihub
Publication type: 1.01 Journal article

Files in this record:

File	Access	Type	Size	Format
1-s2.0-S0167739X25004637-main_compressed.pdf	Open access	Publisher's version	9.44 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10447/692252
