Background: In several contexts involving large collections of sets of biological sequences, a relevant problem is selecting significant groups of k-mers that characterize one set with respect to the others in the same collection. Results: Here, a software framework is proposed that implements a novel methodology for extracting k-mer dictionaries from multiple sets of biological sequences. It is built on recent Big Data analytics technologies, so that it can be used with a variety of input datasets of any size. Two packages are provided. The first, BioFt, enables the extraction of recurrent patterns based on k-mer frequency and the computation of other metrics from information retrieval, here specialized for biological sequences. The second, BioSet2Vec, extends the functionality of BioFt by allowing dictionaries to be created according to different criteria. Conclusions: The framework has been validated on three case studies: (1) the characterization of different chromatin states; (2) the study of associations between different diseases and related genes; (3) the analysis of genomes of different organisms. All tests performed on the considered datasets show the potential of the proposed approach.
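To make the idea of frequency-based k-mer dictionaries concrete, here is a minimal plain-Python sketch. It is not the BioFt or BioSet2Vec API, and it does not use any Big Data engine. The function names (`kmer_counts`, `tf_idf`, `top_kmers`) are hypothetical, and TF-IDF is assumed here only as one familiar example of an information-retrieval metric, since the abstract does not say which metrics the packages compute.

```python
import math
from collections import Counter


def kmer_counts(sequences, k):
    """Count every overlapping k-mer over all sequences of one set."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def tf_idf(sets, k):
    """Score each k-mer of each set with a TF-IDF-style weight.

    `sets` maps a set name to a list of sequences. Each set plays the
    role of a "document" and each k-mer the role of a "term"
    (an assumption made for this sketch).
    """
    per_set = {name: kmer_counts(seqs, k) for name, seqs in sets.items()}
    n_sets = len(per_set)

    # Number of sets in which each k-mer occurs at least once.
    set_freq = Counter()
    for counts in per_set.values():
        set_freq.update(counts.keys())

    scores = {}
    for name, counts in per_set.items():
        total = sum(counts.values())
        scores[name] = {
            kmer: (c / total) * math.log(n_sets / set_freq[kmer])
            for kmer, c in counts.items()
        }
    return scores


def top_kmers(set_scores, n):
    """Return the n highest-scoring k-mers of one set (ties broken alphabetically)."""
    ranked = sorted(set_scores.items(), key=lambda item: (-item[1], item[0]))
    return [kmer for kmer, _ in ranked[:n]]


if __name__ == "__main__":
    toy_sets = {"A": ["ACGTACGT"], "B": ["TTTTACGT"]}
    scores = tf_idf(toy_sets, k=4)
    for name in toy_sets:
        print(name, top_kmers(scores[name], 2))
```

With this weighting, a k-mer that occurs in every set scores zero, so the top-ranked k-mers of a set are those that are frequent in it and absent from most of the others.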

Galluzzo, Y., Giancarlo, R., Rombo, S.E., Utro, F. (2025). BioSet2Vec: extraction of k-mer dictionaries from multiple sets of biological sequences via big data technologies. BMC Bioinformatics, 26(1). doi:10.1186/s12859-025-06261-7.

BioSet2Vec: extraction of k-mer dictionaries from multiple sets of biological sequences via big data technologies

Galluzzo, Ylenia; Giancarlo, Raffaele; Rombo, Simona E.; Utro, F.
2025-10-27

Sector: INFO-01/A - Computer Science
Files in this item:
s12859-025-06261-7.pdf (open access; type: publisher's version; size: 5.99 MB; format: Adobe PDF)

Use this identifier to cite or link to this document: https://hdl.handle.net/10447/692745