Decomposing a DNA sequence into k-mers and counting their frequencies defines a mapping from the sequence to a fixed-length numerical feature vector. This simple process makes it possible to compare sequences in an alignment-free way, using common similarity and distance functions on the numerical codomain of the mapping. The most commonly used decomposition takes all substrings of a fixed length k, so the dimension of the codomain grows exponentially with k. This can affect the time complexity of the similarity computation and, more generally, of the machine learning algorithm used for sequence analysis. Moreover, noisy features can also reduce classification accuracy. In this paper we propose a feature selection method that selects the most informative k-mers associated with a set of DNA sequences. The selection is based on the Motif Independent Measure (MIM), an unbiased quantitative measure of DNA sequence specificity that we recently introduced in the literature. Results computed on public datasets show the effectiveness of the proposed feature selection method.
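As an illustration of the k-mer mapping the abstract describes, the following minimal sketch builds the fixed-length frequency vector (dimension 4^k for the alphabet A, C, G, T) and compares two sequences with a Euclidean distance. It is not the authors' implementation and does not cover the MIM-based selection step. The function names, the lexicographic ordering of k-mers, the normalization to relative frequencies, and the choice to skip windows containing symbols outside ACGT are all assumptions made for this example.

```python
from itertools import product
from math import sqrt

ALPHABET = "ACGT"  # assumption: only unambiguous nucleotides are counted


def kmer_index(k):
    """Assign each of the 4**k possible k-mers a fixed position (lexicographic order)."""
    return {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}


def kmer_frequencies(seq, k):
    """Return the relative k-mer frequency vector of seq, of length 4**k."""
    index = kmer_index(k)
    counts = [0] * len(index)
    seq = seq.upper()
    # Slide a window of width k over the sequence, one position at a time.
    for i in range(len(seq) - k + 1):
        pos = index.get(seq[i:i + k])
        if pos is not None:  # skip windows with symbols outside ACGT (e.g. N)
            counts[pos] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


def euclidean(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


if __name__ == "__main__":
    a = kmer_frequencies("ACGTACGTGA", 3)
    b = kmer_frequencies("ACGTTTGTGA", 3)
    print(len(a), euclidean(a, b))
```

Because the vector length is 4^k regardless of sequence length, any two sequences can be compared directly, but the dimension grows quickly with k (64 features for k = 3, 65,536 for k = 8), which is the cost that motivates selecting a subset of k-mers.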

Lo Bosco, G., Pinello, L. (2015). A New Feature Selection Methodology for K-mers Representation of DNA Sequences. In C. Di Serio (Ed.), Computational Intelligence Methods for Bioinformatics and Biostatistics (pp. 99-108). Springer Verlag [10.1007/978-3-319-24462-4_9].

A New Feature Selection Methodology for K-mers Representation of DNA Sequences

LO BOSCO, Giosuè
2015-01-01

2015
Sector INF/01 - Computer Science
978-3-319-24461-7

Use this identifier to cite or link to this document: https://hdl.handle.net/10447/145234