Institutional Research Archive of the Università degli Studi di Palermo


Lo Bosco, G., Pinello, L. (2014). A new feature selection strategy for K-mers sequence representation. In C. Di Serio, P. Liò, S. Richardson, R. Tagliaferri (Eds.), Computational Intelligence Methods for Bioinformatics and Biostatistics, CIBB 2014 (pp. 1-6).

A new feature selection strategy for K-mers sequence representation

Lo Bosco, Giosuè; Pinello, L.

2014-01-01

Abstract

Decomposing a DNA sequence into k-mers (substrings of length k) and counting their frequencies defines a mapping of the sequence into a numerical space, represented by a numerical feature vector of fixed length. This simple process allows sequences to be compared in an alignment-free way, using common similarity and distance functions on the numerical codomain of the mapping. The most commonly used decomposition takes all substrings of length k, so the dimension of the codomain grows exponentially with k. This can affect the time complexity of the similarity computation and, more generally, of the machine learning algorithm used for sequence classification. Moreover, noisy features can seriously affect classification accuracy. In this paper we propose a feature selection method able to select the most informative k-mers associated with a set of DNA sequences. The selection is based on the Motif Independent Measure (MIM), an unbiased quantitative measure of DNA sequence specificity that we recently introduced in the literature. Results computed on three public datasets using a support vector machine classifier show the effectiveness of the proposed feature selection method.
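As a rough illustration of the k-mer mapping described in the abstract, the following sketch builds a fixed-length frequency vector over all 4^k possible k-mers and compares two sequences with a Euclidean distance. It is not the authors' implementation: the function names `kmer_vector` and `euclidean`, the normalization of counts to relative frequencies, and the choice of Euclidean distance are assumptions made here for illustration. The MIM-based selection step is not shown, since the abstract does not define the measure.

```python
from collections import Counter
from itertools import product
import math

ALPHABET = "ACGT"  # assumed DNA alphabet; other symbols (e.g. N) are ignored


def kmer_vector(seq, k):
    """Map a DNA sequence to a vector of relative k-mer frequencies.

    The vector has one entry per possible k-mer, so its length is 4**k.
    """
    seq = seq.upper()
    # Enumerate all possible k-mers in a fixed (lexicographic) order.
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    # Count every overlapping window of length k in the sequence.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Normalize by the number of windows made only of A, C, G, T.
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


if __name__ == "__main__":
    x = kmer_vector("ACGTACGT", 2)
    y = kmer_vector("ACGTTTTT", 2)
    print(len(x), euclidean(x, y))
```

Because the vector length is 4^k regardless of sequence length, any two sequences can be compared without alignment, but the number of features grows quickly with k, which is the motivation for feature selection given in the abstract.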


Date: 2014
ISBN of the monograph: 9788890643743
Citation: Lo Bosco, G., Pinello, L. (2014). A new feature selection strategy for K-mers sequence representation. In C. Di Serio, P. Liò, S. Richardson, R. Tagliaferri (Eds.), Computational Intelligence Methods for Bioinformatics and Biostatistics, CIBB 2014 (pp. 1-6).
Appears in types: 2.07 Contribution in conference proceedings published in a volume

Files in this item:

File	Access	Type	Size	Format
Lo Bosco, Pinello - 2014 - A new feature selection strategy for k-mers sequence representation .pdf	Archive managers only	Publisher's version	76.35 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10447/96647
