Algorithmic paradigms for stability-based cluster validity and model selection statistical methods, with applications to microarray data analysis.

Giancarlo, R.; Utro, F.

doi:10.1016/j.tcs.2012.01.024

The advent of high throughput technologies, in particular microarrays, for biological research has revived interest in clustering, resulting in a plethora of new clustering algorithms. However, model selection, i.e., the identification of the correct number of clusters in a dataset, has received relatively little attention. Indeed, although central for statistics, its difficulty is also well known. Fortunately, a few novel techniques for model selection, representing a sharp departure from previous ones in statistics, have been proposed and gained prominence for microarray data analysis. Among those, the stability-based methods are the most robust and best performing in terms of prediction, but the slowest in terms of time. It is very unfortunate that as fascinating and classic an area of statistics as model selection, with important practical applications, has received very little attention in terms of algorithmic design and engineering. In this paper, in order to partially fill this gap, we make the following contributions: (A) the first general algorithmic paradigm for stability-based methods for model selection; (B) reductions showing that all of the known methods in this class are an instance of the proposed paradigm; (C) a novel algorithmic paradigm for the class of stability-based methods for cluster validity, i.e., methods assessing how statistically significant is a given clustering solution; (D) a general algorithmic paradigm that describes heuristic and very effective speed-ups known in the literature for stability-based model selection methods. Since the performance evaluation of model selection algorithms is mainly experimen- tal, we offer, for completeness and without even attempting to be exhaustive, a represen- tative synopsis of known experimental benchmarking results that highlight the ability of stability-based methods for model selection and the computational resources that they re- quire for the task. As a whole, the contributions of this paper generalize in several respects reference methodologies in statistics and show that algorithmic approaches can yield deep methodological insights into this area, in addition to practical computational procedures.

Giancarlo, R., Utro, F. (2012). Algorithmic paradigms for stability-based cluster validity and model selection statistical methods, with applications to microarray data analysis. THEORETICAL COMPUTER SCIENCE, 428(428), 58-79 [10.1016/j.tcs.2012.01.024].