Computational cluster validation in post-genomic data analysis

Julia Handl, Joshua Knowles and Douglas Kell

Keywords: cluster validation, clustering, evaluation, validation measure, validity index, internal, external, Rand index, F-measure

Abstract

The discovery of novel biological knowledge from the ab initio analysis of post-genomic data relies upon the use of unsupervised processing methods, in particular clustering techniques. Much recent research in bioinformatics has therefore focused on the transfer of clustering methods introduced in other scientific fields, and on the development of novel algorithms specifically designed to tackle the challenges posed by post-genomic data. The partitions returned by a clustering algorithm are commonly validated using visual inspection and concordance with prior biological knowledge; whether the clusters actually correspond to real structure in the data is somewhat less frequently considered. Suitable computational cluster validation techniques are available in the general data-mining literature, but have been given only a fraction of the same attention in bioinformatics. This review paper aims to familiarize the reader with the battery of techniques available for the validation of clustering results, with a particular focus on their application to post-genomic data. Synthetic and real biological data sets are used to demonstrate the benefits, and also some of the perils, of analytical cluster validation.

Downloads:
Paper
Supplementary material
C++ Code and Data sets used in the paper

Solution visualization in two-objective space. Shown are the solutions (averages over 21 runs) for k-means, SOM, average link and single link on the Leukemia data set in a plot of Connectivity versus Variance. The knee corresponding to the three-cluster solution is clearly pronounced. The visualization also shows the consistency between the k-means, SOM and average link solutions for k=2 and k=3, which further increases the confidence in the correctness of these partitionings.
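
The two axes of the plot described above can be illustrated with a minimal Python sketch. It is not the C++ code distributed with the paper. The definitions are assumptions based on common usage: connectivity adds a penalty of 1/j whenever a point's j-th nearest neighbour lies in a different cluster, and variance is the square root of the mean squared Euclidean distance of each point to its cluster centroid. The function names and the parameter n_neighbours are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def connectivity(points, labels, n_neighbours=1):
    """Assumed definition: for every point, add 1/j if its j-th nearest
    neighbour (j = 1..n_neighbours) has a different cluster label.
    Lower values mean neighbouring points are kept together."""
    total = 0.0
    for i, p in enumerate(points):
        # Indices of all other points, ordered by distance to point i.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: euclidean(p, points[j]),
        )
        for rank, j in enumerate(others[:n_neighbours], start=1):
            if labels[j] != labels[i]:
                total += 1.0 / rank
    return total


def intra_cluster_variance(points, labels):
    """Assumed definition: square root of the mean squared distance
    between each point and the centroid of its cluster.
    Lower values mean more compact clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    squared_sum = 0.0
    for members in clusters.values():
        dim = len(members[0])
        centroid = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        squared_sum += sum(euclidean(m, centroid) ** 2 for m in members)
    return math.sqrt(squared_sum / len(points))


# Toy data (illustrative only): two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
natural = [0, 0, 1, 1]   # groups each pair together
shuffled = [0, 1, 0, 1]  # splits each pair across clusters

for name, labels in (("natural", natural), ("shuffled", shuffled)):
    print(name, connectivity(points, labels), intra_cluster_variance(points, labels))
```

Plotting one (connectivity, variance) pair per partitioning, for a range of k and for several algorithms, gives a plot of the kind described in the caption. Both measures are minimised, so a partitioning that scores low on both is supported by two complementary notions of cluster quality.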