Abstract
Clustering has been widely applied to interpret the underlying patterns in microarray gene expression profiles, and many clustering algorithms have been devised for this purpose. K-means is one of the most popular algorithms for gene data clustering because of its simplicity and computational efficiency. However, the K-means algorithm is highly sensitive to the choice of initial cluster centers, so it easily gets trapped in a local optimum when the initial centers are chosen randomly. This paper proposes a deterministic initialization algorithm for K-means (DK-means) that explores a set of probable centers through a constrained bi-partitioning approach. The proposed algorithm is compared with classical K-means with random initialization, with improved K-means variants such as the K-means++ and MinMax algorithms, and with three deterministic initialization methods. Experimental analysis on gene expression datasets demonstrates that DK-means converges faster and more stably and yields better cluster quality than the other algorithms.
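The sensitivity to initial centers described above can be illustrated with a minimal sketch. It runs plain Lloyd-style K-means from several random seedings and from one deterministic seeding, then compares the final sum of squared errors (SSE). The deterministic seeding shown here is a simple farthest-first stand-in and is not the DK-means procedure, whose constrained bi-partitioning steps are not given in this abstract. The synthetic data and all function names (`lloyd`, `random_init`, `farthest_first_init`) are assumptions made for illustration.

```python
import numpy as np


def lloyd(X, centers, max_iter=100):
    """Run Lloyd iterations from the given centers; return (centers, labels, sse)."""
    centers = np.asarray(centers, dtype=float).copy()
    for _ in range(max_iter):
        # Squared Euclidean distance from every point to every center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new = centers.copy()
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members):  # an empty cluster keeps its previous center
                new[j] = members.mean(axis=0)
        if np.allclose(new, centers):
            break
        centers = new
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d.argmin(axis=1)
    sse = d[np.arange(len(X)), labels].sum()
    return centers, labels, sse


def random_init(X, k, seed):
    """Classical seeding: k distinct data points chosen uniformly at random."""
    rng = np.random.default_rng(seed)
    return X[rng.choice(len(X), size=k, replace=False)]


def farthest_first_init(X, k):
    """Deterministic stand-in (NOT DK-means): start at the point nearest the
    data mean, then repeatedly add the point farthest from the chosen centers."""
    first = ((X - X.mean(axis=0)) ** 2).sum(axis=1).argmin()
    chosen = [first]
    dist = ((X - X[first]) ** 2).sum(axis=1)
    for _ in range(k - 1):
        nxt = dist.argmax()
        chosen.append(nxt)
        dist = np.minimum(dist, ((X - X[nxt]) ** 2).sum(axis=1))
    return X[chosen]


# Hypothetical data: four well-separated 2-D blobs standing in for expression profiles.
data_rng = np.random.default_rng(0)
blob_means = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
X = np.vstack([m + 0.5 * data_rng.standard_normal((50, 2)) for m in blob_means])
k = 4

# Random seeding: the final SSE may differ from seed to seed.
random_sses = [lloyd(X, random_init(X, k, seed))[2] for seed in range(5)]

# Deterministic seeding: repeated runs give the same result.
det_centers, det_labels, det_sse = lloyd(X, farthest_first_init(X, k))
det_centers_again, _, det_sse_again = lloyd(X, farthest_first_init(X, k))

print("SSE per random seed:", np.round(random_sses, 2))
print("SSE with deterministic seeding:", round(float(det_sse), 2))
```

Because the deterministic seeding depends only on the data, rerunning it reproduces the same centers and SSE, whereas the randomly seeded runs have to be repeated and compared to judge their stability.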
Cite this article
Jothi, R., Mohanty, S.K. & Ojha, A. DK-means: a deterministic K-means clustering algorithm for gene expression analysis. Pattern Anal Applic 22, 649–667 (2019). https://doi.org/10.1007/s10044-017-0673-0
DOI: https://doi.org/10.1007/s10044-017-0673-0