Data Mining and Knowledge Discovery, Volume 28, Issue 4, pp 918–960

Feature selection for k-means clustering stability: theoretical analysis and an algorithm



Abstract

Stability of a learning algorithm with respect to small input perturbations is an important property, as it implies that the derived models are robust to the presence of noisy features and/or fluctuations of the data sample. The qualitative nature of the stability property hinders the development of practical, stability-optimizing data mining algorithms, since several issues naturally arise, such as: how “much” stability is enough, or how can stability be effectively associated with intrinsic data properties. In this work we address these issues and explore the effect of stability maximization in the continuous (PCA-based) k-means clustering problem. Our analysis is based on both mathematical optimization and statistical arguments that complement each other and allow for a solid interpretation of the algorithm’s stability properties. Interestingly, we derive that stability maximization naturally introduces a tradeoff between cluster separation and variance, leading to the selection of features that have a high cluster separation index that is not artificially inflated by the features’ variance. The proposed algorithmic setup is based on a Sparse PCA approach that selects the features maximizing stability in a greedy fashion. In our study, we also analyze several stability-related properties of Sparse PCA that promote it as a viable feature selection mechanism for clustering. The practical relevance of the proposed method is demonstrated in the context of cancer research, where we consider the problem of detecting potential tumor biomarkers from microarray gene expression data. The application of our method to a leukemia dataset shows that the tradeoff between cluster separation and variance leads to the selection of features corresponding to important biomarker genes. Some of these have relatively low variance and are not detected without the direct optimization of stability in Sparse PCA-based k-means.
Apart from this qualitative evaluation, we have also verified our approach as a feature selection method for k-means clustering on four cancer research datasets. The quantitative empirical results illustrate the practical utility of our framework as a feature selection mechanism for clustering.
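As a rough illustration of the greedy selection idea (not the authors' stability-optimizing algorithm, whose separation-versus-variance objective is not reproduced here), a common greedy heuristic for sparse PCA adds, at each step, the feature that most increases the leading eigenvalue of the covariance submatrix restricted to the selected set. A minimal sketch, with the function name and toy data being our own assumptions:

```python
import numpy as np

def greedy_sparse_pca(X, n_features):
    """Greedy heuristic: pick the feature subset S that maximizes the
    leading eigenvalue of the covariance submatrix C[S, S]."""
    C = np.cov(X, rowvar=False)
    selected, remaining = [], list(range(C.shape[0]))
    for _ in range(n_features):
        best, best_val = None, -np.inf
        for j in remaining:
            idx = selected + [j]
            # Largest eigenvalue of the candidate submatrix
            val = np.linalg.eigvalsh(C[np.ix_(idx, idx)])[-1]
            if val > best_val:
                best, best_val = j, val
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: two clusters separated along features 0 and 1; features 2-4 are noise.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 5)), rng.normal(0, 0.3, (50, 5))])
X[50:, :2] += 3.0

sel = greedy_sparse_pca(X, 2)
print(sorted(sel))  # the two features carrying the cluster separation
```

On this toy example the heuristic recovers the two informative features, since the between-cluster separation dominates the covariance; k-means would then be run on the selected columns only.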


Keywords: Sparse PCA · Stability · Feature selection · Clustering



Copyright information

© The Author(s) 2013

Authors and Affiliations

  1. IBM Research–Ireland, Dublin 15, Ireland
  2. Department of Computer Science, Faculty of Sciences, Radboud University, Nijmegen, The Netherlands
