Abstract
Clustering validation is one of the most important and challenging parts of cluster analysis, because there is no ground truth against which to compare the results. To date, evaluation methods for clustering algorithms have been used to determine the optimal number of clusters in the data, to assess the quality of clustering results through various validity criteria, and to compare results with other clustering schemes. In practice, it is also often important to build a model on a large amount of training data and then apply it repeatedly to smaller amounts of new data, which amounts to assigning new data points to existing clusters constructed on the training set. However, very little practical guidance is available on how to measure the strength of a constructed model in predicting cluster labels for new samples. In this study, we propose an extension of the cross-validation procedure to evaluate the quality of a clustering model in predicting cluster membership for new data points. The performance score is measured as the root mean squared error (RMSE), based on the information from the multiple labels of the training and testing samples. Principal component analysis (PCA) followed by the k-means clustering algorithm was used to evaluate the proposed method. The clustering model was tested on three benchmark multi-label datasets and showed promising results, with an overall RMSE of less than 0.075 and a mean absolute percentage error (MAPE) of less than 12.5% on all three datasets.
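The procedure summarized above can be illustrated with a minimal Python sketch using scikit-learn. The abstract does not give the exact scoring formula, so the comparison used here is an assumption: for each cluster, the label frequencies among training members are compared with the label frequencies among held-out points assigned to that cluster, and the RMSE is taken over clusters and labels. The function name `cv_cluster_rmse`, its parameters, and the synthetic data are hypothetical and are not taken from the study.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline


def cv_cluster_rmse(X, Y, n_clusters=3, n_components=2, n_splits=5, seed=0):
    """Cross-validated RMSE between train and test label frequencies per cluster.

    X: (n_samples, n_features) feature matrix.
    Y: (n_samples, n_labels) binary multi-label indicator matrix.
    The scoring rule is an assumption made for illustration only.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    fold_scores = []
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(X):
        # Fit PCA and k-means on the training fold only.
        model = make_pipeline(
            PCA(n_components=n_components),
            KMeans(n_clusters=n_clusters, n_init=10, random_state=seed),
        )
        train_clusters = model.fit_predict(X[train_idx])
        # Project held-out points and assign each to its nearest centroid.
        test_clusters = model.predict(X[test_idx])

        squared_errors = []
        for c in range(n_clusters):
            train_members = train_idx[train_clusters == c]
            test_members = test_idx[test_clusters == c]
            if len(train_members) == 0 or len(test_members) == 0:
                continue  # nothing to compare for this cluster in this fold
            train_freq = Y[train_members].mean(axis=0)
            test_freq = Y[test_members].mean(axis=0)
            squared_errors.append((train_freq - test_freq) ** 2)
        fold_scores.append(np.sqrt(np.mean(squared_errors)))
    return float(np.mean(fold_scores))


# Synthetic example: three well-separated groups, each with its own label profile.
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0, 0], [6, 6, 6, 6], [-6, 6, -6, 6]])
groups = rng.integers(0, 3, size=300)
X = centers[groups] + rng.normal(size=(300, 4))
label_probs = np.array([[0.9, 0.1, 0.5], [0.1, 0.9, 0.5], [0.5, 0.5, 0.9]])
Y = (rng.random((300, 3)) < label_probs[groups]).astype(int)

score = cv_cluster_rmse(X, Y)
print(f"cross-validated RMSE: {score:.3f}")
```

Fitting PCA and k-means inside each fold keeps the held-out points out of model construction, so the score reflects how well cluster assignments for new data reproduce the label structure seen in training. A lower value suggests that clusters learned on the training fold carry over to unseen points.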
Acknowledgements
The author would like to thank the reviewers of this paper for their supportive comments.
Funding
No funding was received for this study.
Ethics declarations
Conflict of interest
The author declares no competing interests.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
About this article
Cite this article
Tarekegn, A.N., Michalak, K. & Giacobini, M. Cross-Validation Approach to Evaluate Clustering Algorithms: An Experimental Study Using Multi-Label Datasets. SN COMPUT. SCI. 1, 263 (2020). https://doi.org/10.1007/s42979-020-00283-z
DOI: https://doi.org/10.1007/s42979-020-00283-z