Abstract
This paper addresses the problem of comparing clusterings that have the same number of clusters but were obtained with different clustering algorithms. It proposes a method for evaluating the agreement of clusterings based on a combination of Cohen's kappa statistic and the normalized mutual information. The main contributions of the proposed approach are: (i) reliable use in practice when the number of clusters is small and fixed, (ii) suitability for comparing clusterings with a larger number of clusters, in contrast to the original statistics, and (iii) independence of the size of the data set and of the shape of the clusters. The proposed statistic is validated experimentally on both simulated and real data sets and compared with its theoretical counterparts.
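The two ingredients named above can be illustrated with a minimal Python sketch. It computes Cohen's kappa and the normalized mutual information separately for two label sequences; it does not reproduce the combined statistic proposed in the paper, whose definition is not given in this abstract. The function names, the brute-force relabeling step, and the geometric-mean normalization of the mutual information are assumptions made for illustration only.

```python
from collections import Counter
from itertools import permutations
from math import log, sqrt


def cohen_kappa(a, b):
    """Cohen's kappa for two label sequences that share one label set."""
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from the marginal label frequencies.
    p_expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_expected == 1.0:  # both sequences use one identical label
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


def aligned_kappa(a, b):
    """Largest kappa over all relabelings of b.

    Assumption: cluster labels are arbitrary, so b is relabeled by brute
    force before kappa is computed. This enumerates k! permutations.
    """
    labels = sorted(set(a) | set(b))
    best = float("-inf")
    for perm in permutations(labels):
        mapping = dict(zip(labels, perm))
        best = max(best, cohen_kappa(a, [mapping[y] for y in b]))
    return best


def entropy(labels):
    """Shannon entropy (natural log) of the cluster size distribution."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def normalized_mutual_information(a, b):
    """Mutual information divided by sqrt(H(a) * H(b)) (assumed variant)."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))  # contingency table of the two clusterings
    mi = sum(
        c / n * log(c * n / (count_a[i] * count_b[j]))
        for (i, j), c in joint.items()
    )
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 or h_b == 0.0:  # at least one trivial clustering
        return 1.0 if h_a == h_b else 0.0
    return mi / sqrt(h_a * h_b)


if __name__ == "__main__":
    first = [0, 0, 1, 1, 2, 2]
    second = [1, 1, 2, 2, 0, 0]  # same partition, labels permuted
    print(aligned_kappa(first, second))
    print(normalized_mutual_information(first, second))
```

The permutation search in `aligned_kappa` is practical only for a small number of clusters; for larger numbers an assignment solver would be a more realistic choice. The normalized mutual information needs no relabeling because it depends only on the contingency table.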
Acknowledgements
This work has been supported by the project SILENSE, Project Number ECSEL 737487 and MSMT 8A17006.
Cite this article
Uglickich, E., Nagy, I. & Vlčková, D. Comparing clusterings using combination of the kappa statistic and entropy-based measure. METRON 77, 253–270 (2019). https://doi.org/10.1007/s40300-019-00162-5
DOI: https://doi.org/10.1007/s40300-019-00162-5