Comparing clusterings using combination of the kappa statistic and entropy-based measure

Uglickich, Evženie; Nagy, Ivan; Vlčková, Dominika

doi:10.1007/s40300-019-00162-5

Comparing clusterings using combination of the kappa statistic and entropy-based measure

Published: 16 November 2019

METRON, Volume 77, pages 253–270 (2019)

Abstract

The paper addresses the problem of comparing clusterings that have the same number of clusters but were obtained with different clustering algorithms. It proposes a method for evaluating the agreement of clusterings based on a combination of Cohen's kappa statistic and the normalized mutual information. The main contributions of the proposed approach are: (i) reliable use in practice for a small fixed number of clusters; (ii) suitability for comparing clusterings with a higher number of clusters, in contrast with the original statistics; and (iii) independence of the size of the data set and the shape of the clusters. The paper presents results of the experimental validation of the proposed statistic on both simulated and real data sets, together with a comparison with its theoretical counterparts.
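For orientation, the two ingredients named in the abstract can be computed separately for a pair of clusterings, as in the minimal Python sketch below. It does not reproduce the combined statistic proposed in the paper, which the abstract does not specify. The function names, the brute-force relabelling step for kappa (cluster labels are arbitrary, so they are matched first), and the geometric-mean normalization of the mutual information are all assumptions made for illustration.

```python
from collections import Counter
from itertools import permutations
from math import log, sqrt


def cohen_kappa(a, b):
    """Cohen's kappa for two label sequences that share one label set."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n       # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)    # agreement expected by chance
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1.0 - p_e)


def best_matched_kappa(a, b):
    """Kappa after relabelling b with the permutation that maximizes it.

    Assumption: brute force over all permutations, so this is only
    practical for a small number of clusters.
    """
    labels = sorted(set(a) | set(b))
    best = float("-inf")
    for perm in permutations(labels):
        mapping = dict(zip(labels, perm))
        best = max(best, cohen_kappa(a, [mapping[y] for y in b]))
    return best


def entropy(labels):
    """Shannon entropy (natural log) of the cluster-size distribution."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def nmi(a, b):
    """Mutual information normalized by the geometric mean of the entropies."""
    n = len(a)
    ca, cb = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    mi = sum(c / n * log(c * n / (ca[x] * cb[y])) for (x, y), c in joint.items())
    denom = sqrt(entropy(a) * entropy(b))
    # Undefined when either clustering has a single cluster; 0.0 by convention here.
    return mi / denom if denom > 0 else 0.0


if __name__ == "__main__":
    a = [0, 0, 1, 1, 2, 2]
    b = [1, 1, 2, 2, 0, 0]   # the same partition with permuted labels
    print(cohen_kappa(a, b), best_matched_kappa(a, b), nmi(a, b))
```

In the usage example the two label vectors describe the same partition under different label names. The raw kappa is therefore negative, while the label-matched kappa and the normalized mutual information both indicate full agreement. This is why a label-invariant treatment is needed before kappa can be applied to clusterings.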

Acknowledgements

This work has been supported by the project SILENSE, Project Number ECSEL 737487 and MSMT 8A17006.

Author information

Authors and Affiliations

Department of Signal Processing, The Czech Academy of Sciences, Institute of Information Theory and Automation, Pod vodárenskou věží 4, 18208, Prague, Czech Republic
Evženie Uglickich & Ivan Nagy
Faculty of Transportation Sciences, Czech Technical University, Na Florenci 25, 11000, Prague, Czech Republic
Ivan Nagy
Faculty of Nuclear Sciences and Physical Engineering, Czech Technical University, Břehová 7, 11519, Prague, Czech Republic
Dominika Vlčková

Corresponding author

Correspondence to Evženie Uglickich.

About this article

Cite this article

Uglickich, E., Nagy, I. & Vlčková, D. Comparing clusterings using combination of the kappa statistic and entropy-based measure. METRON 77, 253–270 (2019). https://doi.org/10.1007/s40300-019-00162-5

Received: 08 March 2019
Accepted: 07 November 2019
Published: 16 November 2019
Issue Date: December 2019
DOI: https://doi.org/10.1007/s40300-019-00162-5
