
Comparing clusterings using combination of the kappa statistic and entropy-based measure

METRON

Abstract

The paper addresses the problem of comparing clusterings that have the same number of clusters but were obtained with different clustering algorithms. It proposes a method for evaluating the agreement of clusterings based on a combination of Cohen’s kappa statistic and the normalized mutual information. The main contributions of the proposed approach are: (i) reliable use in practice when the number of clusters is small and fixed, (ii) suitability for comparing clusterings with a higher number of clusters, in contrast to the original statistics, and (iii) independence of the size of the data set and the shape of the clusters. The proposed statistic is validated experimentally on both simulated and real data sets, and the results are compared with those of its theoretical counterparts.
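As a rough illustration of the two ingredients named in the abstract, the sketch below computes Cohen's kappa and a normalized mutual information for two label vectors with the same number of clusters. It is not the authors' combined statistic, whose definition is not given in this preview. The function names, the brute-force relabeling step used to align arbitrary cluster labels before computing kappa, and the geometric-mean normalization of the mutual information are all assumptions made for this example.

```python
from collections import Counter
from itertools import permutations
from math import log, sqrt


def cohen_kappa(a, b):
    """Cohen's kappa for two label sequences over the same label set."""
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    p_chance = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_chance == 1.0:  # both labelings consist of one identical label
        return 1.0
    return (p_observed - p_chance) / (1.0 - p_chance)


def best_matched_kappa(a, b):
    """Largest kappa over all relabelings of b.

    Assumption of this sketch: cluster labels are arbitrary, so b is
    relabeled by brute force. Only sensible for a small number of clusters.
    """
    labels = sorted(set(a) | set(b))
    return max(
        cohen_kappa(a, [dict(zip(labels, perm))[y] for y in b])
        for perm in permutations(labels)
    )


def entropy(labels):
    """Shannon entropy (natural log) of the cluster-size distribution."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def normalized_mutual_information(a, b):
    """I(A;B) / sqrt(H(A) * H(B)); the normalization is an assumed choice."""
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    mi = sum(
        c / n * log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0 or h_b == 0:  # at least one clustering has a single cluster
        return 1.0 if h_a == h_b else 0.0
    return mi / sqrt(h_a * h_b)


# Same partition of nine points, but with permuted cluster labels.
first = [0, 0, 0, 1, 1, 1, 2, 2, 2]
second = [1, 1, 1, 2, 2, 2, 0, 0, 0]
raw_kappa = cohen_kappa(first, second)
matched_kappa = best_matched_kappa(first, second)
nmi = normalized_mutual_information(first, second)
```

In this toy case the raw kappa is negative because the labels are permuted, while the label-matched kappa and the normalized mutual information both indicate full agreement. This shows why kappa needs a label alignment step when applied to clusterings, whereas the entropy-based measure is invariant to relabeling.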



Acknowledgements

This work has been supported by the project SILENSE, Project Number ECSEL 737487 and MSMT 8A17006.

Author information


Corresponding author

Correspondence to Evženie Uglickich.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Uglickich, E., Nagy, I. & Vlčková, D. Comparing clusterings using combination of the kappa statistic and entropy-based measure. METRON 77, 253–270 (2019). https://doi.org/10.1007/s40300-019-00162-5


  • DOI: https://doi.org/10.1007/s40300-019-00162-5
