Comparison of Similarity Measures for Categorical Data in Hierarchical Clustering

  • Zdeněk Šulc
  • Hana Řezanková


This paper examines similarity measures for categorical data in hierarchical clustering that can handle variables with more than two categories and that aim to replace the simple matching approach commonly used in this area. These measures take into account additional characteristics of a dataset, such as the frequency distribution of categories or the number of categories of a given variable. The paper has two main aims: first, to compare and evaluate the selected similarity measures with respect to the quality of the clusters produced in hierarchical clustering; second, to propose new similarity measures for nominal variables. All the examined similarity measures are compared on cluster quality using the mean ranked scores of two internal evaluation coefficients. The analysis is performed on generated datasets, which makes it possible to determine the situations in which a particular similarity measure is recommended.
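
As an illustration of the contrast drawn above, the following minimal Python sketch compares simple matching with one measure that uses the number of categories of each variable. The Eskin-style formula (a mismatch on a variable with n categories scores n²/(n² + 2)) is assumed here as a representative example; the abstract does not state which measures the paper examines, and the toy dataset and function names are hypothetical.

```python
# Toy nominal dataset (hypothetical): each row is an object, each column a variable.
data = [
    ("red", "small", "a"),
    ("red", "large", "b"),
    ("blue", "small", "c"),
    ("green", "large", "d"),
]

# Number of distinct categories observed for each variable.
n_categories = [len({row[k] for row in data}) for k in range(len(data[0]))]


def simple_matching(x, y):
    """Share of variables on which the two objects take the same category."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


def eskin(x, y, n_cats):
    """Eskin-style similarity (assumed formula, for illustration only).

    A match scores 1; a mismatch on a variable with n categories scores
    n^2 / (n^2 + 2), so mismatches on variables with many categories
    are penalised less than mismatches on variables with few categories.
    """
    total = 0.0
    for a, b, n in zip(x, y, n_cats):
        total += 1.0 if a == b else (n * n) / (n * n + 2)
    return total / len(x)


sm = simple_matching(data[0], data[1])
es = eskin(data[0], data[1], n_categories)
print(n_categories, round(sm, 4), round(es, 4))
```

For hierarchical clustering, such similarities would typically be converted to dissimilarities first, for example as 1 minus the similarity; that conversion is likewise an assumption of this sketch, not a statement about the paper's procedure.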


Similarity measures, Nominal variables, Hierarchical cluster analysis, Comparison, Evaluation


Funding Information

This paper was supported by the University of Economics, Prague under the IGA project no. F4/41/2016.



Copyright information

© The Classification Society 2019

Authors and Affiliations

  1. Department of Statistics and Probability, University of Economics, Prague, Prague 3, Czech Republic
