
A dissimilarity measure for mixed nominal and ordinal attribute data in k-Modes algorithm

Applied Intelligence

Abstract

The k-Means algorithm is one of the most commonly used clustering methods. Its extension, the k-Modes algorithm, replaces means with modes and has been widely applied to clustering categorical data. In practice, however, many data sets are mixed-type, containing categorical, ordinal and numerical attributes. Mixed-type data clustering has recently attracted much attention from the data mining research community, but most existing methods overlook ordinal attributes and do not establish an explicit similarity metric for them. In this paper, the limitations of some existing dissimilarity measures for the k-Modes algorithm on mixed ordinal and nominal data are analyzed using illustrative examples. Based on the idea of mining the order information carried by ordinal attributes, a new dissimilarity measure is proposed that lets the k-Modes algorithm cluster this type of data. The distinguishing characteristic of the new measure is that it takes the order information of ordinal attributes into account. A convergence study and a time complexity analysis of the k-Modes algorithm based on this new dissimilarity measure indicate that it can be used effectively on large data sets. Comparative experiments on nine real data sets from UCI show the effectiveness of the new dissimilarity measure.
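To make the idea in the abstract concrete, the sketch below shows one simple way a dissimilarity could treat nominal and ordinal attributes differently, together with the mode update that k-Modes uses in place of a mean. It is an illustrative assumption, not the measure proposed in the paper: the abstract does not give the formula, so the ordinal term here is just a normalized rank gap, and the names `mixed_dissimilarity`, `column_modes` and `ordinal_levels` are hypothetical.

```python
from collections import Counter


def mixed_dissimilarity(x, y, ordinal_levels):
    """Sum per-attribute dissimilarities between two records.

    ordinal_levels maps an attribute index to its ordered list of levels.
    Attributes absent from the map are treated as nominal.
    NOTE: the normalized rank gap is an assumption for illustration,
    not the paper's measure.
    """
    total = 0.0
    for j, (a, b) in enumerate(zip(x, y)):
        levels = ordinal_levels.get(j)
        if levels is None:
            # Nominal attribute: simple matching (0 if equal, 1 otherwise).
            total += 0.0 if a == b else 1.0
        else:
            # Ordinal attribute: rank gap scaled to the range [0, 1].
            span = max(len(levels) - 1, 1)
            total += abs(levels.index(a) - levels.index(b)) / span
    return total


def column_modes(records):
    """Return the most frequent value of each attribute (the k-Modes 'center')."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))


# Attribute 0 is nominal (colour); attribute 1 is ordinal (size).
levels = {1: ["low", "medium", "high"]}
x = ("red", "low")
y = ("blue", "medium")
z = ("blue", "high")

d_xy = mixed_dissimilarity(x, y, levels)
d_xz = mixed_dissimilarity(x, z, levels)
center = column_modes([x, z, z])
```

With plain simple matching, `x` would be equally far from `y` and `z`, since both differ from it in two attributes. Using the order of the levels, `y` comes out closer to `x` than `z` does, which is the kind of information a purely nominal measure discards.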



Acknowledgments

We would like to thank the anonymous reviewers for their helpful suggestions. This work was supported by the National Natural Science Foundation of China (61573266).

Author information

Corresponding author

Correspondence to Fang Yuan.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Yuan, F., Yang, Y. & Yuan, T. A dissimilarity measure for mixed nominal and ordinal attribute data in k-Modes algorithm. Appl Intell 50, 1498–1509 (2020). https://doi.org/10.1007/s10489-019-01583-5

  • DOI: https://doi.org/10.1007/s10489-019-01583-5
