An equi-biased k-prototypes algorithm for clustering mixed-type data

Sādhanā

Abstract

Clustering is an important approach to data analysis that partitions data according to some (dis)similarity criterion. In recent years, the problem of clustering mixed-type data has attracted considerable research attention. The k-prototypes algorithm is well known for its scalability on such data. In this paper, the limitations of the dissimilarity coefficient used in the k-prototypes algorithm are discussed with illustrative examples. We propose a new hybrid dissimilarity coefficient for the k-prototypes algorithm that can be applied to data with numerical, categorical, or mixed attributes. Our method retains the scalability of the k-prototypes algorithm, and in addition the dissimilarity functions for both attribute types are defined on the same scale with respect to their dimensionality, which helps improve the quality of the clustering results. The efficacy of our method is shown by experiments on real and synthetic data sets.
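
For orientation, the dissimilarity coefficient of the classic k-prototypes algorithm, as commonly stated in the literature, adds a squared Euclidean distance over the numerical attributes to a weighted count of mismatches over the categorical attributes. The sketch below illustrates only that classic form; it is not the hybrid coefficient proposed in this paper, which the abstract does not specify. The function name, argument names, and the sample values are assumptions made for illustration.

```python
from typing import Sequence


def k_prototypes_dissimilarity(
    x_num: Sequence[float],
    x_cat: Sequence[str],
    proto_num: Sequence[float],
    proto_cat: Sequence[str],
    gamma: float = 1.0,
) -> float:
    """Classic k-prototypes dissimilarity between an object and a prototype.

    Illustrative sketch only: names and defaults are hypothetical.
    gamma is the user-chosen weight on the categorical part.
    """
    # Numerical part: squared Euclidean distance (unbounded, scale dependent).
    numeric_part = sum((a - b) ** 2 for a, b in zip(x_num, proto_num))
    # Categorical part: simple matching, one unit per mismatched attribute
    # (bounded above by the number of categorical attributes).
    categorical_part = sum(1 for a, b in zip(x_cat, proto_cat) if a != b)
    return numeric_part + gamma * categorical_part


# Hypothetical object and prototype: two numerical and two categorical attributes.
d = k_prototypes_dissimilarity(
    x_num=[1.0, 2.0],
    x_cat=["red", "small"],
    proto_num=[4.0, 6.0],
    proto_cat=["red", "large"],
    gamma=0.5,
)
print(d)  # numeric part 25.0 plus 0.5 * 1 mismatch
```

In this toy case the numerical term contributes 25.0 while the categorical term contributes at most 1.0 before weighting, which shows how the two parts can sit on very different scales unless the data are normalized or gamma is tuned.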

Correspondence to Ravi Sankar Sangam.

Sangam, R.S., Om, H. An equi-biased k-prototypes algorithm for clustering mixed-type data. Sādhanā 43, 37 (2018). https://doi.org/10.1007/s12046-018-0823-0

  • DOI: https://doi.org/10.1007/s12046-018-0823-0

Keywords

Navigation