Abstract
Clustering is an important approach to data analysis that partitions data according to a (dis)similarity criterion. In recent years, the problem of clustering mixed-type data has attracted many researchers, and the k-prototypes algorithm is well known for its scalability in this setting. In this paper, we discuss the limitations of the dissimilarity coefficient used in the k-prototypes algorithm, with illustrative examples. We propose a new hybrid dissimilarity coefficient for the k-prototypes algorithm that can be applied to data with numerical, categorical and mixed attributes. Our method retains the scalability of the k-prototypes algorithm, and the dissimilarity functions for both attribute types are defined on the same scale with respect to their dimensionality, which helps improve the clustering results. The efficacy of our method is shown by experiments on real and synthetic data sets.
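For orientation, the dissimilarity coefficient of the classic k-prototypes algorithm, as commonly stated from Huang's formulation, can be sketched as follows. This is a minimal illustration of that baseline measure only, not of the hybrid coefficient proposed in this paper. It adds the squared Euclidean distance over the numerical attributes to a weighted count of mismatches over the categorical attributes. The function name, the sample values and the choice `gamma=1.0` are assumptions made for this example.

```python
from typing import Sequence


def k_prototypes_dissimilarity(
    x_num: Sequence[float],
    x_cat: Sequence[str],
    q_num: Sequence[float],
    q_cat: Sequence[str],
    gamma: float,
) -> float:
    """Classic k-prototypes dissimilarity between an object x and a prototype q."""
    # Numerical part: squared Euclidean distance (unbounded, depends on attribute units).
    numeric = sum((a - b) ** 2 for a, b in zip(x_num, q_num))
    # Categorical part: simple matching, i.e. the number of mismatched attributes.
    categorical = sum(1 for a, b in zip(x_cat, q_cat) if a != b)
    # gamma weights the categorical part against the numerical part.
    return numeric + gamma * categorical


# Hypothetical object and prototype with two numerical and two categorical attributes.
x_num, x_cat = [170.0, 65.0], ["red", "yes"]
q_num, q_cat = [180.0, 60.0], ["blue", "yes"]

d = k_prototypes_dissimilarity(x_num, x_cat, q_num, q_cat, gamma=1.0)
print(d)  # 126.0
```

In this toy case the numerical part contributes 125 while the categorical part can contribute at most 2, so the two parts sit on very different scales unless the data are normalized or `gamma` is tuned. This is one plausible reading of the kind of limitation the abstract refers to, not a statement of the paper's own analysis.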
Cite this article
Sangam, R.S., Om, H. An equi-biased k-prototypes algorithm for clustering mixed-type data. Sādhanā 43, 37 (2018). https://doi.org/10.1007/s12046-018-0823-0