An equi-biased k-prototypes algorithm for clustering mixed-type data

Sangam, Ravi Sankar; Om, Hari

doi:10.1007/s12046-018-0823-0

Published: 14 March 2018

Sādhanā, volume 43, article number 37 (2018)

Ravi Sankar Sangam¹ & Hari Om¹

Abstract

Clustering is an important approach to data analysis that partitions data according to some (dis)similarity criterion. In recent years, the problem of clustering mixed-type data has attracted many researchers. The k-prototypes algorithm is well known for its scalability in this setting. In this paper, the limitations of the dissimilarity coefficient used in the k-prototypes algorithm are discussed with illustrative examples. We propose a new hybrid dissimilarity coefficient for the k-prototypes algorithm that can be applied to data with numerical, categorical and mixed attributes. Our method retains the scalability of the k-prototypes algorithm, and the dissimilarity functions for both attribute types are defined on the same scale with respect to their dimensionality, which helps improve the clustering result. The efficacy of our method is shown by experiments on real and synthetic data sets.
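
For orientation, the dissimilarity coefficient of the original k-prototypes algorithm (Huang) is commonly written as the squared Euclidean distance over the numerical attributes plus a weight γ times the number of mismatches over the categorical attributes. The sketch below implements only that standard form. It is not the hybrid coefficient proposed in this paper, whose definition the abstract does not give. The function name, the sample values and the choice of γ are illustrative assumptions.

```python
from typing import Sequence


def kprototypes_dissimilarity(
    x_num: Sequence[float],
    x_cat: Sequence[str],
    proto_num: Sequence[float],
    proto_cat: Sequence[str],
    gamma: float,
) -> float:
    """Standard k-prototypes dissimilarity between an object and a prototype.

    Illustrative only: this is the classic form, not the paper's proposal.
    """
    if len(x_num) != len(proto_num) or len(x_cat) != len(proto_cat):
        raise ValueError("object and prototype must have matching attribute counts")
    # Numerical part: squared Euclidean distance (unbounded).
    numeric_part = sum((a - b) ** 2 for a, b in zip(x_num, proto_num))
    # Categorical part: simple matching, one unit per mismatched attribute.
    categorical_part = sum(1 for a, b in zip(x_cat, proto_cat) if a != b)
    return numeric_part + gamma * categorical_part


# Hypothetical objects: same categorical mismatch, different numerical scale.
d_small = kprototypes_dissimilarity(
    [1.0, 2.0], ["red", "S"], [1.5, 2.5], ["blue", "S"], gamma=1.0
)
d_large = kprototypes_dissimilarity(
    [100.0, 200.0], ["red", "S"], [150.0, 250.0], ["blue", "S"], gamma=1.0
)
print(d_small, d_large)
```

In this form the numerical term is unbounded, while the categorical term is at most γ times the number of categorical attributes, so the balance between the two parts depends on how the numerical attributes are scaled and on the chosen γ.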

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Indian Institute of Technology (ISM), Dhanbad, Jharkhand, 826004, India
Ravi Sankar Sangam & Hari Om

Corresponding author

Correspondence to Ravi Sankar Sangam.

About this article

Cite this article

Sangam, R.S., Om, H. An equi-biased k-prototypes algorithm for clustering mixed-type data. Sādhanā 43, 37 (2018). https://doi.org/10.1007/s12046-018-0823-0

Received: 03 June 2014
Revised: 01 February 2018
Accepted: 06 February 2018
Published: 14 March 2018
DOI: https://doi.org/10.1007/s12046-018-0823-0
