Data Mining and Knowledge Discovery, Volume 2, Issue 3, pp 283–304

Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values

  • Zhexue Huang

Abstract

The k-means algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values prohibits it from being used to cluster real world data containing categorical values. In this paper we present two algorithms which extend the k-means algorithm to categorical domains and domains with mixed numeric and categorical values. The k-modes algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update modes in the clustering process to minimise the clustering cost function. With these extensions the k-modes algorithm enables the clustering of categorical data in a fashion similar to k-means. The k-prototypes algorithm, through the definition of a combined dissimilarity measure, further integrates the k-means and k-modes algorithms to allow for clustering objects described by mixed numeric and categorical attributes. We use the well known soybean disease and credit approval data sets to demonstrate the clustering performance of the two algorithms. Our experiments on two real world data sets with half a million objects each show that the two algorithms are efficient when clustering large data sets, which is critical to data mining applications.

Keywords: data mining, cluster analysis, clustering algorithms, categorical data
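To make the abstract's description concrete, the sketch below illustrates the two building blocks it names: the simple matching dissimilarity with a frequency-based mode update used by k-modes, and a combined numeric-plus-categorical measure in the spirit of k-prototypes, where gamma plays the role of the weight that balances the numeric and categorical parts. This is a minimal illustrative sketch under stated assumptions, not the paper's implementation: the function names, the batch-style assign-then-update loop, and the toy data are assumptions introduced here for illustration, whereas the paper updates modes with a frequency-based method during the clustering process.

```python
from collections import Counter
import random

def matching_dissimilarity(a, b):
    """Simple matching dissimilarity: number of attributes whose categories differ."""
    return sum(ai != bi for ai, bi in zip(a, b))

def frequency_based_mode(cluster):
    """Mode of a cluster: for each attribute, its most frequent category."""
    return tuple(Counter(obj[j] for obj in cluster).most_common(1)[0][0]
                 for j in range(len(cluster[0])))

def k_modes(objects, k, max_iter=100, seed=0):
    """Batch-style k-modes sketch: assign each object to its nearest mode,
    then recompute modes, until the modes stop changing."""
    rng = random.Random(seed)
    modes = rng.sample(objects, k)  # initial modes drawn from the data
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for obj in objects:
            nearest = min(range(k),
                          key=lambda c: matching_dissimilarity(obj, modes[c]))
            clusters[nearest].append(obj)
        new_modes = [frequency_based_mode(c) if c else modes[i]
                     for i, c in enumerate(clusters)]
        if new_modes == modes:
            break
        modes = new_modes
    return modes, clusters

def mixed_dissimilarity(num_x, cat_x, num_proto, cat_proto, gamma):
    """k-prototypes-style combined measure: squared Euclidean distance on the
    numeric attributes plus gamma times simple matching on the categorical ones."""
    numeric_part = sum((a - b) ** 2 for a, b in zip(num_x, num_proto))
    return numeric_part + gamma * matching_dissimilarity(cat_x, cat_proto)

# Tiny usage example on four purely categorical objects (hypothetical data):
objects = [("red", "round"), ("red", "oval"), ("blue", "square"), ("blue", "round")]
modes, clusters = k_modes(objects, k=2)
```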



Copyright information

© Kluwer Academic Publishers 1998

Authors and Affiliations

  • Zhexue Huang, ACSys CRC, CSIRO Mathematical and Information Sciences, Canberra, Australia
