The Minimum Code Length for Clustering Using the Gray Code

  • Mahito Sugiyama
  • Akihiro Yamamoto
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6913)

Abstract

We propose new approaches to exploiting compression algorithms for clustering numerical data. Our first contribution is a measure that scores the quality of a given clustering result in the light of a fixed encoding scheme; we call this measure the Minimum Code Length (MCL). Our second contribution is a general strategy for translating any encoding method into a clustering algorithm, which we call COOL (COding-Oriented cLustering). COOL has a low computational cost, since it scales linearly with the size of the data set, and its clustering results are shown to minimize the MCL. To illustrate this approach further, we take the Gray code as the encoding scheme and present G-COOL. G-COOL can find clusters of arbitrary shapes and remove noise. Moreover, it is robust to changes in the input parameters: it requires only two lower bounds, for the number of clusters and the size of each cluster, whereas most algorithms for finding arbitrarily shaped clusters work well only if all parameters are tuned appropriately. G-COOL is theoretically shown to achieve internal cohesion and external isolation, and it is experimentally shown to work well on both synthetic and real data sets.
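The Gray code used by G-COOL has the property that adjacent integers receive codewords differing in exactly one bit, which is what makes it attractive for encoding discretized numerical data. The sketch below is only an illustration of the standard binary-reflected Gray code, not the paper's actual embedding; the `discretize` helper and the 4-bit grid are assumptions made for the example.

```python
def to_gray(n: int) -> int:
    """Binary-reflected Gray code of n: adjacent integers differ in one bit."""
    return n ^ (n >> 1)

def discretize(x: float, bits: int = 4) -> int:
    """Map x in [0, 1) onto a 2**bits grid (illustrative helper, not the
    discretization defined in the paper)."""
    return min(int(x * (1 << bits)), (1 << bits) - 1)

# Nearby points fall into nearby grid cells, and nearby cells get Gray
# codewords with small Hamming distance -- the locality property that a
# coding-oriented clustering scheme can exploit.
codes = [to_gray(discretize(v)) for v in (0.10, 0.15, 0.90)]
```

For instance, grid levels 1 and 2 (codes `0b0001` and `0b0011`) differ in a single bit, whereas in plain binary they would differ in two.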

Keywords

Clustering · Compression · Discretization · Gray code



Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Mahito Sugiyama (1, 2)
  • Akihiro Yamamoto (1)
  1. Graduate School of Informatics, Kyoto University, Kyoto, Japan
  2. Research Fellow of the Japan Society for the Promotion of Science, Japan
