
Data Mining and Knowledge Discovery, Volume 7, Issue 2, pp 123–152

Using Self-Similarity to Cluster Large Data Sets

  • Daniel Barbará
  • Ping Chen

Abstract

Clustering is a widely used knowledge discovery technique. It helps uncover structures in data that were not previously known. The clustering of large data sets has received a lot of attention in recent years; however, clustering is still a challenging task, since many published algorithms fail to scale well with the size of the data set and the number of dimensions that describe the points, to find clusters of arbitrary shape, or to deal effectively with the presence of noise. In this paper, we present a new clustering algorithm based on the self-similarity properties of data sets. Self-similarity is the property of being invariant with respect to the scale used to look at the data set. While fractals are self-similar at every scale used to look at them, many data sets exhibit self-similarity over a range of scales. Self-similarity can be measured using the fractal dimension. The new algorithm, which we call Fractal Clustering (FC), places points incrementally in the cluster for which the change in the fractal dimension after adding the point is the least. This is a very natural way of clustering points, since points in the same cluster have a great degree of self-similarity among them (and much less self-similarity with respect to points in other clusters). FC requires only one scan of the data, is incremental, and can be suspended at will, providing the best answer possible at that point. We show via experiments that FC deals effectively with large data sets, high dimensionality, and noise, and is capable of recognizing clusters of arbitrary shape.

clustering, self-similarity, scalability
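The placement rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a box-counting estimate of the fractal dimension (the slope of log N(r) against log(1/r), where N(r) is the number of occupied grid cells of side r), a fixed set of hypothetical grid sizes, and clusters that have already been seeded with some points. It also recomputes the dimension from scratch for every candidate cluster, which is simple to read but does not reflect the single-scan efficiency claimed for FC. The names `box_counting_dimension` and `assign_point` are illustrative only.

```python
import math

# Hypothetical grid sizes (powers of two keep the arithmetic exact).
DEFAULT_SIZES = (0.5, 0.25, 0.125, 0.0625)


def box_counting_dimension(points, sizes=DEFAULT_SIZES):
    """Estimate the box-counting dimension of a point set.

    For each cell side r, count the occupied grid cells N(r), then return
    the least-squares slope of log N(r) against log(1/r).
    """
    xs, ys = [], []
    for r in sizes:
        occupied = {tuple(math.floor(c / r) for c in p) for p in points}
        xs.append(math.log(1.0 / r))
        ys.append(math.log(len(occupied)))
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def assign_point(clusters, point, sizes=DEFAULT_SIZES):
    """Add `point` to the cluster whose estimated dimension changes least.

    `clusters` is a list of non-empty lists of coordinate tuples.
    Returns the index of the chosen cluster (the first one wins a tie).
    """
    best_index, best_change = None, math.inf
    for i, cluster in enumerate(clusters):
        before = box_counting_dimension(cluster, sizes)
        after = box_counting_dimension(cluster + [point], sizes)
        change = abs(after - before)
        if change < best_change:
            best_index, best_change = i, change
    clusters[best_index].append(point)
    return best_index


# Two seeded clusters: a line segment and a filled square.
line = [(i / 64, 0.0) for i in range(64)]
square = [(2 + i / 16, 2 + j / 16) for i in range(16) for j in range(16)]
clusters = [line, square]

on_line = assign_point(clusters, (0.5078125, 0.0))
in_square = assign_point(clusters, (2.53125, 2.53125))
```

In this toy setup the line segment has an estimated dimension close to 1 and the filled square close to 2. A new point that falls into an already occupied cell of a cluster leaves that cluster's estimate unchanged, so it is placed there. The abstract also mentions noise handling, which this sketch does not attempt.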



Copyright information

© Kluwer Academic Publishers 2003

Authors and Affiliations

  • Daniel Barbará (1)
  • Ping Chen (2)
  1. ISE Department, George Mason University, Fairfax, Virginia, USA
  2. Computer and Mathematical Science Department, University of Houston-Downtown, Houston, USA
