# Using Self-Similarity to Cluster Large Data Sets

## Abstract

Clustering is a widely used knowledge discovery technique. It helps uncover structures in data that were not previously known. The clustering of large data sets has received a lot of attention in recent years; however, clustering is still a challenging task, since many published algorithms fail to scale well with the size of the data set and the number of dimensions that describe the points, to find clusters of arbitrary shape, or to deal effectively with the presence of noise. In this paper, we present a new clustering algorithm based on the self-similarity properties of the data sets. Self-similarity is the property of being invariant with respect to the scale used to look at the data set. While fractals are self-similar at every scale used to look at them, many data sets exhibit self-similarity over a range of scales. Self-similarity can be measured using the fractal dimension. The new algorithm, which we call Fractal Clustering (FC), places points incrementally in the cluster for which the change in the fractal dimension after adding the point is the least. This is a very natural way of clustering points, since points in the same cluster have a great degree of self-similarity among them (and much less self-similarity with respect to points in other clusters). FC requires one scan of the data, is suspendable at will, providing the best answer possible at that point, and is incremental. We show via experiments that FC effectively deals with large data sets, high dimensionality, and noise, and is capable of recognizing clusters of arbitrary shape.
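The assignment rule described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a box-counting estimate of the fractal dimension (one of several possible estimators), assumes the points have been scaled into the unit hypercube, and recomputes the dimension from scratch for each candidate cluster rather than maintaining incremental box counts. The function names `box_counting_dimension` and `assign_point`, the grid resolutions, and the toy data are all hypothetical choices made for illustration.

```python
import numpy as np


def box_counting_dimension(points, scales=(2, 4, 8, 16, 32)):
    """Estimate the box-counting dimension of points in the unit hypercube.

    For each resolution r (grid cells per axis) we count the occupied
    cells N(r); the estimate is the slope of log N(r) against log r.
    """
    pts = np.asarray(points, dtype=float)
    counts = []
    for r in scales:
        # Map every point to the integer index of the grid cell holding it.
        cells = np.minimum((pts * r).astype(int), r - 1)
        counts.append(len(np.unique(cells, axis=0)))
    slope, _ = np.polyfit(np.log(scales), np.log(counts), 1)
    return slope


def assign_point(clusters, point):
    """Return the index of the cluster whose estimated dimension
    changes least when `point` is added to it (assumed FC-style rule)."""
    changes = []
    for members in clusters:
        before = box_counting_dimension(members)
        after = box_counting_dimension(np.vstack([members, point]))
        changes.append(abs(after - before))
    return int(np.argmin(changes))


# Toy data (illustrative only): a line segment and a filled square.
xs = np.linspace(0.0, 1.0, 1000, endpoint=False)
line = np.column_stack([xs, np.full_like(xs, 0.25)])

g = np.linspace(0.5, 1.0, 50, endpoint=False)
gx, gy = np.meshgrid(g, g)
square = np.column_stack([gx.ravel(), gy.ravel()])

clusters = [line, square]
on_line = assign_point(clusters, np.array([0.3, 0.25]))
in_square = assign_point(clusters, np.array([0.75, 0.75]))
print(box_counting_dimension(line), box_counting_dimension(square))
print(on_line, in_square)
```

In this toy setting the line segment has an estimated dimension close to 1 and the filled square close to 2. A new point lying on the line leaves the line's box counts unchanged but adds an occupied cell to the square at every scale, so the rule places it with the line, and the reverse holds for a point inside the square.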
