Data Mining and Knowledge Discovery, Volume 1, Issue 2, pp 141–182

BIRCH: A New Data Clustering Algorithm and Its Applications

  • Tian Zhang
  • Raghu Ramakrishnan
  • Miron Livny
Article

Abstract

Data clustering is an important technique for exploratory data analysis and has been studied for several years. It has proven useful in many practical domains, such as data classification and image processing. Recently, there has been a growing emphasis on exploratory analysis of very large datasets to discover useful patterns and/or correlations among attributes. This is called data mining, and data clustering is regarded as one of its branches. However, existing data clustering methods do not adequately address the problem of processing large datasets with limited resources (e.g., memory and CPU cycles). As the dataset size increases, they therefore do not scale well in terms of memory requirements, running time, and result quality.

In this paper, an efficient and scalable data clustering method is proposed, based on a new data structure called the CF-tree, which serves as an in-memory summary of the data distribution. We have implemented it in a system called BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) and studied its performance extensively in terms of memory requirements, running time, clustering quality, stability, and scalability; we also compare it with other available methods. Finally, BIRCH is applied to two real-life problems: building an iterative and interactive pixel classification tool, and generating the initial codebook for image compression.
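The abstract does not say what the CF-tree stores. As a rough sketch only, assuming each tree entry holds a clustering feature made of a point count, a per-dimension linear sum, and a sum of squared norms, the Python below shows why such a summary can be kept in memory and updated incrementally: two summaries merge by simple addition, and the centroid and radius can be read off without revisiting the raw points. The class name `ClusteringFeature` and the helper `summarize` are hypothetical names chosen for illustration. The tree itself (branching, thresholds, rebuilding) is not modeled.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List, Sequence


@dataclass
class ClusteringFeature:
    """Illustrative summary of a set of points: count, linear sum, squared-norm sum."""
    n: int
    linear_sum: List[float]
    square_sum: float

    @classmethod
    def from_point(cls, point: Sequence[float]) -> "ClusteringFeature":
        # A single point is summarized by itself and its squared norm.
        return cls(1, list(point), sum(x * x for x in point))

    def merge(self, other: "ClusteringFeature") -> "ClusteringFeature":
        # Summaries are additive, so merging never needs the raw points.
        return ClusteringFeature(
            self.n + other.n,
            [a + b for a, b in zip(self.linear_sum, other.linear_sum)],
            self.square_sum + other.square_sum,
        )

    def centroid(self) -> List[float]:
        return [s / self.n for s in self.linear_sum]

    def radius(self) -> float:
        # Root mean squared distance from the points to the centroid,
        # computed as sqrt(SS / N - ||LS / N||^2).
        centroid_norm_sq = sum(c * c for c in self.centroid())
        return sqrt(max(self.square_sum / self.n - centroid_norm_sq, 0.0))


def summarize(points: Sequence[Sequence[float]]) -> ClusteringFeature:
    """Fold a non-empty sequence of points into one summary, one point at a time."""
    cf = ClusteringFeature.from_point(points[0])
    for p in points[1:]:
        cf = cf.merge(ClusteringFeature.from_point(p))
    return cf
```

Because the summary has a fixed size per entry, its memory cost depends on the number of entries kept rather than on the number of points absorbed, which is the property a limited-memory method needs.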

Keywords: Very Large Databases; Data Clustering; Incremental Algorithm; Data Classification and Compression

Copyright information

© Kluwer Academic Publishers 1997

Authors and Affiliations

  • Tian Zhang (1)
  • Raghu Ramakrishnan (1)
  • Miron Livny (1)
  1. Computer Sciences Department, University of Wisconsin, Madison, U.S.A.
