A General Approach to Clustering in Large Databases with Noise

Knowledge and Information Systems

Abstract.

Several clustering algorithms can be applied to clustering in large multimedia databases. The effectiveness and efficiency of the existing algorithms, however, are somewhat limited, both because clustering in multimedia databases requires clustering high-dimensional feature vectors and because multimedia databases often contain large amounts of noise. In this paper, we therefore introduce DENCLUE (DENsity-based CLUstEring), a new algorithm based on Kernel Density Estimation (KDE) for clustering in large multimedia databases. KDE models the overall point density analytically as the sum of the kernel (or influence) functions of the data points. Clusters can then be identified by determining density attractors, and clusters of arbitrary shape can be easily described by a simple equation of the overall density function. The advantages of our KDE-based DENCLUE approach are: (1) it has a firm mathematical basis; (2) it has good clustering properties in data sets with large amounts of noise; (3) it allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets; and (4) it is significantly faster than existing algorithms. To demonstrate the effectiveness and efficiency of DENCLUE, we perform a series of experiments on a number of different data sets from CAD and molecular biology. A comparison with k-Means, DBSCAN, and BIRCH shows the superiority of our new algorithm.
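The idea of a density built as a sum of kernel functions, with clusters found via density attractors, can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a Gaussian kernel with bandwidth `sigma` and a fixed-step gradient ascent that stops as soon as the density no longer increases. The function names, the `step` size, and the sample data in the usage note are hypothetical choices made for illustration only.

```python
import math


def _weight(x, p, sigma):
    """Gaussian influence of data point p at location x (assumed kernel)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, p))
    return math.exp(-sq_dist / (2 * sigma ** 2))


def density(x, points, sigma):
    """Overall density at x: the sum of the influence functions of all points."""
    return sum(_weight(x, p, sigma) for p in points)


def gradient(x, points, sigma):
    """Gradient of the summed Gaussian density at x."""
    grad = [0.0] * len(x)
    for p in points:
        w = _weight(x, p, sigma)
        for i in range(len(x)):
            grad[i] += w * (p[i] - x[i]) / sigma ** 2
    return grad


def find_attractor(x, points, sigma, step=0.05, max_iter=1000):
    """Climb the density from x in fixed-length steps along the gradient.

    Returns the last position before the density stopped increasing,
    together with the density value there (an approximate local maximum).
    """
    x = list(x)
    fx = density(x, points, sigma)
    for _ in range(max_iter):
        g = gradient(x, points, sigma)
        norm = math.sqrt(sum(c * c for c in g))
        if norm == 0.0:  # flat region: nothing to climb
            break
        candidate = [xi + step * gi / norm for xi, gi in zip(x, g)]
        f_candidate = density(candidate, points, sigma)
        if f_candidate <= fx:  # density no longer increases: stop here
            break
        x, fx = candidate, f_candidate
    return x, fx
```

As a usage example, with two tight groups of points around (0, 0) and (5, 5) plus one isolated point, calling `find_attractor` from a start near either group should end close to that group's center with a density near the group size, while the isolated point stays where it is with a density of about 1. Comparing the attractor's density against a minimum-density threshold (a hypothetical parameter in this sketch) is one way to separate clusters from noise.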



Cite this article

Hinneburg, A., Keim, D. A General Approach to Clustering in Large Databases with Noise. Knowledge and Information Systems 5, 387–415 (2003). https://doi.org/10.1007/s10115-003-0086-9


  • DOI: https://doi.org/10.1007/s10115-003-0086-9
