Knowledge and Information Systems, Volume 5, Issue 4, pp 387–415

A General Approach to Clustering in Large Databases with Noise

  • Alexander Hinneburg
  • Daniel A. Keim

Abstract

Several clustering algorithms can be applied to large multimedia databases. The effectiveness and efficiency of the existing algorithms, however, are somewhat limited, because clustering in multimedia databases requires clustering of high-dimensional feature vectors and because such databases often contain large amounts of noise. In this paper, we therefore introduce a new Kernel Density Estimation-based algorithm for clustering in large multimedia databases called DENCLUE (DENsity-based CLUstEring). Kernel Density Estimation (KDE) models the overall point density analytically as the sum of kernel (or influence) functions of the data points. Clusters can then be identified by determining density attractors, and clusters of arbitrary shape can be described compactly by a simple equation of the overall density function. The advantages of our KDE-based DENCLUE approach are: (1) it has a firm mathematical basis; (2) it has good clustering properties in data sets with large amounts of noise; (3) it allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets; and (4) it is significantly faster than existing algorithms. To demonstrate the effectiveness and efficiency of DENCLUE, we perform a series of experiments on a number of different data sets from CAD and molecular biology. A comparison with k-Means, DBSCAN, and BIRCH shows the superiority of our new algorithm.
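To give an intuition for the idea summarized above, the following minimal Python sketch computes the overall density as a sum of Gaussian influence functions and hill-climbs along the density gradient to find a density attractor. It is not the authors' implementation (DENCLUE uses a grid-based data structure, local kernel evaluation, and a noise threshold for efficiency and robustness); the function names, the choice of a Gaussian kernel, and the fixed step size are illustrative assumptions only.

    import numpy as np

    def density(x, data, h=0.5):
        """Overall density at x: sum of Gaussian influence (kernel)
        functions of all data points, with smoothing parameter h."""
        d2 = np.sum((data - x) ** 2, axis=1)
        return float(np.sum(np.exp(-d2 / (2 * h ** 2))))

    def density_gradient(x, data, h=0.5):
        """Gradient of the density function at x (up to a positive constant)."""
        diff = data - x
        w = np.exp(-np.sum(diff ** 2, axis=1) / (2 * h ** 2))
        return (w[:, None] * diff).sum(axis=0) / h ** 2

    def density_attractor(x, data, h=0.5, step=0.1, max_iter=200):
        """Hill-climb from x along the normalized density gradient until the
        density stops increasing; the end point approximates a density attractor."""
        x = np.asarray(x, dtype=float)
        for _ in range(max_iter):
            g = density_gradient(x, data, h)
            norm = np.linalg.norm(g)
            if norm < 1e-12:
                break
            x_new = x + step * g / norm
            if density(x_new, data, h) <= density(x, data, h):
                break
            x = x_new
        return x

    # Toy usage: two well-separated groups of 2-D points.
    rng = np.random.default_rng(0)
    data = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
                      rng.normal(5.0, 0.3, size=(50, 2))])
    print(density_attractor(data[0], data))  # ends near the centre of the first group

In the full algorithm, points whose attractors reach a density above a noise threshold are assigned to the cluster of that attractor, while attractors below the threshold are treated as noise; the sketch above omits this thresholding and the efficiency optimizations.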

Keywords

Clustering algorithms; Clustering of high-dimensional data; Clustering in multimedia databases; Clustering in the presence of noise; Density-based clustering; Kernel Density Estimation

References

  1. Bock HH (1974) Automatic classification. Vandenhoeck and Ruprecht, Göttingen
  2. Cutting DR, Pedersen JO, Karger DR, Tukey JW (1992) A cluster-based approach to browsing large document collections. In Proceedings of the 15th annual international ACM SIGIR conference on research and development in information retrieval. ACM Press, New York, pp 318–329
  3. Daura X, Jaun B, Seebach D, van Gunsteren WF, Mark AE (1998) Reversible peptide folding in solution by molecular dynamics simulation. Journal of Molecular Biology 280: 925–932
  4. Ester M, Kriegel H-P, Xu X (1995) Knowledge discovery in large spatial databases: focusing techniques for efficient class identification. In SSD’95, fourth international symposium on large spatial databases. Lecture Notes in Computer Science 951, Springer, Berlin, pp 67–82
  5. Ester M, Kriegel H-P, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD’96, proceedings of the second international conference on knowledge discovery and data mining. AAAI Press, Menlo Park, CA, pp 226–231
  6. Ester M, Kriegel H-P, Sander J, Xu X (1997) Density-connected sets and their application for trend detection in spatial databases. In Proceedings of the third international conference on knowledge discovery and data mining (KDD-97). AAAI Press, Menlo Park, CA, pp 10–15
  7. Fritzke B (1995) A growing neural gas network learns topologies. In Advances in Neural Information Processing Systems 7. MIT Press, Cambridge, MA, pp 625–632
  8. Fritzke B (1997) The LBG-U method for vector quantization: an improvement over LBG inspired from neural networks. Neural Processing Letters 5(1). Kluwer
  9. Fukunaga K, Hostetler L (1975) The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on Information Theory 21: 32–40
  10. Gersho A, Gray RM (1992) Vector quantization and signal compression. Kluwer, Dordrecht
  11. Hafner JL, Sawhney HS, Equitz W, Flickner M, Niblack W (1995) Efficient color histogram indexing for quadratic form distance functions. IEEE Transactions on Pattern Analysis and Machine Intelligence 17(7): 729–736
  12. Härdle W, Scott DW (1992) Smoothing in low and high dimensions by weighted averaging using rounded points. Computational Statistics 7: 97–128
  13. Hinneburg A, Keim D (1998) An efficient approach to clustering in large multimedia databases with noise. In KDD’98, proceedings of the fourth international conference on knowledge discovery and data mining. AAAI Press, Menlo Park, CA, pp 58–65
  14. Jagadish HV (1991) A retrieval technique for similar shapes. In Proceedings of the 1991 ACM SIGMOD international conference on management of data. ACM Press, New York, pp 208–217
  15. Kukich K (1992) Techniques for automatically correcting words in text. ACM Computing Surveys 24(4): 377–440
  16. Linde Y, Buzo A, Gray R (1980) An algorithm for vector quantizer design. IEEE Transactions on Communications COM-28: 84–95
  17. Martinetz T, Schulten K (1993) A neural gas network learns topologies. Neural Networks 7: 507–522
  18. Mehlhorn K, Näher S (1999) The LEDA platform of combinatorial and geometric computing. Cambridge University Press, Cambridge, UK
  19. Mehrotra R, Gary JE (1995) Feature-index-based similar shape retrieval. In Proceedings of the third IFIP 2.6 working conference on visual database systems (VDB’95), pp 46–65
  20. Nadaraya EA (1965) On nonparametric estimates of density functions and regression curves. Theory of Probability and its Applications 10: 186–190
  21. Ng RT, Han J (1994) Efficient and effective clustering methods for spatial data mining. In VLDB’94, proceedings of the 20th international conference on very large data bases, pp 144–155
  22. Rojas R (1996) Neural networks: a systematic introduction. Springer, Berlin
  23. Sahami M, Yusufali S, Baldonado MQW (1998) A service for organizing networked information autonomously. In Proceedings of the 3rd ACM international conference on digital libraries. ACM Press, New York, pp 200–209
  24. Schikuta E, Erhart M (1997) The BANG-clustering system: grid-based data analysis. In Lecture Notes in Computer Science 1280. Springer, Berlin, pp 513–524
  25. Schnell P (1964) A method to find point-groups. Biometrika 6: 47–48
  26. Schuster EF (1970) Note on the uniform convergence of density estimates. Annals of Mathematical Statistics 41: 1347–1348
  27. Scott DW (1985) Average shifted histograms: effective nonparametric density estimators in several dimensions. Annals of Statistics 13: 1024–1040
  28. Scott DW (1992) Multivariate density estimation. Wiley, New York
  29. Scott DW, Sheather SJ (1985) Kernel density estimation with binned data. Communications in Statistics – Theory and Methods 14: 1353–1359
  30. Sibson R (1973) SLINK: an optimally efficient algorithm for the single-link cluster method. Computer Journal 16(1): 30–34
  31. Silverman BW (1982) Kernel density estimation using the fast Fourier transform. Applied Statistics 31: 93–99
  32. Silverman BW (1986) Density estimation for statistics and data analysis. Chapman & Hall, London
  33. Wallace T, Wintz P (1980) An efficient three-dimensional aircraft recognition algorithm using normalized Fourier descriptors. Computer Graphics and Image Processing 13
  34. Wand MP, Jones MC (1995) Kernel smoothing. Chapman & Hall, London
  35. Wang W, Yang J, Muntz RR (1997) STING: a statistical information grid approach to spatial data mining. In VLDB’97, proceedings of the 23rd international conference on very large data bases, pp 186–195
  36. Weber R, Schek H-J, Blott S (1998) A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Gupta A, Shmueli O, Widom J (eds) VLDB’98, proceedings of the 24th international conference on very large data bases, 24–27 August, New York. Morgan Kaufmann, San Mateo, CA, pp 194–205
  37. Wishart D (1969) A numerical classification method for deriving natural classes. Nature 221: 97–98
  38. Xu X, Ester M, Kriegel H-P, Sander J (1998) A distribution-based clustering algorithm for mining in large spatial databases. In Proceedings of the 14th international conference on data engineering. IEEE Computer Society, Los Alamitos, CA, pp 324–331
  39. Zhang T, Ramakrishnan R, Livny M (1996) BIRCH: an efficient data clustering method for very large databases. In Proceedings of the 1996 ACM SIGMOD international conference on management of data. ACM Press, New York, pp 103–114
  40. Zhang T, Ramakrishnan R, Livny M (1997) BIRCH: a new data clustering algorithm and its applications. Data Mining and Knowledge Discovery 1(2): 141–182
  41. Zhang B, Hsu M, Dayal U (1999a) K-harmonic means: a data clustering algorithm. Technical Report HPL-1999-124, HP Research Labs
  42. Zhang T, Ramakrishnan R, Livny M (1999b) Fast density estimation using CF-kernel for very large databases. In Proceedings of the fifth ACM SIGKDD international conference on knowledge discovery and data mining. ACM Press, New York, pp 312–316

Copyright information

© Springer-Verlag London Limited 2003

Authors and Affiliations

  • Alexander Hinneburg (1)
  • Daniel A. Keim (2)
  1. University of Halle, Institute of Computer Science, Halle (Saale), Germany
  2. AT&T Research Labs and University of Constance, Konstanz, Germany
