A General Approach to Clustering in Large Databases with Noise

Hinneburg, Alexander; Keim, Daniel A.

doi:10.1007/s10115-003-0086-9

Published: November 2003

Knowledge and Information Systems, Volume 5, pages 387–415 (2003)

Abstract.

Several clustering algorithms can be applied to clustering in large multimedia databases. The effectiveness and efficiency of the existing algorithms, however, are somewhat limited, since clustering in multimedia databases requires clustering of high-dimensional feature vectors and because multimedia databases often contain large amounts of noise. In this paper, we therefore introduce a new Kernel Density Estimation-based algorithm for clustering in large multimedia databases called DENCLUE (DENsity-based CLUstEring). Kernel Density Estimation (KDE) models the overall point density analytically as the sum of kernel (or influence) functions of the data points. Clusters can then be identified by determining density attractors, and clusters of arbitrary shape can be easily described by a simple equation of the overall density function. The advantages of our KDE-based DENCLUE approach are: (1) it has a firm mathematical basis; (2) it has good clustering properties in data sets with large amounts of noise; (3) it allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets; and (4) it is significantly faster than existing algorithms. To demonstrate the effectiveness and efficiency of DENCLUE, we perform a series of experiments on a number of different data sets from CAD and molecular biology. A comparison with k-Means, DBSCAN, and BIRCH shows the superiority of our new algorithm.
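
The idea of a density built from kernel functions, with clusters found at density attractors, can be illustrated with a small sketch. The code below is not the DENCLUE implementation from the paper. It assumes a Gaussian kernel, uses a mean-shift style fixed-point update as the hill-climbing step, and only groups points whose attractors nearly coincide, so it does not cover the merging needed for arbitrarily shaped clusters. The names `density`, `climb`, `cluster`, and the parameters `sigma`, `xi`, and `merge_dist` are hypothetical and chosen for illustration.

```python
import math


def density(x, points, sigma):
    """Overall density at x: the sum of Gaussian kernels centred on each data point."""
    return sum(
        math.exp(-sum((xi - pi) ** 2 for xi, pi in zip(x, p)) / (2 * sigma ** 2))
        for p in points
    )


def climb(x, points, sigma, tol=1e-6, max_iter=200):
    """Hill-climb from x to a local density maximum (a density attractor).

    Assumption: uses the kernel-weighted mean as the update step, which may
    differ from the step rule used in the paper.
    """
    x = tuple(x)
    for _ in range(max_iter):
        weights = [
            math.exp(-sum((xi - pi) ** 2 for xi, pi in zip(x, p)) / (2 * sigma ** 2))
            for p in points
        ]
        total = sum(weights)
        new_x = tuple(
            sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(len(x))
        )
        if math.dist(x, new_x) < tol:
            return new_x
        x = new_x
    return x


def cluster(points, sigma, xi, merge_dist):
    """Label each point by its attractor; attractors with density below xi are noise (-1)."""
    attractors, labels = [], []
    for p in points:
        a = climb(p, points, sigma)
        if density(a, points, sigma) < xi:
            labels.append(-1)  # attractor is not significant: treat the point as noise
            continue
        for idx, b in enumerate(attractors):
            if math.dist(a, b) < merge_dist:
                labels.append(idx)  # same attractor as an earlier point
                break
        else:
            attractors.append(a)
            labels.append(len(attractors) - 1)
    return labels, attractors


# Two tight groups of four points each, plus one isolated point.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
    (10.0, -10.0),
]
labels, attractors = cluster(points, sigma=0.5, xi=2.0, merge_dist=0.5)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

In this toy data set the isolated point reaches an attractor whose density is roughly 1, below the assumed threshold `xi=2.0`, so it is labelled as noise, while each group of four converges to an attractor near its centre.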

Author information

Authors and Affiliations

University of Halle, Institute of Computer Science, Halle (Saale), Germany
Alexander Hinneburg
AT&T, Research Labs and University of Constance, Konstanz, Germany
Daniel A. Keim

About this article

Cite this article

Hinneburg, A., Keim, D. A General Approach to Clustering in Large Databases with Noise. Knowledge and Information Systems 5, 387–415 (2003). https://doi.org/10.1007/s10115-003-0086-9

Received: 21 March 2001
Revised: 12 October 2001
Accepted: 24 January 2002
Issue Date: November 2003
DOI: https://doi.org/10.1007/s10115-003-0086-9

Keywords

Clustering algorithms; Clustering of high-dimensional data; Clustering in multimedia databases; Clustering in the presence of noise; Density-based clustering; Kernel Density Estimation
