Benchmarking in Cluster Analysis: A Study on Spectral Clustering, DBSCAN, and K-Means

Murugesan, Nivedha; Cho, Irene; Tortora, Cristina

doi:10.1007/978-3-030-60104-1_20

Part of the book series: Studies in Classification, Data Analysis, and Knowledge Organization (STUDIES CLASS)

Included in the following conference series:

Conference of the International Federation of Classification Societies

Abstract

We perform a benchmarking study to identify the advantages and drawbacks of Spectral Clustering and Density-Based Spatial Clustering of Applications with Noise (DBSCAN), and we compare both methods with classic K-means clustering. The methods are applied to five simulated and three real data sets. The clustering results are compared using external and internal indices, as well as run times. Although no single method performs best on all types of data sets, we find that DBSCAN should generally be reserved for non-convex data with well-separated clusters or for data with many outliers. Spectral Clustering has better overall performance, but its results are less stable than those of K-means and its run time is longer.
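
To illustrate what an external index measures, the following is a minimal Python sketch of the adjusted Rand index of Hubert and Arabie (1985), together with a small timing helper. It is not the authors' code: the abstract does not state which external index was used for each comparison, and the names `adjusted_rand_index` and `timed` are assumptions made for this example.

```python
from collections import Counter
from math import comb
import time


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two partitions of the same points."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    n = len(labels_true)
    if n < 2:
        # No pairs of points to compare; treat as perfect agreement.
        return 1.0
    # Contingency counts n_ij and the row and column marginals.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Degenerate case, e.g. both partitions put every point in one cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def timed(cluster_fn, data):
    """Run a clustering function (hypothetical interface: data -> labels)
    and return its labels with the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    labels = cluster_fn(data)
    return labels, time.perf_counter() - start


if __name__ == "__main__":
    truth = [0, 0, 0, 1, 1, 1]
    relabeled = [1, 1, 1, 0, 0, 0]  # same partition, different label names
    print(adjusted_rand_index(truth, relabeled))  # 1.0
```

The index depends only on which points share a cluster, so permuting the label names leaves it unchanged. When a method such as DBSCAN marks points as noise, one option (an assumption here, not a choice taken from the chapter) is to give the noise points their own label before computing the index.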

Notes

1. Performed on a MacBook Pro 2017 (processor: 2.3 GHz Intel Core i5; memory: 8 GB).
2. Link: https://irene1014.github.io/benchmarking-study-cluster-analysis/.

Author information

Authors and Affiliations

Department of Mathematics and Statistics, San José State University, One Washington Square, San José, CA, 95192, USA
Nivedha Murugesan, Irene Cho & Cristina Tortora

Corresponding author

Correspondence to Nivedha Murugesan.

Editor information

Editors and Affiliations

Department of Political Sciences, Aristotle University of Thessaloniki, Thessaloniki, Greece
Theodore Chadjipadelis
Department of Mathematical Sciences, University of Essex, Colchester, UK
Berthold Lausen
School of Education, Democritus University of Thrace, Alexandroupolis, Greece
Angelos Markos
Department of Data Science and Statistics, Korea National Open University, Seoul, Korea (Republic of)
Tae Rim Lee
Department of Statistical Sciences “Paolo Fortunati”, University of Bologna, Bologna, Italy
Angela Montanari
Department of Statistics and Data Science, Carnegie Mellon University, Pittsburgh, PA, USA
Rebecca Nugent

About this paper

Cite this paper

Murugesan, N., Cho, I., Tortora, C. (2021). Benchmarking in Cluster Analysis: A Study on Spectral Clustering, DBSCAN, and K-Means. In: Chadjipadelis, T., Lausen, B., Markos, A., Lee, T.R., Montanari, A., Nugent, R. (eds) Data Analysis and Rationality in a Complex World. IFCS 2019. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Cham. https://doi.org/10.1007/978-3-030-60104-1_20

DOI: https://doi.org/10.1007/978-3-030-60104-1_20
Published: 15 February 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60103-4
Online ISBN: 978-3-030-60104-1
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)