Benchmarking in Cluster Analysis: A Study on Spectral Clustering, DBSCAN, and K-Means

  • Conference paper
Data Analysis and Rationality in a Complex World (IFCS 2019)

Abstract

We perform a benchmarking study to identify the advantages and drawbacks of Spectral Clustering and Density-Based Spatial Clustering of Applications with Noise (DBSCAN), and we compare both methods with classic K-means clustering. The methods are applied to five simulated and three real data sets. The clustering results are compared using external and internal indices, as well as run times. Although no single method performs best on all types of data sets, we find that DBSCAN should generally be reserved for non-convex data with well-separated clusters or for data with many outliers. Spectral Clustering has better overall performance than K-means, but its results are less stable and its run time is longer.
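As an illustration of what comparing results with an external index involves, the sketch below computes the adjusted Rand index between a reference partition and a clustering result. The abstract does not name the indices used, so the choice of this index is an assumption; the function name `adjusted_rand_index` and the toy label vectors are hypothetical, and the code is a minimal pure-Python sketch rather than the authors' implementation.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two partitions of the same items.

    Returns 1.0 for identical partitions (up to relabeling) and values
    near 0.0 for agreement no better than chance.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label vectors must have the same length")

    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        # Fewer than two items: the partitions agree trivially.
        return 1.0

    # Contingency counts: how many items share each (true, predicted) pair.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    # Number of item pairs placed together in both, in the reference,
    # and in the predicted partition, respectively.
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / total_pairs
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Degenerate case, e.g. both partitions put everything in one cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


if __name__ == "__main__":
    reference = [0, 0, 1, 1]      # hypothetical ground-truth labels
    relabeled = [1, 1, 0, 0]      # same grouping, different label names
    crossed = [0, 1, 0, 1]        # grouping that splits every true cluster
    print(adjusted_rand_index(reference, relabeled))
    print(adjusted_rand_index(reference, crossed))
```

Because the index depends only on which items are grouped together, it is unaffected by how a method numbers its clusters. Noise points reported by DBSCAN would need a labeling convention (for example, a single extra label) before such a comparison; that convention is likewise an assumption here.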

Notes

  1. Performed on a MacBook Pro 2017 - Processor: 2.3 GHz Intel Core i5; Memory: 8 GB.

  2. Link: https://irene1014.github.io/benchmarking-study-cluster-analysis/.

Corresponding author

Correspondence to Nivedha Murugesan.

Copyright information

© 2021 Springer Nature Switzerland AG

Cite this paper

Murugesan, N., Cho, I., Tortora, C. (2021). Benchmarking in Cluster Analysis: A Study on Spectral Clustering, DBSCAN, and K-Means. In: Chadjipadelis, T., Lausen, B., Markos, A., Lee, T.R., Montanari, A., Nugent, R. (eds) Data Analysis and Rationality in a Complex World. IFCS 2019. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Cham. https://doi.org/10.1007/978-3-030-60104-1_20
