Density-Based Multiscale Analysis for Clustering in Strong Noise Settings

  • Tiantian Zhang
  • Bo Yuan
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10400)


Abstract

Finding clustering patterns in data is challenging when clusters can be of arbitrary shapes and the data contains a high percentage (e.g., 80%) of noise. This paper presents a novel technique named density-based multiscale analysis for clustering (DBMAC) that can conduct noise-robust clustering without any strict assumption on the shapes of clusters. Firstly, DBMAC calculates the r-neighborhood statistics with different r (radius) values. Next, instead of trying to find a single optimal r value, a set of radius values appropriate for separating “clustered” objects from “noisy” objects is identified, using a formal statistical multimodality test. Finally, the classical DBSCAN is employed to perform clustering on the subset of data containing significantly less noise. Experimental results confirm that DBMAC is superior to classical DBSCAN in strong noise settings and also outperforms the latest technique SkinnyDip when the data contains arbitrarily shaped clusters.


Keywords

Multiscale analysis · Density-based clustering · Statistical test
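The three-stage pipeline described in the abstract can be sketched in code. This is an illustrative reconstruction, not the authors' implementation: the function names and all parameter values are assumptions, and a crude mean-based split on the neighborhood counts stands in for the formal statistical multimodality test used in the paper to select suitable radii.

```python
# Hypothetical sketch of the DBMAC pipeline from the abstract.
# The mean-based "clustered vs. noisy" split below is a stand-in for
# the paper's formal multimodality test; all parameters are illustrative.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors


def r_neighborhood_counts(X, r):
    """Step 1: count objects within radius r of each object (itself excluded)."""
    nn = NearestNeighbors(radius=r).fit(X)
    neighborhoods = nn.radius_neighbors(X, return_distance=False)
    return np.array([len(idx) - 1 for idx in neighborhoods])


def dbmac_sketch(X, radii, min_samples=5):
    # Step 2: keep objects whose neighborhood counts fall on the dense
    # ("clustered") side for every candidate radius.
    keep = np.ones(len(X), dtype=bool)
    for r in radii:
        counts = r_neighborhood_counts(X, r)
        keep &= counts > counts.mean()  # crude split; paper uses a formal test
    # Step 3: run classical DBSCAN on the denoised subset; objects
    # filtered out in step 2 are reported as noise (label -1).
    labels = np.full(len(X), -1)
    if keep.any():
        labels[keep] = DBSCAN(eps=max(radii),
                              min_samples=min_samples).fit_predict(X[keep])
    return labels
```

Because the noise filtering happens before DBSCAN runs, the density-based clustering step only has to separate the surviving dense objects, which is why the approach can tolerate far higher noise ratios than applying DBSCAN directly.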


References

  1. Jain, A.K., Dubes, R.C.: Algorithms for Clustering Data. Prentice Hall Advanced Reference Series: Computer Science. Prentice Hall (1988)
  2. Do, C.B., Batzoglou, S.: What is the expectation maximization algorithm? Nat. Biotechnol. 26(8), 897–899 (2008)
  3. Zelnik-Manor, L., Perona, P.: Self-tuning spectral clustering. Adv. Neural Inf. Process. Syst. 17, 1601–1608 (2004)
  4. Ben-David, S., Haghtalab, N.: Clustering in the presence of background noise. In: Proceedings of the 31st International Conference on Machine Learning, vol. 32, pp. 280–288 (2014)
  5. Murtagh, F., Raftery, A.E.: Fitting straight lines to point patterns. Pattern Recogn. 17(5), 479–483 (1984)
  6. Banfield, J.D., Raftery, A.E.: Model-based Gaussian and non-Gaussian clustering. Biometrics 49(3), 803–821 (1993)
  7. Dave, R.N.: Characterization and detection of noise in clustering. Pattern Recogn. Lett. 12(11), 657–664 (1991)
  8. Cuesta-Albertos, J.A., Gordaliza, A., Matran, C.: Trimmed k-means: an attempt to robustify quantizers. Ann. Stat. 25(2), 553–576 (1997)
  9. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, vol. 96, no. 34, pp. 226–231 (1996)
  10. Ertöz, L., Steinbach, M., Kumar, V.: Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data. In: Proceedings of the 3rd SIAM International Conference on Data Mining, vol. 112, pp. 47–58 (2003)
  11. Böhm, C., Plant, C., Shao, J., Yang, Q.: Clustering by synchronization. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 583–592 (2010)
  12. Goebl, S., He, X., Plant, C., Böhm, C.: Finding the optimal subspace for clustering. In: IEEE International Conference on Data Mining, pp. 130–139 (2014)
  13. Dasgupta, A., Raftery, A.E.: Detecting features in spatial point processes with clutter via model-based clustering. J. Am. Stat. Assoc. Theory Methods 93(441), 294–302 (1998)
  14. Wong, W.K., Moore, A.: Efficient algorithms for non-parametric clustering with clutter. In: Proceedings of the 34th Interface Symposium, vol. 34, pp. 541–553 (2002)
  15. Cuevas, A., Febrero, M., Fraiman, R.: Estimating the number of clusters. Can. J. Stat. 28(2), 367–382 (2000)
  16. Li, J., Huang, X., Selke, C., Yong, J.: A fast algorithm for finding correlation clusters in noise data. In: Zhou, Z.-H., Li, H., Yang, Q. (eds.) PAKDD 2007. LNCS, vol. 4426, pp. 639–647. Springer, Heidelberg (2007). doi: 10.1007/978-3-540-71701-0_68
  17. Maurus, S., Plant, C.: Skinny-dip: clustering in a sea of noise. In: Proceedings of the 22nd International Conference on Knowledge Discovery and Data Mining, pp. 1055–1064 (2016)
  18. Hartigan, J.A., Hartigan, P.M.: The dip test of unimodality. Ann. Stat. 13(1), 70–84 (1985)
  19. Strehl, A., Ghosh, J.: Cluster ensembles - a knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res. 3, 583–617 (2002)
  20. Vinh, N.X., Epps, J., Bailey, J.: Information theoretic measures for clusterings comparison: is a correction for chance necessary? In: Proceedings of the 26th Annual International Conference on Machine Learning, pp. 1073–1080 (2009)
  21. Guha, S., Rastogi, R., Shim, K.: CURE: an efficient clustering algorithm for large databases. ACM SIGMOD Rec. 27(2), 73–84 (1998)
  22. Chaoji, V., Al Hasan, M., Salem, S., Zaki, M.J.: SPARCL: efficient and effective shape-based clustering. In: IEEE International Conference on Data Mining, pp. 93–102 (2008)

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Intelligent Computing Lab, Division of Informatics, Graduate School at Shenzhen, Tsinghua University, Shenzhen, People’s Republic of China
