Automated Determination of the Input Parameter of DBSCAN Based on Outlier Detection

Conference paper
Part of the IFIP Advances in Information and Communication Technology book series (IFIPAICT, volume 475)

Abstract

Over the last two decades, DBSCAN (Density-Based Spatial Clustering of Applications with Noise) has been one of the most widely used clustering algorithms and is highly cited in the scientific literature. Despite its strengths, DBSCAN has a shortcoming in parameter determination: the parameter is chosen interactively by the user, with the help of a graphical representation of the data. This paper introduces a simple and effective method for determining the input parameter of DBSCAN automatically. The idea is based on a statistical technique for outlier detection, namely the empirical rule. This work also suggests a more accurate method for detecting clusters that lie close to each other. Experimental comparisons with the original method, together with a time complexity equal to that of the original algorithm, indicate that the proposed method determines the input parameter of DBSCAN automatically, quite reliably and efficiently.
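
To make the idea concrete, the following is a minimal sketch of how the empirical rule could be used to pick a neighborhood radius. It is not the paper's procedure, which the abstract does not spell out. It assumes the parameter in question is the radius commonly called Eps, that each point's distance to its k-th nearest neighbor is the quantity being screened, and that a cutoff of three standard deviations above the mean marks outliers. The function names `kth_neighbor_distances`, `empirical_rule_cutoff`, and `suggest_eps`, and the defaults `k=4` and `n_sigma=3.0`, are hypothetical.

```python
import math
import statistics


def kth_neighbor_distances(points, k):
    """Distance from each point to its k-th nearest neighbor (brute force)."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return result


def empirical_rule_cutoff(values, n_sigma=3.0):
    """Upper cutoff mean + n_sigma * std.

    The empirical rule states that roughly 99.7% of approximately normal
    data lies within 3 standard deviations of the mean, so values above
    this cutoff are treated as outliers.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return mean + n_sigma * std


def suggest_eps(points, k=4, n_sigma=3.0):
    """Hypothetical helper: largest k-distance not flagged as an outlier."""
    kdist = kth_neighbor_distances(points, k)
    cutoff = empirical_rule_cutoff(kdist, n_sigma)
    inliers = [d for d in kdist if d <= cutoff]
    return max(inliers)


if __name__ == "__main__":
    # A 5x5 unit grid (one dense cluster) plus a single distant noise point.
    points = [(float(x), float(y)) for x in range(5) for y in range(5)]
    points.append((100.0, 100.0))
    print("suggested Eps:", suggest_eps(points, k=4))
```

The brute-force neighbor search is quadratic in the number of points and is used here only to keep the sketch self-contained; a spatial index would normally replace it. The empirical rule presumes roughly bell-shaped data, so the cutoff is a heuristic when the k-distances are strongly skewed.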

Keywords

Clustering, DBSCAN, Empirical rule, Machine learning, Outlier detection, Parameter determination, Unsupervised learning

Copyright information

© IFIP International Federation for Information Processing 2016

Authors and Affiliations

  1. Institute for Computer Science and Business Information Systems (ICB), University of Duisburg-Essen, Essen, Germany
