Frontiers of Computer Science, Volume 8, Issue 1, pp 83–99

MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data

  • Yaobin He (Email author)
  • Haoyu Tan
  • Wuman Luo
  • Shengzhong Feng
  • Jianping Fan
Research Article

Abstract

DBSCAN (density-based spatial clustering of applications with noise) is an important spatial clustering technique that is widely adopted in numerous applications. As datasets have grown extremely large, parallel processing of complex data analysis tasks such as DBSCAN has become indispensable. However, existing parallel DBSCAN algorithms have three major drawbacks. First, they fail to balance the load properly among parallel tasks, especially when the data are heavily skewed. Second, their scalability is limited because not all of the critical sub-procedures are parallelized. Third, most of them are not primarily designed for shared-nothing environments, which makes them less portable to emerging parallel processing paradigms. In this paper, we present MR-DBSCAN, a scalable DBSCAN algorithm using MapReduce. In our algorithm, all the critical sub-procedures are fully parallelized, so there is no performance bottleneck caused by sequential processing. Most importantly, we propose a novel data partitioning method based on computation cost estimation, with the objective of achieving good load balancing even for heavily skewed data. We evaluate MR-DBSCAN on large real-world datasets with up to 1.2 billion points. The experimental results confirm the efficiency and scalability of MR-DBSCAN.
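
For orientation, the following is a minimal sketch of the sequential, textbook DBSCAN procedure that the abstract refers to. It is not MR-DBSCAN and not the authors' code: it runs on a single machine, uses a brute-force neighbor search, and does no partitioning. The names `dbscan`, `region_query`, `eps`, and `min_pts` are illustrative assumptions.

```python
from collections import deque
from math import dist  # Euclidean distance, Python 3.8+

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Textbook sequential DBSCAN (illustrative sketch, brute-force search).

    Returns one label per point: 0, 1, ... for clusters, NOISE for noise.
    """
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        # points[i] is a core point: start a new cluster and expand it.
        labels[i] = cluster
        queue = deque(neighbors)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point
        cluster += 1
    return labels
```

Calling `dbscan([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)], eps=1.5, min_pts=3)` groups the first three points into one cluster, the next three into another, and marks the last point as noise. Each neighbor search here scans every point, so the cost of handling one point grows with the size of the data it is compared against, which is one reason skewed data can unbalance a naive split across parallel workers.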

Keywords

data clustering, parallel algorithm, data mining, load balancing

Copyright information

© Higher Education Press and Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  • Yaobin He (1, 3) (Email author)
  • Haoyu Tan (2)
  • Wuman Luo (2)
  • Shengzhong Feng (1)
  • Jianping Fan (1)

  1. Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
  2. Department of Computer Science, Guangzhou HKUST Fok Ying Tung Research Institute, Hong Kong University of Science and Technology, Hong Kong, China
  3. University of Chinese Academy of Sciences, Beijing, China
