A Fast DBSCAN Algorithm with Spark Implementation

Big Data in Engineering Applications

Part of the book series: Studies in Big Data (SBD, volume 44)

Abstract

DBSCAN is a well-known density-based clustering algorithm that can identify arbitrarily shaped clusters and eliminate noise. Parallelizing DBSCAN is challenging because the algorithm has an inherently sequential data access order. Implementations based on MPI or OpenMP also lack fault tolerance and offer no guarantee that the workload is balanced. Moreover, programming with MPI requires data scientists to handle communication between nodes themselves, which is a significant burden. We present a new parallel DBSCAN algorithm using Spark. A kd-tree is applied in our algorithm to reduce search time, and a novel merge approach is used so that no communication between executors is required while partial clusters are generated. Data structures are chosen carefully: a Queue holds the neighbors of a data point, and a Hashtable is used when checking the status of and processing the data points. Other advanced data structures from Spark are also applied to make our implementation more effective. We implement the algorithm in Java and evaluate its scalability using different numbers of processing cores. Our experiments demonstrate that the proposed algorithm scales very well. Using data sets containing up to 1 million high-dimensional points, we show that it achieves speedups of up to 6 using 8 cores (10 k points), 10 using 32 cores (100 k points), and 137 using 512 cores (1 m points). In another experiment using 10 k data points, the algorithm with MapReduce achieves speedups of 1.3 using 2 cores, 2.0 using 4 cores, and 3.2 using 8 cores.
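To make the roles of the Queue and the Hashtable concrete, the following is a minimal single-machine sketch of textbook DBSCAN in Java. It is not the chapter's Spark implementation: the class name `DbscanSketch`, the method names, and the parameters `eps` and `minPts` are assumptions chosen for illustration, the neighbor search is a brute-force scan standing in for the kd-tree, and a `HashMap` plays the part of the Hashtable that tracks each point's status.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Illustrative sketch only; names and structure are assumptions, not the authors' code.
public class DbscanSketch {
    static final int NOISE = -1;

    // Brute-force eps-neighborhood query (includes the point itself).
    // The chapter uses a kd-tree for this step; a linear scan keeps the sketch short.
    static List<Integer> regionQuery(double[][] pts, int p, double eps) {
        List<Integer> out = new ArrayList<>();
        for (int q = 0; q < pts.length; q++) {
            double sum = 0.0;
            for (int d = 0; d < pts[p].length; d++) {
                double diff = pts[p][d] - pts[q][d];
                sum += diff * diff;
            }
            if (sum <= eps * eps) {
                out.add(q);
            }
        }
        return out;
    }

    // Returns a map from point index to cluster id (0, 1, ...) or NOISE.
    static Map<Integer, Integer> cluster(double[][] pts, double eps, int minPts) {
        Map<Integer, Integer> label = new HashMap<>(); // status table: absent = unvisited
        int nextId = 0;
        for (int p = 0; p < pts.length; p++) {
            if (label.containsKey(p)) {
                continue; // already assigned to a cluster or marked as noise
            }
            List<Integer> seeds = regionQuery(pts, p, eps);
            if (seeds.size() < minPts) {
                label.put(p, NOISE); // may be relabeled later as a border point
                continue;
            }
            int id = nextId++;
            label.put(p, id);
            Queue<Integer> queue = new ArrayDeque<>(seeds); // neighbors awaiting expansion
            while (!queue.isEmpty()) {
                int q = queue.poll();
                Integer known = label.get(q);
                if (known != null && known != NOISE) {
                    continue; // already belongs to a cluster
                }
                label.put(q, id);
                if (known != null) {
                    continue; // previously noise: now a border point, not expanded
                }
                List<Integer> nbrs = regionQuery(pts, q, eps);
                if (nbrs.size() >= minPts) {
                    queue.addAll(nbrs); // q is a core point, so keep growing the cluster
                }
            }
        }
        return label;
    }
}
```

Calling `DbscanSketch.cluster(points, eps, minPts)` on an array of coordinate rows returns one label per point index. In a parallel setting, a routine of this kind would presumably run on each partition to produce partial clusters before a merge step, but how the chapter does that is not described in the abstract.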

Acknowledgements

This work is supported in part by the following grants: NSF awards CCF-1409601, IIS-1343639, and CCF-1029166; DOE awards DESC0007456 and DE-SC0014330; AFOSR award FA9550-12-1-0458; NIST award 70NANB14H012. This research used Edison Cray XC30 computer of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

Author information

Corresponding authors

Correspondence to Dianwei Han, Ankit Agrawal, Wei-keng Liao or Alok Choudhary.

Copyright information

© 2018 Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Han, D., Agrawal, A., Liao, Wk., Choudhary, A. (2018). A Fast DBSCAN Algorithm with Spark Implementation. In: Roy, S., Samui, P., Deo, R., Ntalampiras, S. (eds) Big Data in Engineering Applications. Studies in Big Data, vol 44. Springer, Singapore. https://doi.org/10.1007/978-981-10-8476-8_9

  • DOI: https://doi.org/10.1007/978-981-10-8476-8_9

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-8475-1

  • Online ISBN: 978-981-10-8476-8

  • eBook Packages: Engineering, Engineering (R0)
