A Fast DBSCAN Algorithm with Spark Implementation

Han, Dianwei; Agrawal, Ankit; Liao, Wei-keng; Choudhary, Alok

doi:10.1007/978-981-10-8476-8_9


Part of the book series: Studies in Big Data (SBD, volume 44)


Abstract

DBSCAN is a well-known density-based clustering algorithm that can identify arbitrarily shaped clusters and eliminate noise. Parallelizing DBSCAN is challenging because the algorithm has an inherently sequential data access order. Implementations based on MPI or OpenMP also lack fault tolerance and offer no guarantee of a balanced workload. Moreover, programming with MPI requires data scientists to handle communication between nodes, which is a considerable burden. We present a new parallel DBSCAN algorithm using Spark. A kd-tree is applied to reduce search time, and a novel merge approach ensures that no communication between executors is required while partial clusters are generated. Data structures are chosen carefully: a queue holds the neighbors of a data point, and a hash table is used to check the status of data points and to process them. Other advanced data structures from Spark are also applied to make the implementation more effective. We implement the algorithm in Java and evaluate its scalability using different numbers of processing cores. Our experiments demonstrate that the proposed algorithm scales very well. Using data sets containing up to 1 million high-dimensional points, we show that it achieves speedups of up to 6 using 8 cores (10 k points), 10 using 32 cores (100 k points), and 137 using 512 cores (1 M points). In another experiment using 10 k data points, the algorithm with MapReduce achieves speedups of 1.3 using 2 cores, 2.0 using 4 cores, and 3.2 using 8 cores.
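
To illustrate the queue and hash-table usage that the abstract mentions, the following is a minimal sequential sketch in Java. It is not the chapter's Spark implementation: the class name DbscanSketch, the method names, and the brute-force regionQuery are assumptions made for this example. The chapter replaces the brute-force search with a kd-tree and merges partial clusters produced by separate executors, and neither step is shown here.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical illustration only; names and structure are not taken from the chapter.
class DbscanSketch {
    static final int NOISE = -1;

    // Brute-force eps-neighborhood query (includes the point itself).
    // Assumption: the chapter uses a kd-tree here to cut search time.
    static List<Integer> regionQuery(double[][] pts, int p, double eps) {
        List<Integer> neighbors = new ArrayList<>();
        for (int q = 0; q < pts.length; q++) {
            double dist2 = 0.0;
            for (int k = 0; k < pts[p].length; k++) {
                double diff = pts[p][k] - pts[q][k];
                dist2 += diff * diff;
            }
            if (dist2 <= eps * eps) {
                neighbors.add(q);
            }
        }
        return neighbors;
    }

    // Returns one label per point: a cluster id (0, 1, ...) or NOISE.
    static int[] cluster(double[][] pts, double eps, int minPts) {
        // Hash table: point index -> cluster id or NOISE; absent means unvisited.
        Map<Integer, Integer> label = new HashMap<>();
        int clusterId = 0;
        for (int p = 0; p < pts.length; p++) {
            if (label.containsKey(p)) {
                continue; // already processed
            }
            List<Integer> seeds = regionQuery(pts, p, eps);
            if (seeds.size() < minPts) {
                label.put(p, NOISE); // may later be relabeled as a border point
                continue;
            }
            label.put(p, clusterId);
            // Queue holding the neighbors that still need to be examined.
            Queue<Integer> queue = new ArrayDeque<>(seeds);
            while (!queue.isEmpty()) {
                int q = queue.poll();
                Integer current = label.get(q);
                if (current != null && current != NOISE) {
                    continue; // already belongs to a cluster
                }
                label.put(q, clusterId);
                if (current != null) {
                    continue; // was noise, so it is a border point: do not expand
                }
                List<Integer> neighbors = regionQuery(pts, q, eps);
                if (neighbors.size() >= minPts) {
                    queue.addAll(neighbors); // q is a core point: expand the cluster
                }
            }
            clusterId++;
        }
        int[] result = new int[pts.length];
        for (int i = 0; i < pts.length; i++) {
            result[i] = label.get(i);
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] pts = {
            {0, 0}, {0, 1}, {1, 0},       // first dense group
            {10, 10}, {10, 11}, {11, 10}, // second dense group
            {50, 50}                      // isolated point
        };
        System.out.println(Arrays.toString(cluster(pts, 1.5, 3)));
    }
}
```

With eps = 1.5 and minPts = 3, the two dense groups in the sample data form separate clusters and the isolated point is labeled as noise. The queue keeps cluster expansion iterative rather than recursive, and the hash table gives constant-time status checks per point, which is the role the abstract attributes to these structures.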



Acknowledgements

This work is supported in part by the following grants: NSF awards CCF-1409601, IIS-1343639, and CCF-1029166; DOE awards DE-SC0007456 and DE-SC0014330; AFOSR award FA9550-12-1-0458; NIST award 70NANB14H012. This research used the Edison Cray XC30 computer of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

Author information

Authors and Affiliations

EECS Department, Northwestern University, Evanston, IL, 60208, USA
Dianwei Han, Ankit Agrawal, Wei-keng Liao & Alok Choudhary


Corresponding authors

Correspondence to Dianwei Han, Ankit Agrawal, Wei-keng Liao or Alok Choudhary.

Editor information

Editors and Affiliations

School of Computing Science and Engineering, Vellore Institute of Technology, Vellore, Tamil Nadu, India
Sanjiban Sekhar Roy
Department of Civil Engineering, National Institute of Technology Patna, Patna, Bihar, India
Pijush Samui
University of Southern Queensland, Springfield, Queensland, Australia
Ravinesh Deo
Polytechnic University of Milan, Milan, Italy
Stavros Ntalampiras


About this chapter

Cite this chapter

Han, D., Agrawal, A., Liao, Wk., Choudhary, A. (2018). A Fast DBSCAN Algorithm with Spark Implementation. In: Roy, S., Samui, P., Deo, R., Ntalampiras, S. (eds) Big Data in Engineering Applications. Studies in Big Data, vol 44. Springer, Singapore. https://doi.org/10.1007/978-981-10-8476-8_9


DOI: https://doi.org/10.1007/978-981-10-8476-8_9
Published: 03 May 2018
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-8475-1
Online ISBN: 978-981-10-8476-8
eBook Packages: Engineering, Engineering (R0)
