SimScience 2017: Simulation Science, pp 251-271

Elephant Against Goliath: Performance of Big Data Versus High-Performance Computing DBSCAN Clustering Implementations

  • Helmut Neukirchen
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 889)


Data is often mined using clustering algorithms such as Density-Based Spatial Clustering of Applications with Noise (DBSCAN). However, clustering is computationally expensive, so big data requires parallel processing. The two prevalent paradigms for parallel processing are High-Performance Computing (HPC), based on the Message Passing Interface (MPI) or Open Multi-Processing (OpenMP), and the newer big data frameworks such as Apache Spark or Hadoop. This paper surveys publicly available implementations from both paradigms that aim to parallelize DBSCAN and compares their performance. The results show that the big data implementations are not yet mature and that, particularly for skewed data, an implementation's decomposition of the input data into parallel tasks has a huge influence on run-time because of load imbalance.
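The load-imbalance effect mentioned above can be illustrated with a minimal Python sketch. It is not taken from any of the surveyed implementations: the equal-width grid decomposition, the function names `cell_counts` and `imbalance`, and the sample points are all assumptions chosen for illustration. Each grid cell stands in for one parallel task, and the ratio of the largest task to the mean task size gives a rough measure of imbalance.

```python
from collections import Counter


def cell_counts(points, cell_width):
    """Count points per equal-width grid cell (one non-empty cell = one task)."""
    counts = Counter()
    for x, y in points:
        counts[(int(x // cell_width), int(y // cell_width))] += 1
    return counts


def imbalance(counts):
    """Largest task size divided by mean task size; 1.0 means perfectly balanced."""
    sizes = list(counts.values())
    return max(sizes) / (sum(sizes) / len(sizes))


# Hypothetical data: four evenly spread points vs. four points
# of which three crowd into a single cell.
uniform = [(0.5, 0.5), (1.5, 0.5), (0.5, 1.5), (1.5, 1.5)]
skewed = [(0.1, 0.1), (0.2, 0.2), (0.3, 0.3), (1.5, 1.5)]

uniform_ratio = imbalance(cell_counts(uniform, 1.0))
skewed_ratio = imbalance(cell_counts(skewed, 1.0))
print(uniform_ratio, skewed_ratio)
```

With a fixed cell width, the skewed sample leaves one task with three times the work of the other, so the slowest task dominates the overall run-time. A decomposition that adapts cell boundaries to the point density would be one way to reduce this ratio.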


Keywords: Density-Based Spatial Clustering of Applications with Noise (DBSCAN); Big Data; Apache Spark; OpenMP; Resilient Distributed Datasets (RDD)



The author would like to thank all those who provided implementations of their DBSCAN algorithms. Special thanks go to the Federated Systems and Data division at the Jülich Supercomputing Centre (JSC), in particular to the research group High Productivity Data Processing and its head, Morris Riedel. The author gratefully acknowledges the computing time granted on the supercomputer JUDGE at JSC.

The HPDBSCAN implementation of DBSCAN will be used as a pilot application in the research project DEEP-EST (Dynamical Exascale Entry Platform – Extreme Scale Technologies), which receives funding from the European Union Horizon 2020 Framework Programme for Research and Innovation (2014–2020) under Grant Agreement number 754304.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Faculty of Industrial Engineering, Mechanical Engineering and Computer Science, School of Engineering and Natural Sciences, University of Iceland, Reykjavik, Iceland
