Big Data Analytics in Java with PCJ Library: Performance Comparison with Hadoop

  • Marek Nowicki
  • Magdalena Ryczkowska
  • Łukasz Górski
  • Piotr Bała
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10778)


This article presents Big Data analytics using Java and the PCJ library. PCJ is an award-winning library for developing parallel codes in the PGAS programming paradigm. It allows easy implementation of a variety of algorithms, including those used for Big Data processing. In this paper, we present performance results for standard benchmarks covering different types of applications, from computationally intensive, through traditional map-reduce, to communication-intensive workloads. The performance is compared with that achieved on the same hardware using Hadoop. The PCJ implementation has been used with both the local file system and HDFS. Code written with PCJ can be developed much faster, as it requires fewer libraries. Our results show that applications developed with the PCJ library are much faster than the corresponding Hadoop implementations.


Keywords: Big Data · Java · Parallel computing · Hadoop
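The "traditional map-reduce" workload mentioned in the abstract is typically a word-count benchmark: tokenize text (map), then sum per-word occurrences (reduce). A minimal sequential sketch in plain Java is shown below; this is an illustration of the benchmark itself, not the paper's PCJ or Hadoop implementation, and the class and method names are chosen for this example only.

```java
import java.util.Map;
import java.util.TreeMap;

/** Sequential sketch of the word-count benchmark: map (tokenize) + reduce (sum counts). */
public class WordCount {
    /** Returns a word -> occurrence-count map for the given text. */
    public static Map<String, Long> count(String text) {
        Map<String, Long> counts = new TreeMap<>();
        // Map step: split on non-word characters, case-folded.
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                // Reduce step: accumulate the per-word count.
                counts.merge(token, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("Big Data is big"));  // prints {big=2, data=1, is=1}
    }
}
```

In a distributed setting (PCJ or Hadoop), each worker would run the map step on its own chunk of the input and the per-worker maps would then be merged in the reduce step.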



The authors would like to thank the CHIST-ERA consortium for financial support under the HPDCJ project. The Polish contribution is financed through NCN grant 2014/14/Z/ST6/00007. The performance tests have been performed using the computational facilities of ICM, University of Warsaw.



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Marek Nowicki (1)
  • Magdalena Ryczkowska (1, 2)
  • Łukasz Górski (1, 2)
  • Piotr Bała (2)

  1. Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Toruń, Poland
  2. Interdisciplinary Centre for Mathematical and Computational Modeling, University of Warsaw, Warsaw, Poland
