The Journal of Supercomputing, Volume 72, Issue 12, pp 4573–4600

Characterizing and benchmarking stand-alone Hadoop MapReduce on modern HPC clusters

  • Dipti Shankar
  • Xiaoyi Lu
  • Md. Wasi-ur-Rahman
  • Nusrat Islam
  • Dhabaleswar K. Panda


With the emergence of high-performance data analytics, the Hadoop platform is increasingly used to process data stored on high-performance computing (HPC) clusters. These modern clusters are equipped with high-speed interconnects such as InfiniBand and 10/40 GigE, and with storage systems such as SSDs and Lustre, so there is immense scope for improving the performance of Hadoop MapReduce, including its network-intensive shuffle phase. To exploit that scope, it is essential to study the MapReduce component in isolation. In this paper, we study popular MapReduce workloads, drawn from well-accepted, comprehensive benchmark suites, to identify common shuffle data distribution patterns. We determine the environmental and workload-specific factors that affect the performance of a MapReduce job. Based on these characterization studies, we propose a micro-benchmark suite for evaluating the performance of stand-alone Hadoop MapReduce, and demonstrate its ease of use with different networks/protocols, Hadoop distributions, and storage architectures. Performance evaluations with the proposed micro-benchmarks show that stand-alone Hadoop MapReduce over IPoIB performs about 13–15% better than over 10 GigE, and that the RDMA-enhanced hybrid MapReduce design achieves up to 43% performance improvement over default Hadoop MapReduce over IPoIB, in both shared-nothing and shared storage architectures.


Keywords: Big Data, Hadoop MapReduce, Micro-benchmarks, High-performance networks, RDMA, InfiniBand



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  • Dipti Shankar (1)
  • Xiaoyi Lu (1)
  • Md. Wasi-ur-Rahman (1)
  • Nusrat Islam (1)
  • Dhabaleswar K. Panda (1)

  1. Department of Computer Science and Engineering, The Ohio State University, Columbus, USA
