A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters

  • Nusrat Sharmin Islam
  • Xiaoyi Lu
  • Md. Wasi-ur-Rahman
  • Jithin Jose
  • Dhabaleswar K. (DK) Panda
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8163)

Abstract

The Hadoop Distributed File System (HDFS) is the primary storage system of Hadoop. Many applications use HDFS as their underlying file system because of its portability and fault tolerance. The most popular benchmark for measuring the I/O performance of HDFS is TestDFSIO, which involves the MapReduce framework. However, there is no standardized benchmark suite that helps users evaluate the performance of standalone HDFS and compare different networks and cluster configurations. In this paper, we design and develop a micro-benchmark suite that can be used to evaluate the performance of HDFS operations. We also illustrate how this suite can be used to evaluate HDFS installations over different networks/protocols and parameter configurations on modern clusters.
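To make the idea of a standalone I/O micro-benchmark concrete, here is a minimal Python sketch of the general pattern: time a sequential write and a sequential read, then report throughput. This is not the suite described in the paper. It uses the local file system as a stand-in for HDFS, and every name in it (`timed_write`, `timed_read`, `throughput_mib_per_s`, `run_benchmark`) is hypothetical, as are the default file and chunk sizes.

```python
import os
import tempfile
import time


def timed_write(path, total_bytes, chunk_bytes):
    """Sequentially write total_bytes to path; return elapsed seconds."""
    chunk = b"\0" * chunk_bytes
    written = 0
    start = time.perf_counter()
    with open(path, "wb") as f:
        while written < total_bytes:
            n = min(chunk_bytes, total_bytes - written)
            f.write(chunk[:n])
            written += n
        # Flush to the device so the timing is not just a page-cache write.
        f.flush()
        os.fsync(f.fileno())
    return time.perf_counter() - start


def timed_read(path, chunk_bytes):
    """Sequentially read path; return (bytes_read, elapsed seconds)."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            data = f.read(chunk_bytes)
            if not data:
                break
            total += len(data)
    return total, time.perf_counter() - start


def throughput_mib_per_s(num_bytes, seconds):
    """Convert a byte count and a duration into MiB/s."""
    if seconds <= 0:
        return float("inf")
    return (num_bytes / (1024 * 1024)) / seconds


def run_benchmark(total_bytes=8 * 1024 * 1024, chunk_bytes=64 * 1024):
    """Run one write pass and one read pass on a temporary local file.

    Assumption: the local file system stands in for HDFS here.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "bench.dat")
        write_s = timed_write(path, total_bytes, chunk_bytes)
        bytes_read, read_s = timed_read(path, chunk_bytes)
    return {
        "bytes_written": total_bytes,
        "bytes_read": bytes_read,
        "write_mib_per_s": throughput_mib_per_s(total_bytes, write_s),
        "read_mib_per_s": throughput_mib_per_s(bytes_read, read_s),
    }


if __name__ == "__main__":
    for key, value in run_benchmark().items():
        print(f"{key}: {value}")
```

A real HDFS measurement would replace the local `open` calls with an HDFS client and would vary parameters such as file size and the underlying network. The point of the sketch is only that the timed operation runs on its own, without a MapReduce job around it.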

Keywords

Big Data, Hadoop, HDFS, Micro-benchmarks, Clusters and Networks

Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  • Nusrat Sharmin Islam
    • 1
  • Xiaoyi Lu
    • 1
  • Md. Wasi-ur-Rahman
    • 1
  • Jithin Jose
    • 1
  • Dhabaleswar K. (DK) Panda
    • 1
  1. Department of Computer Science and Engineering, The Ohio State University, USA
