Abstract
The Hadoop Distributed File System (HDFS) is the primary storage system of Hadoop. Many applications use HDFS as their underlying file system because of its portability and fault tolerance. The most popular benchmark for measuring the I/O performance of HDFS is TestDFSIO, which involves the MapReduce framework. However, there is no standardized benchmark suite that helps users evaluate the performance of standalone HDFS and compare different networks and cluster configurations. In this paper, we design and develop a micro-benchmark suite for evaluating the performance of HDFS operations. We also illustrate how this suite can be used to evaluate the performance of HDFS installations over different networks/protocols and parameter configurations on modern clusters.
This research is supported in part by National Science Foundation grants #OCI-0926691, #OCI-1148371 and #CCF-1213084.
Copyright information
© 2014 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Islam, N.S., Lu, X., Wasi-ur-Rahman, M., Jose, J., Panda, D.K. (2014). A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters. In: Rabl, T., Poess, M., Baru, C., Jacobsen, H.-A. (eds) Specifying Big Data Benchmarks. WBDB 2012. Lecture Notes in Computer Science, vol 8163. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-53974-9_12
DOI: https://doi.org/10.1007/978-3-642-53974-9_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-53973-2
Online ISBN: 978-3-642-53974-9
eBook Packages: Computer Science, Computer Science (R0)