Abstract
The Hadoop platform for Map-Reduce is used extensively for Big Data batch analytics as well as interactive applications in e-commerce, telecom, media, retail, social networking, and other areas. However, to date no industry-standard benchmarks exist to evaluate the true performance of a Hadoop cluster.
Current open-source Hadoop benchmarks such as HiBench and Terasort fail to capture how a Hadoop cluster is actually used, and how it performs, in a datacenter. Given that typical Hadoop deployments process jobs under strict Service Level Agreement requirements, benchmarks are needed that evaluate the effects of running diverse analytics jobs concurrently, both for performance comparison and for cluster configuration.
In this paper, we present the methodology and development of a customer-usage-representative Hadoop benchmark that includes a mix of job types, a variety of data sizes, and inter-job arrival times typical of a datacenter. We present the details of this benchmark and discuss application-level, micro-architectural, and cluster-level performance characterization on an Intel Sandy Bridge Xeon processor Hadoop cluster.
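The three ingredients named above (a mix of job types, a variety of data sizes, and inter-job arrival times) can be illustrated with a minimal sketch of a job-submission schedule generator. This is not the benchmark's implementation: the job names, mix weights, data sizes, mean arrival gap, and the choice of exponentially distributed inter-arrival times are all hypothetical values chosen for illustration.

```python
import random
from dataclasses import dataclass

# All values below are illustrative assumptions, not taken from the paper.
JOB_TYPES = {"sort": 0.5, "aggregate": 0.3, "join": 0.2}  # hypothetical job mix weights
DATA_SIZES_GB = [1, 10, 100]                               # hypothetical input sizes
MEAN_INTERARRIVAL_S = 30.0                                 # hypothetical mean gap between jobs


@dataclass
class Job:
    submit_time_s: float  # seconds since the start of the run
    job_type: str
    data_size_gb: int


def generate_schedule(num_jobs, seed=0):
    """Return a list of jobs with increasing submission times.

    Inter-arrival gaps are drawn from an exponential distribution
    (an assumption made for this sketch), job types from the weighted
    mix, and data sizes uniformly from DATA_SIZES_GB.
    """
    rng = random.Random(seed)
    names = list(JOB_TYPES)
    weights = list(JOB_TYPES.values())
    clock = 0.0
    schedule = []
    for _ in range(num_jobs):
        # expovariate takes the rate, i.e. 1 / mean gap
        clock += rng.expovariate(1.0 / MEAN_INTERARRIVAL_S)
        job_type = rng.choices(names, weights=weights, k=1)[0]
        size = rng.choice(DATA_SIZES_GB)
        schedule.append(Job(clock, job_type, size))
    return schedule


if __name__ == "__main__":
    for job in generate_schedule(5):
        print(f"{job.submit_time_s:8.1f}s  {job.job_type:<10} {job.data_size_gb} GB")
```

A driver could walk such a schedule and submit each job at its `submit_time_s`, so that jobs of different types and sizes overlap on the cluster rather than running one at a time.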
Acknowledgements
Karthik Krishnan, now with Amazon Web Services at Amazon.com Inc., contributed significantly to the development of this benchmark while he was at Intel. We would also like to thank our manager, Dr. Faye Briggs, Intel Fellow and Chief Server Architect of the Data Center Group, for encouraging us to develop this benchmark for platform performance architecture projections.
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Saletore, V.A., Krishnan, K., Viswanathan, V., Tolentino, M.E. (2014). HcBench: Methodology, Development, and Full-System Characterization of a Customer Usage Representative Big Data/Hadoop Benchmark. In: Rabl, T., Raghunath, N., Poess, M., Bhandarkar, M., Jacobsen, HA., Baru, C. (eds) Advancing Big Data Benchmarks. WBDB 2013. Lecture Notes in Computer Science, vol 8585. Springer, Cham. https://doi.org/10.1007/978-3-319-10596-3_7
DOI: https://doi.org/10.1007/978-3-319-10596-3_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10595-6
Online ISBN: 978-3-319-10596-3
eBook Packages: Computer Science, Computer Science (R0)