Abstract
Performance and scalability in clusters of heterogeneous and complex big data analytics environments are hard to predict. In this paper, we address this problem with a benchmark named “Big DS”. The benchmark adopts ideas from well-known industry benchmarks such as TPC-H [1], TPC-DS [1], SPECvirt_sc2010 [2] and SPECjbb2005 [2], as well as from non-standard benchmarks such as TeraSort [3] and SWIM [4]. By defining a configurable workload for different big data analytics environments, Big DS can be used to measure the performance and scalability of a big data analytics platform or environment for different businesses.
References
TPC. TPC is a trademark of the Transaction Processing Performance Council. TPC-H and TPC-DS are the decision support benchmarks of TPC organization. http://www.tpc.org
SPEC. SPEC is a trademark of the Standard Performance Evaluation Corporation 1995–2014. SPECjbb2005 is the server-side Java benchmark of SPEC.org. SPECjbb2013 is the successor to SPECjbb2005. SPECvirt_sc2010 is the server consolidation virtualization benchmark of SPEC.org. http://www.spec.org
TeraSort. Refer to the Apache TeraSort benchmark, which is a MapReduce version of the Sort benchmark.
SWIM. SWIM stands for Statistical Workload Injector for MapReduce. Its synthesis methodology is adopted in Big DS and its supporting toolset.
Apache Hadoop and its related projects. Apache Hadoop is an open-source software framework for storage and large-scale processing of data sets on clusters of commodity hardware. Hadoop is an Apache top-level project built and used by a global community of contributors and users. It is licensed under the Apache License 2.0.
Apache Hive. Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. It was initially developed by Facebook.
Cloudera Impala. Cloudera Impala is Cloudera’s open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop. http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html
Google BigTable. Refer to Google’s BigTable paper. http://research.google.com/archive/bigtable-osdi06.pdf
Huppler, K.: Chairman TPC. The author of “The art of building a good benchmark” (2009). http://www.tpc.org/tpctc/tpctc2009/tpctc2009-03.pdf
WBDB, Workshop of Big Data Benchmarking, San Jose. http://clds.ucsd.edu/wbdb2012
BigBench. Extends the TPC-DS specification to include unstructured and semi-structured data; modifies the TPC-DS. A data model for BigBench was proposed at the First WBDB Workshop by Ghazal (2012).
Deep Analytic Pipeline. A Benchmark Proposal by Milind Bhandarkar (Pivotal Chief Scientist), (2013). http://clds.sdsc.edu/sites/clds.sdsc.edu/files/2013-03-07-DeepAnalyticsPipeline.pdf
Apache Drill Project. Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Drill is the open source version of Google’s Dremel system, which is available as an IaaS service called Google BigQuery. http://incubator.apache.org/drill/
Google Dremel. http://research.google.com/pubs/pub36632.html
Google BigQuery. https://developers.google.com/bigquery/
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Zhao, JM., Wang, WS., Liu, X., Chen, YF. (2014). Big Data Benchmark - Big DS. In: Rabl, T., Raghunath, N., Poess, M., Bhandarkar, M., Jacobsen, HA., Baru, C. (eds) Advancing Big Data Benchmarks. WBDB 2013. Lecture Notes in Computer Science, vol 8585. Springer, Cham. https://doi.org/10.1007/978-3-319-10596-3_5
DOI: https://doi.org/10.1007/978-3-319-10596-3_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10595-6
Online ISBN: 978-3-319-10596-3
eBook Packages: Computer Science, Computer Science (R0)