The meaning of the word benchmark, according to Andersen and Pettersen (1995), is "a predefined position, used as a reference point for taking measures against." There is no clear formal definition of analytics benchmarks.
Jim Gray (1992) describes benchmarking as follows: "This quantitative comparison starts with the definition of a benchmark or workload. The benchmark is run on several different systems, and the performance and price of each system is measured and recorded. Performance is typically a throughput metric (work/second) and price is typically a five-year cost-of-ownership metric. Together, they give a price/performance ratio." In short, a software benchmark is a program used to compare software products or tools executing on a pre-configured hardware environment.
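For illustration, the price/performance ratio described by Gray can be written as a simple formula; the figures below are invented purely as a numeric example and do not come from any published result:

\[
\text{price/performance} = \frac{\text{5-year cost of ownership}}{\text{throughput}}, \qquad
\text{e.g.,}\quad \frac{\$500{,}000}{10{,}000\ \text{tps}} = \$50\ \text{per tps}
\]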
Analytics benchmarks are a type of domain-specific benchmark targeting analytics in databases, transaction processing, and big data systems. Originally, the TPC (Transaction Processing Performance Council) (TPC 2018) defined online transaction processing (OLTP) benchmarks (TPC-A and TPC-B) and decision support (DS) benchmarks (TPC-D and TPC-H). DS systems can be seen as a special kind of online analytical processing (OLAP) system; an example is the TPC-DS benchmark, the successor of TPC-H (Nambiar and Poess 2006), which specifies many OLAP and data mining queries that are predecessors of the current analytics benchmarks. However, many new data platforms have emerged, such as hybrid transactional/analytical processing (HTAP) systems (Kemper and Neumann 2011; Özcan et al. 2017), distributed parallel processing engines (Sakr et al. 2013; Hadoop 2018; Spark 2018; Flink 2018; Carbone et al. 2015, etc.), big data management systems (AsterixDB 2018; Alsubaiee et al. 2014), SQL-on-Hadoop systems (Abadi et al. 2015; Hive 2018; Thusoo et al. 2009; SparkSQL 2018; Armbrust et al. 2015; Impala 2018; Kornacker et al. 2015, etc.), and analytics systems (Hu et al. 2014) integrating machine learning (MLlib 2018; Meng et al. 2016; MADlib 2018; Hellerstein et al. 2012), deep learning (Tensorflow 2018), and more, and the emerging benchmarks try to follow this trend and stress these new system features. This makes the currently standardized benchmarks (such as TPC-C, TPC-H, etc.) only partially relevant for the emerging big data management systems, whose new features require new analytics benchmarks.
This chapter reviews the evolution of analytics benchmarks and their current state (as of 2017). It starts with an overview of the most relevant benchmarking organizations and their benchmark standards and outlines the latest benchmark developments and initiatives targeting the emerging big data analytics systems. Last but not least, the typical benchmark components are described, as well as the different goals that these benchmarks try to achieve.
OLTP and DSS/OLAP
At the end of the 1970s, many businesses started implementing transaction-based systems (Rockart et al. 1982), which later became known as online transaction processing (OLTP) systems and represent the instant interaction between the user and the data management system. This type of transaction processing system became a key part of companies' operational infrastructure and motivated the TPC (TPC 2018) to target these systems in its first formal benchmark specification. At the same time, decision support systems (DSS) evolved significantly and became a standard tool for enterprises, assisting in human decision-making (Shim et al. 2002).
In the early 1990s, a different type of system, called online analytical processing (OLAP) systems by Codd et al. (1993), was used by the enterprises to dynamically manipulate and synthesize historic information. The historic data was aggregated from the OLTP systems, and through the application of dynamic analysis, the users were able to gain important knowledge for the operational activities over longer periods of time.
Over the years, the DS systems were enhanced by the use of OLAP systems (Shim et al. 2002). They became an essential decision-making tool for enterprise management and a core element of the company infrastructure. With the wide adoption of multipurpose database systems for building both OLTP and DS systems, the need for standardized database benchmarks arose. The intense competition between database vendors to dominate the market led to the need for domain-specific benchmarks to stress the database software. The use of sample workloads together with a set of metrics was not enough to guarantee the product capabilities in a transparent way. Another arising issue was the use of benchmarks for benchmarketing, which happens when a company uses a particular benchmark to highlight the strengths of its product and hide its weaknesses and then promotes the benchmark as a "standard," often without disclosing the details of the benchmark (Gray 1992). All of this created a gap for standardized benchmarks that are formally specified by recognized expert organizations. Therefore, a growing number of organizations are working on defining and standardizing benchmarks. They operate as consortia of public and private organizations and define domain-specific benchmarks, price and performance metrics, and measuring and reporting rules, as well as formal validation and auditing rules.
Active TPC benchmarks (TPC 2018):
- Transaction processing (OLTP): TPC-C, TPC-E
- Decision support (OLAP): TPC-H, TPC-DS, TPC-DI
- Virtualization: TPC-VMS, TPCx-V, TPCx-HCI
- Big data: TPCx-HS V1, TPCx-HS V2, TPCx-BB, TPC-DS V2
Active SPEC benchmarks (SPEC 2018):
- Cloud: SPEC Cloud IaaS 2016
- CPU: SPEC CPU2006, SPEC CPU2017
- Graphics and workstation performance: SPECapc for SolidWorks 2015, SPECapc for Siemens NX 9.0 and 10.0, SPECapc for PTC Creo 3.0, SPECapc for 3ds Max 2015, SPECwpc V2.1, SPECviewperf 12.1
- High-performance computing (OpenMP, MPI, OpenACC, OpenCL): SPEC OMP2012, SPEC MPI2007, SPEC ACCEL
- Java client/server: SPECjvm2008, SPECjms2007, SPECjEnterprise2010, SPECjbb2015
- Virtualization: SPEC VIRT_SC 2013
Active STAC benchmarks (STAC 2018)
Big Data Technologies
In recent years, many emerging data technologies have become popular, trying to solve the challenges posed by the new big data and Internet of Things application scenarios. In a historical overview of the trends in data management technologies, Nambiar et al. (2012) highlight the role of big data technologies and how they are currently changing the industry. One such technology is NoSQL storage engines (Cattell 2011), which relax the ACID (atomicity, consistency, isolation, durability) guarantees but offer faster data access via distributed and fault-tolerant architectures. There are different types of NoSQL engines (key-value, column, graph, and document stores) covering different data representations.
In the meantime, many new big data technologies have emerged, such as (1) Apache Hadoop (2018) with HDFS and MapReduce; (2) general parallel processing engines like Spark (2018) and Flink (2018); (3) SQL-on-Hadoop systems like Hive (2018) and Spark SQL (2018); (4) real-time stream processing engines like Storm (2018), Spark Streaming (2018), and Flink; and (5) graph engines on top of Hadoop like GraphX (2018) and Flink Gelly (2015). All these tools enabled advanced analytical techniques from data science, machine learning, data mining, and deep learning to become common practice in many big data domains. Because these analytical techniques are currently integrated in many different ways in both traditional database systems and new big data management systems, it is hard to define the exact features that a successor of the DS/OLAP systems should have.
Big Data Analytics Benchmarks
YCSB (Cooper et al. 2010)
A benchmark designed to compare emerging cloud serving systems such as Cassandra, HBase, MongoDB, Riak, and many more, which do not support full ACID guarantees. It provides a core package of six predefined workloads (A–F) that simulate a cloud OLTP application (a workload-mix sketch in Python is given after this list)
LinkBench (Armstrong et al. 2013)
A benchmark, developed by Facebook, that uses a synthetic social graph to emulate the social graph workload on top of databases such as MySQL and MongoDB
MRBench (Kim et al. 2008)
A benchmark implementing the TPC-H queries directly as map and reduce operations
CALDA (Pavlo et al. 2009)
It consists of five tasks defined as SQL queries, among which is the original MapReduce Grep task, a representative of most real user MapReduce programs (a map/reduce sketch of this task is given after this list)
AMP lab big data benchmark (AMPLab 2013)
A benchmark based on CALDA and HiBench, implemented on five SQL engines (Redshift, Hive, Stinger/Tez, Shark, and Impala)
BigBench (Ghazal et al. 2013)
An end-to-end big data benchmark based on a data model simulating the volume, velocity, and variety characteristics of a big data system, together with a synthetic data generator for structured, semi-structured, and unstructured data; the workload consists of 30 queries
BigFrame (BigFrame 2013)
BigFrame is a benchmark generator offering a benchmarking-as-a-service solution for big data analytics
PRIMEBALL (Ferrarons et al. 2013)
A novel and unified benchmark specification for comparing the parallel processing frameworks in the context of big data applications hosted in the cloud. It is implementation- and technology-agnostic, using a fictional news hub called New Pork Times, based on a popular real-life news site
BigFUN (Pirzadeh et al. 2015)
It is based on a social network use case with synthetic semi-structured data in JSON format. The benchmark focuses exclusively on the micro-operation level and consists of queries with various operations such as simple retrieves, range scans, aggregations, and joins, as well as inserts and updates
BigBench V2 (Ghazal et al. 2017)
BigBench V2 departs from TPC-DS by adopting a simple data model consisting of only six tables. The new data model still has the variety of structured, semi-structured, and unstructured data of the original BigBench data model. The semi-structured data (web logs) are generated as JSON logs. New queries replace all the TPC-DS queries while preserving the initial number of 30 queries
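To make the workload descriptions above more concrete, the following minimal Python sketch models a YCSB-style Workload A mix (50% reads, 50% updates over a skewed key distribution) against a plain in-memory dictionary; the record count, operation count, and key-skew function are illustrative assumptions and not part of the actual YCSB code.

```python
import random

# Hypothetical parameters; the real YCSB workloads are configured via property files.
RECORD_COUNT = 1000      # number of pre-loaded records
OPERATION_COUNT = 10000  # number of operations in the run phase
READ_PROPORTION = 0.5    # Workload A style: 50% reads, 50% updates

store = {}

def load_phase():
    """Pre-load the key-value store with synthetic records."""
    for i in range(RECORD_COUNT):
        store[f"user{i}"] = {"field0": f"value{i}"}

def skewed_key():
    """Pick a key with a skewed (hot-key) distribution, similar in spirit to YCSB's Zipfian generator."""
    # Squaring a uniform draw biases selection toward low key ids.
    i = int(RECORD_COUNT * random.random() ** 2)
    return f"user{i}"

def run_phase():
    reads = updates = 0
    for _ in range(OPERATION_COUNT):
        key = skewed_key()
        if random.random() < READ_PROPORTION:
            _ = store.get(key)                   # read operation
            reads += 1
        else:
            store[key] = {"field0": "updated"}   # update operation
            updates += 1
    return reads, updates

if __name__ == "__main__":
    load_phase()
    r, u = run_phase()
    print(f"executed {r} reads and {u} updates")
```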
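Similarly, the Grep selection task used by CALDA, and more generally the idea of expressing queries directly as map and reduce operations (as MRBench does for TPC-H), can be sketched as follows; the tiny in-memory "MapReduce" driver and the sample records are hypothetical stand-ins for a real Hadoop job.

```python
from itertools import groupby
from operator import itemgetter

def map_grep(record, pattern="XYZ"):
    """Map phase: emit every record that contains the search pattern (a simple selection)."""
    if pattern in record:
        yield (pattern, record)

def reduce_collect(key, values):
    """Reduce phase: collect all matching records for the key."""
    return (key, list(values))

def run_mapreduce(records, mapper, reducer):
    # Simplified single-process stand-in for a MapReduce engine:
    # map, shuffle (group by key), then reduce.
    intermediate = [kv for record in records for kv in mapper(record)]
    intermediate.sort(key=itemgetter(0))
    return [
        reducer(key, (value for _, value in group))
        for key, group in groupby(intermediate, key=itemgetter(0))
    ]

if __name__ == "__main__":
    data = ["abcXYZdef", "no match here", "XYZ again", "nothing to see"]
    print(run_mapreduce(data, map_grep, reduce_collect))
    # [('XYZ', ['abcXYZdef', 'XYZ again'])]
```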
Big data benchmark suites
MRBS (Sangroya et al. 2012)
A comprehensive benchmark suite for evaluating the performance of MapReduce systems in five areas: recommendations, BI (TPC-H), bioinformatics, text processing, and data mining
HiBench (Huang et al. 2010)
A comprehensive benchmark suite consisting of multiple workloads, including both synthetic micro-benchmarks and real-world applications. It features several ready-to-use benchmarks from four categories: micro benchmarks, Web search, machine learning, and HDFS benchmarks
CloudSuite (Ferdman et al. 2012)
A benchmark suite consisting of both emerging scale-out workloads and traditional benchmarks. The goal of the suite is to analyze and identify key inefficiencies in the processor's core micro-architecture and memory system organization when running today's cloud workloads
CloudRank-D (Luo et al. 2012)
A benchmark suite for evaluating the performance of cloud computing systems running big data applications. The suite consists of 13 representative data analysis tools, which are designed to address a diverse set of workload data and computation characteristics (i.e., data semantics, data models and data sizes, the ratio of the size of data input to that of data output)
BigDataBench (Wang et al. 2014)
An open-source big data benchmark suite consisting of 15 data sets (of different types) and more than 33 workloads. It is a large effort organized in China and comes with a toolkit that incorporates several other benchmarks
SparkBench (Li et al. 2015)
A comprehensive Spark-specific benchmark suite, developed by IBM, that comprises four main workload categories: machine learning, graph processing, streaming, and SQL queries (an illustrative query of this kind is sketched below)
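As an illustration of the kind of SQL workload that suites like SparkBench run, the following sketch submits a simple aggregation query through the Spark DataFrame API; it assumes a local pyspark installation, and the tiny sales table is made up for the example rather than produced by any suite's data generator.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Local Spark session; a benchmark suite would instead submit jobs to a cluster.
spark = SparkSession.builder.appName("mini-sql-workload").getOrCreate()

# Hypothetical tiny data set standing in for a generated benchmark table.
rows = [("books", 12.5), ("books", 7.0), ("music", 3.2), ("music", 9.9), ("games", 59.0)]
sales = spark.createDataFrame(rows, ["category", "amount"])

# A representative aggregation query, similar in shape to the SQL workloads in such suites.
result = (
    sales.groupBy("category")
         .agg(F.sum("amount").alias("total"), F.count("*").alias("orders"))
         .orderBy(F.desc("total"))
)
result.show()

spark.stop()
```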
A typical benchmark specification (as standardized, e.g., by the TPC) consists of the following components:
Preamble – Defines the benchmark domain and the high-level requirements.
Database Design – Defines the requirements and restrictions for implementing the database schema.
Workload – Characterizes the simulated workload.
ACID – Atomicity, consistency, isolation, and durability requirements.
Workload scaling – Defines tools and methodology on how to scale the workloads.
Metric/Execution rules – Defines how to execute the benchmark and how to calculate and derive the metrics.
Benchmark driver – Defines the requirements for implementing the benchmark driver/program (a minimal driver sketch is given after this list).
Full disclosure report – Defines what needs to be reported and how to organize the disclosure report.
Audit requirements – Defines the requirements for performing a successful auditing process.
System under test – Describes the system architecture with its hardware and software components and their configuration requirements.
Pricing – Defines the pricing of the components in the system under test including the system maintenance.
Energy – Defines the methodology, rules, and metrics to measure the energy consumption of the system under test in the TPC benchmarks.
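The benchmark driver, workload scaling, and metric/execution rules components can be illustrated with a deliberately simplified, single-threaded Python driver; the synthetic workload, scale factors, and queries-per-second metric below are assumptions made only for this sketch and are far simpler than what real specifications require.

```python
import time
import random

def generate_workload(scale_factor):
    """Workload scaling: the number of synthetic queries grows with the scale factor."""
    base_queries = 1000
    return [random.randint(0, 10**6) for _ in range(base_queries * scale_factor)]

def execute_query(value):
    """Stand-in for a real query against the system under test."""
    return sum(range(value % 1000))  # small, deterministic amount of work

def run_benchmark(scale_factor):
    workload = generate_workload(scale_factor)

    start = time.perf_counter()      # execution rule: measure only the run phase
    for query in workload:
        execute_query(query)
    elapsed = time.perf_counter() - start

    # Metric: throughput in queries per second at the given scale factor.
    return {"scale_factor": scale_factor,
            "queries": len(workload),
            "elapsed_s": round(elapsed, 3),
            "throughput_qps": round(len(workload) / elapsed, 1)}

if __name__ == "__main__":
    for sf in (1, 2, 4):
        print(run_benchmark(sf))
```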
Gray (1992) defines four important criteria that a domain-specific benchmark has to meet:
Relevant: It must measure the peak performance and price/performance of systems when performing typical operations within that problem domain.
Portable: It should be easy to implement the benchmark on many different systems and architectures.
Scalable: The benchmark should apply to small and large computer systems. It should be possible to scale the benchmark up to larger systems and to parallel computer systems as computer performance and architecture evolve.
Simple: The benchmark must be understandable/interpretable; otherwise it will lack credibility.
Similarly, Huppler (2009) lists five key characteristics that good benchmarks share:
Relevant – A reader of the result believes the benchmark reflects something important.
Repeatable – There is confidence that the benchmark can be run a second time with the same result.
Fair – All systems and/or software being compared can participate equally.
Verifiable – There is confidence that the documented result is real.
Economical – The test sponsors can afford to run the benchmark.
In reality, many of the new benchmarks (listed in the tables above) do not have clear specifications and do not follow the practices defined by Gray (1992) and Huppler (2009) but only provide a workload implementation that can be used in many scenarios. This raises the problem that the reported benchmark results are not really comparable and depend strictly on the environment in which they were obtained.
In terms of component specification, the situation looks similar. All TPC benchmarks use synthetic data generators, which allow for scalable and deterministic workload generation. However, many new benchmarks use open data sets or real workload traces, like BigDataBench (Wang et al. 2014), or a mix between real data and synthetically generated data. This also influences the metrics reported by these benchmarks: they are often not clearly specified or are very simplistic (like execution time) and cannot be used for an accurate comparison between different environments.
The ongoing evolution in the big data systems and the data science, machine learning, and deep learning tools and techniques will open many new challenges and questions in the design and specification of standardized analytics benchmarks. There is a growing need for new standardized big data analytics benchmarks and metrics.
Benchmarks are typically used with one of the following goals:
To compare different software and hardware systems: The goal is to use the metric reported by the benchmark as a comparable unit for evaluating the performance of different data technologies on different hardware running the same application. This case represents the classical competitive situation between hardware vendors.
To compare different software on one machine: The goal is to use the benchmark to evaluate the performance of two different software products running on the same hardware environment. This case represents the classical competitive situation between software vendors.
To compare different machines in a comparable family: The objective is to compare similar hardware environments by running the same software product and application benchmark on each of them. This case represents a comparison of different generations of a vendor's hardware or a comparison across different hardware vendors.
To compare different releases of a product on one machine: The objective is to compare different releases of a software product by running benchmark experiments on the same hardware. Ideally, a new release should perform faster (based on the benchmark metric) than its predecessors. This can also be seen as a performance regression test that assures the new release still supports all previous system features (see the comparison sketch below).
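The comparison use cases above can be illustrated with a small script that takes the throughput metric reported by the same benchmark on two releases (or two systems) and derives a speedup, a price/performance figure, and a regression flag; all numbers are invented for the example.

```python
# Hypothetical benchmark results: same benchmark, same hardware, two product releases.
results = {
    "release_1.0": {"throughput_qps": 1250.0, "price_usd": 100_000},
    "release_2.0": {"throughput_qps": 1430.0, "price_usd": 100_000},
}

baseline, candidate = results["release_1.0"], results["release_2.0"]

speedup = candidate["throughput_qps"] / baseline["throughput_qps"]
price_perf_baseline = baseline["price_usd"] / baseline["throughput_qps"]
price_perf_candidate = candidate["price_usd"] / candidate["throughput_qps"]

print(f"speedup: {speedup:.2f}x")
print(f"price/performance: {price_perf_baseline:.2f} -> {price_perf_candidate:.2f} USD per qps")

# A simple regression check, as used in release-to-release performance testing.
if speedup < 1.0:
    print("WARNING: performance regression detected")
```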
- Abadi D, Babu S, Ozcan F, Pandis I (2015) Tutorial: SQL-on-Hadoop systems. PVLDB 8(12):2050–2051
- Agrawal D, Butt AR, Doshi K, Larriba-Pey J, Li M, Reiss FR, Raab F, Schiefer B, Suzumura T, Xia Y (2015) SparkBench – a spark performance testing suite. In: TPCTC, pp 26–44
- Alsubaiee S, Altowim Y, Altwaijry H, Behm A, Borkar VR, Bu Y, Carey MJ, Cetindil I, Cheelangi M, Faraaz K, Gabrielova E, Grover R, Heilbron Z, Kim Y, Li C, Li G, Ok JM, Onose N, Pirzadeh P, Tsotras VJ, Vernica R, Wen J, Westmann T (2014) Asterixdb: a scalable, open source BDMS. PVLDB 7(14):1905–1916
- AMPLab (2013) https://amplab.cs.berkeley.edu/benchmark/
- Andersen B, Pettersen PG (1995) Benchmarking handbook. Chapman & Hall, London
- Armbrust M, Xin RS, Lian C, Huai Y, Liu D, Bradley JK, Meng X, Kaftan T, Franklin MJ, Ghodsi A, Zaharia M (2015) Spark SQL: relational data processing in spark. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data, Melbourne, 31 May–4 June 2015, pp 1383–1394
- Armstrong TG, Ponnekanti V, Borthakur D, Callaghan M (2013) Linkbench: a database benchmark based on the Facebook social graph. In: Proceedings of the ACM SIGMOD international conference on management of data, SIGMOD 2013, New York, 22–27 June 2013, pp 1185–1196
- AsterixDB (2018) https://asterixdb.apache.org
- BigFrame (2013) https://github.com/bigframeteam/BigFrame/wiki
- Bog A (2013) Benchmarking transaction and analytical processing systems: the creation of a mixed workload benchmark and its application. PhD thesis. http://d-nb.info/1033231886
- Codd EF, Codd SB, Salley CT (1993) Providing OLAP (On-line analytical processing) to user-analysis: an IT mandate. White paper
- Cooper BF, Silberstein A, Tam E, Ramakrishnan R, Sears R (2010) Benchmarking cloud serving systems with YCSB. In: Proceedings of the 1st ACM symposium on cloud computing, SoCC 2010, Indianapolis, 10–11 June 2010, pp 143–154
- Ferdman M, Adileh A, Koçberber YO, Volos S, Alisafaee M, Jevdjic D, Kaynak C, Popescu AD, Ailamaki A, Falsafi B (2012) Clearing the clouds: a study of emerging scale-out workloads on modern hardware. In: Proceedings of the 17th international conference on architectural support for programming languages and operating systems, ASPLOS, pp 37–48
- Ferrarons J, Adhana M, Colmenares C, Pietrowska S, Bentayeb F, Darmont J (2013) PRIMEBALL: a parallel processing framework benchmark for big data applications in the cloud. In: TPCTC, pp 109–124
- Flink (2018) https://flink.apache.org/
- Ghazal A, Rabl T, Hu M, Raab F, Poess M, Crolotte A, Jacobsen H (2013) Bigbench: towards an industry standard benchmark for big data analytics. In: Proceedings of the ACM SIGMOD international conference on management of data, SIGMOD 2013, New York, 22–27 June 2013, pp 1197–1208
- Ghazal A, Ivanov T, Kostamaa P, Crolotte A, Voong R, Al-Kateb M, Ghazal W, Zicari RV (2017) Bigbench V2: the new and improved bigbench. In: 33rd IEEE international conference on data engineering, ICDE 2017, San Diego, 19–22 Apr 2017, pp 1225–1236
- GraphX (2018) https://spark.apache.org/graphx/
- Hadoop (2018) https://hadoop.apache.org/
- Hellerstein JM, Ré C, Schoppmann F, Wang DZ, Fratkin E, Gorajek A, Ng KS, Welton C, Feng X, Li K, Kumar A (2012) The MADlib analytics library or MAD skills, the SQL. PVLDB 5(12):1700–1711
- Hive (2018) https://hive.apache.org/
- Hogan T (2009) Overview of TPC benchmark E: the next generation of OLTP benchmarks. In: Performance evaluation and benchmarking, first TPC technology conference, TPCTC 2009, Lyon, 24–28 Aug 2009, Revised Selected Papers, pp 84–98
- Huang S, Huang J, Dai J, Xie T, Huang B (2010) The HiBench benchmark suite: characterization of the MapReduce-based data analysis. In: Workshops proceedings of the 26th IEEE ICDE international conference on data engineering, pp 41–51
- Huppler K (2009) The art of building a good benchmark. In: Nambiar RO, Poess M (eds) Performance evaluation and benchmarking. Springer, Berlin/Heidelberg, pp 18–30
- Impala (2018) https://impala.apache.org/
- Ivanov T, Rabl T, Poess M, Queralt A, Poelman J, Poggi N, Buell J (2015) Big data benchmark compendium. In: TPCTC, pp 135–155
- Kemper A, Neumann T (2011) Hyper: a hybrid OLTP&OLAP main memory database system based on virtual memory snapshots. In: Proceedings of the 27th international conference on data engineering, ICDE 2011, Hannover, 11–16 Apr 2011, pp 195–206
- Kim K, Jeon K, Han H, Kim SG, Jung H, Yeom HY (2008) Mrbench: a benchmark for mapreduce framework. In: 14th international conference on parallel and distributed systems, ICPADS 2008, Melbourne, 8–10 Dec 2008, pp 11–18
- Kornacker M, Behm A, Bittorf V, Bobrovytsky T, Ching C, Choi A, Erickson J, Grund M, Hecht D, Jacobs M, Joshi I, Kuff L, Kumar D, Leblang A, Li N, Pandis I, Robinson H, Rorke D, Rus S, Russell J, Tsirogiannis D, Wanderman-Milne S, Yoder M (2015) Impala: a modern, open-source SQL engine for Hadoop. In: CIDR 2015, seventh biennial conference on innovative data systems research, Asilomar, 4–7 Jan 2015, Online proceedings
- Li M, Tan J, Wang Y, Zhang L, Salapura V (2015) SparkBench: a comprehensive benchmarking suite for in memory data analytic platform spark. In: Proceedings of the 12th ACM international conference on computing frontiers, pp 53:1–53:8
- MADlib (2018) https://madlib.apache.org/
- Meng X, Bradley JK, Yavuz B, Sparks ER, Venkataraman S, Liu D, Freeman J, Tsai DB, Amde M, Owen S, Xin D, Xin R, Franklin MJ, Zadeh R, Zaharia M, Talwalkar A (2016) Mllib: machine learning in Apache spark. J Mach Learn Res 17:34:1–34:7
- MLlib (2018) https://spark.apache.org/mllib/
- Nambiar RO, Poess M (2006) The making of TPC-DS. In: Proceedings of the 32nd international conference on very large data bases, Seoul, 12–15 Sept 2006, pp 1049–1058
- Özcan F, Tian Y, Tözün P (2017) Hybrid transactional/analytical processing: a survey. In: Proceedings of the 2017 ACM international conference on management of data, SIGMOD conference 2017, Chicago, 14–19 May 2017, pp 1771–1775
- Patil S, Polte M, Ren K, Tantisiriroj W, Xiao L, López J, Gibson G, Fuchs A, Rinaldi B (2011) YCSB++: benchmarking and performance debugging advanced features in scalable table stores. In: ACM symposium on cloud computing in conjunction with SOSP 2011, SOCC'11, Cascais, 26–28 Oct 2011, p 9
- Pavlo A, Paulson E, Rasin A, Abadi DJ, DeWitt DJ, Madden S, Stonebraker M (2009) A comparison of approaches to large-scale data analysis. In: Proceedings of the ACM SIGMOD international conference on management of data, SIGMOD 2009, Providence, 29 June–2 July 2009, pp 165–178
- Pirzadeh P, Carey MJ, Westmann T (2015) BigFUN: a performance study of big data management system functionality. In: 2015 IEEE international conference on big data, pp 507–514
- Poess M (2012) Tpc's benchmark development model: making the first industry standard benchmark on big data a success. In: Specifying big data benchmarks – first workshop, WBDB 2012, San Jose, 8–9 May 2012, and second workshop, WBDB 2012, Pune, 17–18 Dec 2012, Revised Selected Papers, pp 1–10
- Poess M, Rabl T, Jacobsen H, Caufield B (2014) TPC-DI: the first industry benchmark for data integration. PVLDB 7(13):1367–1378
- Poess M, Rabl T, Jacobsen H (2017) Analysis of TPC-DS: the first standard benchmark for SQL-based big data systems. In: Proceedings of the 2017 symposium on cloud computing, SoCC 2017, Santa Clara, 24–27 Sept 2017, pp 573–585
- Pöss M, Nambiar RO, Walrath D (2007) Why you should run TPC-DS: a workload analysis. In: Proceedings of the 33rd international conference on very large data bases, University of Vienna, 23–27 Sept 2007, pp 1138–1149
- Raab F (1993) TPC-C – the standard benchmark for online transaction processing (OLTP). In: Gray J (ed) The benchmark handbook for database and transaction systems, 2nd edn. Morgan Kaufmann, San Mateo
- Sangroya A, Serrano D, Bouchenak S (2012) MRBS: towards dependability benchmarking for Hadoop MapReduce. In: Euro-Par: parallel processing workshops, pp 3–12
- Sethuraman P, Taheri HR (2010) TPC-V: a benchmark for evaluating the performance of database applications in virtual environments. In: Performance evaluation, measurement and characterization of complex systems – second TPC technology conference, TPCTC 2010, Singapore, 13–17 Sept 2010. Revised Selected Papers, pp 121–135
- Spark (2018) https://spark.apache.org
- SparkSQL (2018) https://spark.apache.org/sql/
- SparkStreaming (2018) https://spark.apache.org/streaming/
- SPEC (2018) www.spec.org/
- STAC (2018) www.stacresearch.com/
- Storm (2018) https://storm.apache.org/
- Tensorflow (2018) https://tensorflow.org
- Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive – a warehousing solution over a map-reduce framework. PVLDB 2(2):1626–1629
- TPC (2018) www.tpc.org/
- Wang L, Zhan J, Luo C, Zhu Y, Yang Q, He Y, Gao W, Jia Z, Shi Y, Zhang S, Zheng C, Lu G, Zhan K, Li X, Qiu B (2014) BigDataBench: a big data benchmark suite from internet services. In: 20th IEEE international symposium on high performance computer architecture, HPCA 2014, pp 488–499