Introducing TPCx-HS: The First Industry Standard for Benchmarking Big Data Systems
- Cite this paper as:
- Nambiar R. et al. (2015) Introducing TPCx-HS: The First Industry Standard for Benchmarking Big Data Systems. In: Nambiar R., Poess M. (eds) Performance Characterization and Benchmarking. Traditional to Big Data. TPCTC 2014. Lecture Notes in Computer Science, vol 8904. Springer, Cham
The designation Big Data has become a mainstream buzz phrase across many industries as well as research circles. Today many companies make performance claims that are not easily verifiable or comparable in the absence of a neutral industry benchmark. Instead, one of the test suites commonly used to compare the performance of Hadoop-based Big Data systems is TeraSort. While TeraSort nicely defines the data set and tasks for measuring Big Data Hadoop systems, it lacks a formal specification and enforcement rules that enable the comparison of results across systems. In this paper we introduce TPCx-HS, the industry’s first standard benchmark designed to stress both the hardware and software of Apache HDFS API compatible distributions. TPCx-HS extends the workload defined in TeraSort with formal rules for implementation, execution, metrics, result verification, publication and pricing. It can be used to assess a broad range of system topologies and implementation methodologies of Big Data Hadoop systems in a technically rigorous, directly comparable and vendor-neutral manner.
Keywords: TPC · Big Data · Industry standard · Benchmark
Big Data technologies like Hadoop have become an important part of the enterprise IT ecosystem. TPC Express Benchmark™ HS (TPCx-HS) was developed to provide an objective measure of hardware, operating system and commercial Apache HDFS API compatible software distributions. TPCx-HS is the TPC’s first benchmark developed in the TPC Express Benchmark™ category [1, 2, 3]. TPCx-HS is based on the well-known and respected workload defined in TeraSort, extended with formal rules for implementation, execution, metrics, result verification, publication and pricing, thereby providing the industry with verifiable performance, price-performance and availability metrics. The benchmark models continuous system availability 24 hours a day, 7 days a week.
Even though the modeled application is simple, the results are highly relevant to hardware and software dealing with Big Data systems in general. TPCx-HS stresses both hardware and software, including the Hadoop run-time, Hadoop File System API compatible systems and MapReduce layers. The workload can be used to assess a broad range of system topologies and implementation methodologies of Hadoop clusters in a technically rigorous, directly comparable and vendor-neutral manner.
2 Introduction to TeraSort
Until 2007, Jim Gray defined, sponsored, and administered a number of sort benchmarks available to the general community. These include Minute Sort, Gray Sort, Penny Sort, Joule Sort, Datamation Sort and TeraByte Sort. TeraByte Sort measures the amount of time taken (in minutes) to sort 1 TB (10^12 bytes) of data.
In 2009, Owen O’Malley et al. of Yahoo! Inc. published the results of a MapReduce implementation of TeraByte Sort called TeraSort. It was implemented using the Hadoop MapReduce framework, and consists of three Hadoop MapReduce applications called TeraGen, TeraSort, and TeraValidate, described here.
TeraGen performs the task of generating the input data to be sorted. It generates exactly the same data, byte for byte, as the data generation application originally defined by the TeraByte Sort benchmark, which was written in C. It is implemented using multiple map tasks, in which each map instance is assigned a portion of the keys to generate. This is done by starting the random generator with the same seed on each mapper, each of which skips the generated numbers until it reaches its target record range.
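The seeded skip-ahead scheme above can be sketched as follows. This is an illustrative Python model, not the actual Hadoop implementation; the record layout (10-byte key plus 90-byte filler) and the `generate_partition` helper are assumptions for the example.

```python
import random

RECORD_SIZE = 100  # bytes per record, as in TeraSort

def generate_partition(seed, start_row, num_rows):
    """Sketch of TeraGen's partitioned generation: every mapper seeds
    the same RNG, then skips ahead to its assigned row range, so the
    combined output is deterministic byte for byte."""
    rng = random.Random(seed)
    # Skip the draws belonging to rows assigned to earlier mappers.
    for _ in range(start_row):
        rng.getrandbits(80)                    # one draw per skipped record
    records = []
    for row in range(start_row, start_row + num_rows):
        key = rng.getrandbits(80).to_bytes(10, "big")   # 10-byte random key
        payload = ("%090d" % row).encode()              # filler to 100 bytes
        records.append(key + payload)
    return records
```

Because each mapper reproduces and discards the earlier draws, two mappers covering disjoint ranges emit exactly what a single mapper covering the union would.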
TeraSort uses the regular map-reduce sort, except for a custom partitioner that uses N-1 sampled keys to split the mapper output so that each of the N reducers receives records with keys k such that sample[i-1] <= k < sample[i], where i is the reducer instance number. The key sampling is performed before the actual sorting is done, and the samples are written to HDFS to be used during the sorting process.
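A minimal sketch of such a range partitioner, assuming sorted sample keys and the `make_partitioner` helper introduced here for illustration:

```python
import bisect

def make_partitioner(sample_keys):
    """Sketch of TeraSort's custom partitioner: N-1 sorted sample keys
    split the key space into N ranges, one per reducer, so that a
    per-reducer sort yields a globally sorted output."""
    splits = sorted(sample_keys)
    def partition(key):
        # Reducer i receives keys k with splits[i-1] <= k < splits[i].
        return bisect.bisect_right(splits, key)
    return partition
```

Concatenating each reducer's locally sorted bucket in reducer order then gives a total order over all records.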
TeraValidate validates that the output is sorted globally. This is done by having one mapper validate the contents of each sorted output file from TeraSort, ensuring that its keys are ordered. When the mapper is done with an output file, it emits one record consisting of the first and last key it consumed. The reduce process then takes the output from each mapper and makes sure that there is no overlap between mapper outputs, ensuring that the files are globally sorted.
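The two-step check can be sketched as below; the `validate_sorted` helper and its list-of-lists input are stand-ins for the per-file mappers and the single reducer:

```python
def validate_sorted(files):
    """Sketch of TeraValidate: each 'mapper' verifies one output file is
    internally ordered and emits its (first, last) keys; the 'reducer'
    then checks that consecutive files do not overlap."""
    boundaries = []
    for keys in files:                       # map step: one file each
        assert all(a <= b for a, b in zip(keys, keys[1:])), "file not sorted"
        boundaries.append((keys[0], keys[-1]))
    for (_, last), (first, _) in zip(boundaries, boundaries[1:]):
        assert last <= first, "overlap between output files"   # reduce step
    return True
```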
The metric is one of the fundamental components of any benchmark definition and probably the most controversial when trying to reach agreement between different companies. The execution rules define the way a benchmark is executed, while the metric emphasizes the pieces that are measured. The TPC is best known for providing robust, simple and verifiable performance data. The most visible part of the performance data is the performance metric. Producing benchmark results is expensive and time consuming. Hence, the TPC’s goal is to provide a robust performance metric that allows system performance comparisons over an extended period, thereby preserving the investments companies make in publishing benchmarks.
In general, a performance metric needs to be simple so that easy system comparisons are possible. If there are multiple performance metrics (e.g. A, B, C), system comparisons are difficult because vendors can claim they perform well on some of the metrics (e.g. A and C). This might still be acceptable if all components are equally important; however, without this determination there would be much debate on the issue. In order to unambiguously rank results, the TPC benchmarks focus on a single primary performance metric, which encompasses all aspects of a system’s performance, weighing each individual component. Taking the example from above, the performance metric M is calculated as a function of the three components A, B and C (e.g. M = f(A,B,C)). Consequently, the TPC’s performance metrics measure system and overall workload performance rather than individual component performance. In addition to the performance metric, the TPC also includes other metrics, such as price-performance metrics.
The TPC distinguishes between Primary and Secondary Metrics. Each TPC Express Benchmark Standard must define Primary Metrics selected to represent the workload being measured. The Primary Metrics must include both performance and price/performance metrics.
HSph@SF: Composite Performance Metric, reflecting the TPCx-HS throughput;
where SF is the Scale Factor;
$/HSph@SF: Price-Performance metric;
System availability Date.
TG, Data generation phase completion time with HSGen reported in hh:mm:ss format;
TS, Data sort phase completion time with HSSort reported in hh:mm:ss format;
TV, Data validation phase completion time reported in hh:mm:ss format;
Each secondary metric shall be referenced in conjunction with the scale factor at which it was achieved. For example, TPCx-HS TG references shall take the form of TPCx-HS TG @ SF, or “TPCx-HS TG = 2 h @ 1”.
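A small sketch of how the primary metrics combine, assuming the composite form HSph@SF = SF / (T/3600) with T the total elapsed time of the performance run in seconds (the reader should consult the TPCx-HS specification for the exact rounding and reporting rules; the function names here are illustrative):

```python
def hsph(scale_factor, t_seconds):
    """Assumed composite performance metric: Scale Factor divided by the
    end-to-end elapsed run time expressed in hours."""
    return scale_factor / (t_seconds / 3600.0)

def price_per_hsph(total_price_usd, scale_factor, t_seconds):
    """$/HSph@SF sketch: 3-year total system cost divided by throughput."""
    return total_price_usd / hsph(scale_factor, t_seconds)
```

For example, under this assumed formula a 1 TB run completing in one hour yields HSph@SF = 1, and halving the elapsed time doubles the metric.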
The System Availability Date is defined in the TPC Pricing Specification. A TPCx-HS Result is only comparable with other TPCx-HS Results of the same Scale Factor.
Results at the different scale factors are not comparable, due to the substantially different computational challenges found at different data volumes. Similarly, the system price/performance may not scale down linearly with a decrease in dataset size due to configuration changes required by changes in dataset size.
If results measured against different dataset sizes (i.e., with different scale factors) appear in a printed or electronic communication, then each reference to a result or metric must clearly indicate the dataset size against which it was obtained. In particular, all textual references to TPCx-HS metrics (performance or price/performance) must be expressed in a form that includes the size of the test dataset as an integral part of the metric’s name, i.e. including the “@SF” suffix. This applies to metrics quoted in text or tables as well as those used to annotate charts or graphs. If metrics are presented in graphical form, then the test dataset size on which the metric is based must be immediately discernible, either by appropriate axis labeling or data point labeling.
In addition, the results must be accompanied by a disclaimer stating: “The TPC believes that comparisons of TPCx-HS results measured against different dataset sizes are misleading and discourages such comparisons”.
TPC Benchmarks™ are intended to provide a fair and honest comparison of various vendor implementations to accomplish an identical, controlled and repeatable task. The pricing for these implementations must also allow a fair and honest comparison for customers to review.
The cost associated with achieving a particular TPCx-HS benchmark score is an important piece of information for decision makers. The pricing gives the total hardware, software and maintenance prices of the total system for 3 years. Because Hadoop systems are built on massively scaled-out configurations, including pricing in the benchmark attaches a consequence to using large amounts of hardware and software resources to achieve a high score, by showing how much it would cost to achieve that score. A published benchmark may show an attractive level of performance, but if the hardware configuration required to achieve it is overly expensive, then the attractiveness of that result is reduced.
The TPCx-HS price/performance metric provides a way to compare the effectiveness of the published results by showing how much it costs to achieve each unit of performance. This metric can be used to compare the effectiveness of each published result regardless of the size of the configuration. It also provides additional value to the TPCx-HS benchmark by providing the opportunity to focus benchmark publications on this metric rather than highest performance.
Therefore, the ability to have pricing in the TPCx-HS benchmark is another way the TPC adds additional value to the TeraSort workload.
Historically, the TPC required an independent audit before publishing a benchmark result. Recently the TPC classified its benchmarks into two categories - Enterprise and Express. While independent audits are required for Enterprise benchmarks, either an independent audit or a peer review process can be used for Express benchmarks.
The auditing agency cannot be financially related to the sponsor. For example, the auditing agency is financially related if it is a dependent division, the majority of its stock is owned by the sponsor, etc.
The auditing agency cannot be financially related to any one of the suppliers of the measured/priced components, e.g., the DBMS supplier, the terminal or terminal concentrator supplier, etc.
A peer review audit is the evaluation of a submitted benchmark result by one or more groups of members in the relevant subcommittee. It comprises a method of reviewing results by members of the relevant subcommittee. This peer review technique is implemented to verify the submitted benchmark’s compliance with the specification. One can draw a parallel between this peer review audit and the academic peer review process used to assess a paper for publication in a journal.
Following the publication of the benchmark, the benchmark is available for additional review for a period of 60 days during which TPC member companies can challenge the result of the benchmark or any other benchmarks still within their peer review period.
With the fast-changing landscape of data, applications and workloads, the TPC is developing new benchmarks and reconsidering the adoption of other methods of auditing to complement the current method. One of these is the peer review audit described above.
4.1 The Pros and Cons of Independent and Peer-Review Audits
One of the main advantages of an independent audit is the discreet nature of the relationship between the sponsoring party and the auditor during the auditing process. The exchange between the two parties is confidential, as it should be, since the information revealed during this interaction often comprises a company’s proprietary data and intellectual property. Thus, the benchmark outcome is kept secret until the benchmark has passed the audit and is published. The confidential nature of the audit becomes very important when the benchmark result is part of a company’s key announcement, such as a new product launch, customer event, etc. To prepare for these events, companies spend a significant amount of time, engineering effort and resources, often leading to an announcement that is part of a company’s strategy. Therefore, it is essential that information regarding the benchmark result not be revealed prematurely, as it would be under a peer review audit.
Another advantage of independent audit is the auditor’s assistance in the whole process to make sure the benchmark result is compliant. A “friendly” auditor works with the company toward this goal. On the contrary, the peer review audit is seen as a very competitive process in which companies try to dislodge the competitor’s result during their faultfinding mission. It can be very disruptive to the company who is trying to obtain benchmark results for a product launch or company event.
Another advantage of an independent audit is consultation with the auditor on various compliance questions, such as a hardware setting that may violate the specification, a software parameter that has a known performance gain but is not publicly available, or a product used in the benchmark that is not yet supported.
The auditor also offers confidential proxy interactions between the company and the TPC. This often arises during requests for interpretation on difficult topics, such as the implementation of a novel technique not seen in previous benchmarks. Additionally, the auditor provides a complete review of the benchmark configuration, the test application protocol, and the benchmark results in accordance with the TPC-provided auditing lists.
The independent, certified auditor’s experience and knowledge - due to their participation over the years with the TPC - lend credibility to the auditing method. This raises the question: why then do we need to consider the peer review audit method if the independent audit satisfies the needs?
Paradoxically, the independent auditor’s useful assistance, experience, and knowledge offered during the audit are also the basis for the method’s weakness. The auditor’s usefulness increases the cost of the benchmark at a time when companies are looking for ways to cut costs. The “free” peer review audit begins to look more attractive in cases where the parties are willing to forgo the confidentiality of the benchmark. Aside from the added value of the independent audit listed above, the peer review audit can meet the other requirements outlined in the auditing lists.
One of the main advantages of the peer review audit is the rigor of being evaluated by multiple parties whose interests are diverse.
Hence, the TPCx-HS has adopted the peer review method to augment the traditional independent audit. This approach offers options that companies can choose to fit their needs. In general, the TPC provides flexibility in auditing, while addressing confidentiality, cost savings, rigor and ease of benchmarking as it adapts to the ever-changing world of transaction processing.
5 Sizing and Scale Factors
TPCx-HS follows a stepped benchmark sizing model. Unlike TeraSort, which can be scaled using an arbitrary number of rows in the dataset, TPCx-HS limits the choices to one of the following: 10 B, 30 B, 100 B, 300 B, 1000 B, 3000 B, 10000 B, 30000 B and 100000 B rows, where each row/record is 100 bytes. In TPCx-HS these dataset sizes are referred to in terms of Scale Factors, which are defined as follows: 1 TB, 3 TB, 10 TB, 30 TB, 100 TB, 300 TB, 1000 TB, 3000 TB and 10000 TB. For example, a 3 TB Scale Factor corresponds to a dataset with 30 billion rows. The primary motivation for choosing a stepped design in benchmark sizing is to ease the comparison of results across different systems. However, it should be noted that results at different Scale Factors are not comparable to each other due to the substantially different computational challenges found at different data volumes. Similarly, the system price/performance may not scale down linearly with a decrease in dataset size due to configuration changes required by changes in dataset size.
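The Scale Factor to row-count mapping above is a direct consequence of the 100-byte record size; a one-line helper (the function name is illustrative) makes the arithmetic explicit:

```python
ROW_BYTES = 100  # fixed record size in TeraSort/TPCx-HS

def rows_for_scale_factor(sf_tb):
    """A Scale Factor of N TB corresponds to N * 10**12 bytes of
    100-byte records, i.e. N * 10**10 rows (e.g. 3 TB -> 30 billion)."""
    return sf_tb * 10**12 // ROW_BYTES
```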
6 Benchmark Execution
HSGen generates the input dataset at a particular Scale Factor.
HSSort sorts the input dataset in total order.
HSValidate validates the output dataset is globally sorted.
HSDataCheck verifies the cardinality, size and replication factor of the dataset.
HSGen, HSSort and HSValidate are based on TeraGen, TeraSort and TeraValidate (as described in Sect. 2) respectively. The TPCx-HS kit also includes HSDataCheck, which verifies that the dataset generated by HSGen and the output produced by HSSort match the specified Scale Factor.
Generation of input data via HSGen.
Verification (cardinality, size and replication) of the input data via HSDataCheck.
Sorting the input data using HSSort.
Verification (cardinality, size and replication) of the sorted dataset via HSDataCheck.
Validation of the sorted output data via HSValidate.
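The five phases above run strictly in sequence, and their combined elapsed time drives the performance metric. A minimal orchestration sketch (the callables stand in for the real kit programs, which the benchmark driver would invoke as Hadoop jobs):

```python
import time

def run_benchmark(phases):
    """Sketch of one TPCx-HS run: execute the phases in order, time each
    one, and record the end-to-end elapsed time. Any failed phase
    invalidates the run, per the execution rules."""
    timings = {}
    start = time.monotonic()
    for name, phase in phases:
        t0 = time.monotonic()
        if not phase():
            raise RuntimeError(f"phase {name} failed: run is invalid")
        timings[name] = time.monotonic() - t0
    timings["total"] = time.monotonic() - start
    return timings
```

A full benchmark consists of two such runs, with the performance run chosen per the execution rules and no retuning in between.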
No part of the SUT may be rebooted or restarted during or between the runs or any of the phases. If there is an unrecoverable error reported by any of the applications, operating system, or hardware in any of the five phases, the run is considered invalid. If a recoverable error is detected in any of the phases, and is automatically dealt with or corrected by the applications, operating system, or hardware, then the run is considered valid. However, manual user intervention is not allowed. If the recoverable error requires manual intervention to deal with or correct, then the run is considered invalid. A minimum of three-way data replication must be maintained throughout the run.
The SUT cannot be reconfigured, changed, or re-tuned by the user during or between any of the five phases or between Run 1 and Run 2. Any manual tunings to the SUT must be performed before the beginning of Phase 1 of Run 1, and must be fully disclosed. Any automated changes or tuning performed by the OS or commercially available product between any of the phases is allowed. Any changes to default tunings or parameters of the applications, operating systems, or hardware of the SUT must be disclosed as well.
7 Energy Metric and Power Measurement
The energy metric and power measurement in the TPCx-HS benchmark are based on the TPC-Energy Specification, which contains the rules and methodology for measuring and reporting energy metrics. Reporting the energy metric is optional.
During the benchmark test, energy is consumed by each device in the SUT, specifically the compute devices, the data storage devices, and the hardware devices of all networks required to connect and support the SUT systems. As defined in TPCx-HS, if the optional TPC-Energy secondary metrics are reported, the components included in each subsystem must be identified. For each subsystem, the calculations defined for the TPC-Energy secondary metrics must be reported using the Performance Metric of the entire SUT and the energy consumption of each subsystem under report. Power should be measured for the entire system under test.
If the SUT shares power input with other devices that are not in the SUT’s device list, a power measurement subset has to be defined that only includes the SUT devices. The measurement points need to be identified for the SUT, board level or even chip level power measurement might be required. In some cases, the SUT power can be obtained by using the total power minus the power of non-SUT devices.
7.1 Power Measurement Methods
If the current sampling point is at time k and the elapsed time to the next measurement is s, then k + s is the next sampling point and s is the measurement interval. The measurement interval can be defined at different levels according to its length, such as less than 1 s, equal to 1 s, or greater than 1 s. For measurements on AC power, the total-energy integration function of power analyzers can sample the input power multiple times per AC cycle and is therefore much less susceptible to sampling artifacts caused by the AC waveform.
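With power sampled at a fixed interval s, total energy can be approximated as the sum of each sample times the interval, i.e. a rectangle-rule integral. This sketch illustrates the idea only; it is not the exact formula of the TPC-Energy specification, and the function name is an assumption:

```python
def energy_joules(power_samples_w, interval_s):
    """Approximate total energy (J) from power samples (W) taken every
    interval_s seconds: E ~= sum_k P_k * s (rectangle rule)."""
    return sum(power_samples_w) * interval_s
```

Shorter intervals give a finer approximation of the true energy curve, at the cost of more measurement data.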
The power measurement results of each sampling period can be plotted together to obtain a power chart for the benchmark program. The chart shows the power usage against execution time during the benchmark test for each device, as shown in Fig. 2(b).
7.2 Energy Calculation Based on Measurement Results
All real measurements can be done with Eq. (4) by using power analyzers.
7.3 TPCx-HS Energy Metric Report
The TPC has played a crucial role in providing the industry with relevant standards for total system performance, price-performance, and energy efficiency comparisons [10, 11]. TPC benchmarks are widely used by database researchers and academia. Historically known for database-centric standards, the TPC has developed benchmarks for virtualization and data integration as industry demand for those benchmarks emerged.
Now that Big Data has become an integral part of enterprise IT, TPCx-HS is the TPC’s first major step in creating a set of industry standards for measuring various aspects of hardware and software systems dealing with Big Data. Developed as an Express benchmark by extending the workload defined in TeraSort with formal rules for implementation, execution, metrics, result verification, publication and pricing, TPCx-HS is designed to stress both the hardware and software of Apache Hadoop MapReduce and HDFS API compatible distributions. We expect that TPCx-HS will be used by customers when evaluating Big Data systems in terms of performance, price/performance and energy efficiency, and that it will enable healthy competition resulting in product developments and improvements.
There is no inherent scale limitation in the benchmark. Larger datasets can be added (and smaller ones retired) based on industry trends over time.
Developing an industry standard benchmark for a new environment like Big Data has taken the dedicated efforts of experts across many companies. The authors thank the contributions of Andrew Bond (Red Hat), Andrew Masland (NEC), Avik Dey (Intel), Brian Caufield (IBM), Chaitanya Baru (SDSC), Da Qi Ren (Huawei), Dileep Kumar (Cloudera), Jamie Reding (Microsoft), John Fowler (Oracle), John Poelman (IBM), Karthik Kulkarni (Cisco), Meikel Poess (Oracle), Mike Brey (Oracle), Mike Crocker (SAP), Paul Cao (HP), Raghunath Nambiar (Cisco), Reza Taheri (VMware), Simon Harris (IBM), Tariq Magdon-Ismail (VMware), Wayne Smith (Intel), Yanpei Chen (Cloudera), Michael Majdalany (L&M), Forrest Carman (Owen Media) and Andreas Hotea (Hotea Solutions).