Abstract
Since the introduction of Apache YARN, which modularly separated resource management and scheduling from the distributed programming frameworks, a multitude of YARN-native computation frameworks have been developed. These frameworks specialize in specific analytics variants. In addition to traditional batch-oriented computations (e.g. MapReduce, Apache Hive [14], and Apache Pig [18]), the Apache Hadoop ecosystem now contains streaming analytics frameworks (e.g. Apache Apex [8]), MPP SQL engines (e.g. Apache Trafodion [20], Apache Impala [15], and Apache HAWQ [12]), OLAP cubing frameworks (e.g. Apache Kylin [17]), frameworks suitable for iterative machine learning (e.g. Apache Spark [19] and Apache Flink [10]), and graph processing frameworks (e.g. GraphX). With the emergence of the Hadoop Distributed File System and its various implementations as the preferred method of constructing a data lake, end-to-end data pipelines are increasingly being built on the Hadoop-based data lake platform.
While benchmarks have been developed for individual tasks, such as sort (TPCx-HS [5]) and analytical SQL queries (TPCx-BB [6]), there is a need for a standard benchmark that exercises the various phases of an end-to-end data pipeline in a data lake. In this paper, we propose a benchmark called AdBench, which combines ad serving, streaming analytics on ad-serving logs, streaming ingestion and updates of various data entities, batch-oriented analytics (e.g. for billing), ad hoc analytical queries, and machine learning for ad targeting. While this benchmark is specific to modern web or mobile advertising companies and exchanges, its workload characteristics are found in many verticals, such as Internet of Things (IoT), financial services, retail, and healthcare. We also propose a set of metrics to be measured for each phase of the pipeline, as well as various scale factors for the benchmark.
Notes
- 1. Referred to in the industry as a “Data Lake”.
- 2. We distinguish users, who browse through Acme’s website, from customers, who publish advertisements on that website.
References
Baru, C., et al.: Discussion of BigBench: a proposed industry standard performance benchmark for big data. In: Nambiar, R., Poess, M. (eds.) TPCTC 2014. LNCS, vol. 8904, pp. 44–63. Springer, Heidelberg (2014)
Baru, C., Bhandarkar, M., Nambiar, R., Poess, M., Rabl, T.: Benchmarking big data systems and the bigdata top 100 list. Big Data 1(1), 60–64 (2013)
Cask Data, Inc., Cask Data Application Platform (CDAP), June 2016
Standard Performance Evaluation Corporation. SPEC Website, June 2016
Transaction Processing Performance Council. TPC Express Benchmark HS, Standard Specification, Version 1.4.0, April 2016
Transaction Processing Performance Council. TPC Express Big Bench, Standard Specification, Version 1.1.0, May 2016
Transaction Processing Performance Council. TPC Website, June 2016
Apache Software Foundation. Apache Apex, June 2016
Apache Software Foundation. Apache Cassandra, June 2016
Apache Software Foundation. Apache Flink, June 2016
Apache Software Foundation. Apache Hadoop, June 2016
Apache Software Foundation. Apache HAWQ (incubating), June 2016
Apache Software Foundation. Apache HBase, June 2016
Apache Software Foundation. Apache Hive, June 2016
Apache Software Foundation. Apache Impala, June 2016
Apache Software Foundation. Apache Kafka, June 2016
Apache Software Foundation. Apache Kylin, June 2016
Apache Software Foundation. Apache Pig, June 2016
Apache Software Foundation. Apache Spark, June 2016
Apache Software Foundation. Apache Trafodion (incubating), June 2016
Ghazal, A., Rabl, T., Hu, M., Raab, F., Poess, M., Crolotte, A., Jacobsen, H.-A.: BigBench: towards an industry standard benchmark for big data analytics. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, pp. 1197–1208. ACM, New York (2013)
Huppler, K., Johnson, D.: TPC express – a new path for TPC benchmarks. In: Nambiar, R., Poess, M. (eds.) TPCTC 2013. LNCS, vol. 8391, pp. 48–60. Springer, Heidelberg (2014). doi:10.1007/978-3-319-04936-6_4
MongoDB, Inc., MongoDB, June 2016
Rabl, T., Poess, M., Baru, C., Jacobsen, H.-A. (eds.): WBDB 2012. LNCS, vol. 8163. Springer, Heidelberg (2013)
Rabl, T., Jacobsen, H.-A., Raghunath, N., Poess, M., Bhandarkar, M., Baru, C. (eds.): WBDB 2013. LNCS, vol. 8585. Springer, Heidelberg (2014)
Rabl, T., Sachs, K., Poess, M., Baru, C., Jacobsen, H.-A. (eds.): WBDB 2014. LNCS, vol. 8991. Springer, Heidelberg (2015)
Yahoo Storm Engineering Team. Benchmarking Streaming Computation Engines at Yahoo! December 2015
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Bhandarkar, M. (2017). AdBench: A Complete Benchmark for Modern Data Pipelines. In: Nambiar, R., Poess, M. (eds) Performance Evaluation and Benchmarking. Traditional - Big Data - Internet of Things. TPCTC 2016. Lecture Notes in Computer Science, vol 10080. Springer, Cham. https://doi.org/10.1007/978-3-319-54334-5_8
DOI: https://doi.org/10.1007/978-3-319-54334-5_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-54333-8
Online ISBN: 978-3-319-54334-5
eBook Packages: Computer Science, Computer Science (R0)