Abstract
Big data challenges are end-to-end problems. Big data usually has to be preprocessed, moved, loaded, processed, and stored many times, which has led to the creation of big data pipelines. Current big data benchmarks focus only on isolated aspects of this pipeline, usually processing, storage, and loading. To date, no benchmark has been presented that covers the end-to-end aspect of big data systems.
In this paper, we discuss the necessity of ETL-like tasks in big data benchmarking and propose the Parallel Data Generation Framework (PDGF) for generating the benchmark data. PDGF is a generic data generator that was implemented at the University of Passau and is currently adopted in TPC benchmarks.
© 2014 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Rabl, T., Jacobsen, HA. (2014). Big Data Generation. In: Rabl, T., Poess, M., Baru, C., Jacobsen, HA. (eds) Specifying Big Data Benchmarks. WBDB 2012. Lecture Notes in Computer Science, vol 8163. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-53974-9_3
DOI: https://doi.org/10.1007/978-3-642-53974-9_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-53973-2
Online ISBN: 978-3-642-53974-9
eBook Packages: Computer Science, Computer Science (R0)