Big Data Generation

  • Conference paper
Specifying Big Data Benchmarks (WBDB 2012)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8163)

Abstract

Big data challenges are end-to-end problems. Big data usually has to be preprocessed, moved, loaded, processed, and stored many times, which has led to the creation of big data pipelines. Current big data benchmarks focus only on isolated aspects of this pipeline, usually processing, storage, and loading. To date, no benchmark has been presented that covers the end-to-end aspect of big data systems.

In this paper, we discuss the necessity of ETL-like tasks in big data benchmarking and propose the Parallel Data Generation Framework (PDGF) for data generation in such benchmarks. PDGF is a generic data generator that was implemented at the University of Passau and is currently being adopted in TPC benchmarks.
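The abstract does not describe how PDGF works internally, so the following is only a minimal Python sketch of one general idea behind parallel data generators: each value is derived from a repeatable seed, so independent workers can produce disjoint row ranges without coordinating. The seeding scheme, the table name `orders`, the column names, and all function names are hypothetical illustrations, not PDGF's API or algorithm.

```python
import hashlib
import random


def cell_seed(master_seed: int, table: str, column: str, row: int) -> int:
    """Derive a repeatable seed for one cell from its coordinates.

    Hypothetical scheme for illustration only; not taken from PDGF.
    """
    key = f"{master_seed}/{table}/{column}/{row}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")


def generate_row(master_seed: int, table: str, row: int) -> dict:
    """Generate one row; every column uses its own independently seeded RNG."""
    rng_id = random.Random(cell_seed(master_seed, table, "customer_id", row))
    rng_amount = random.Random(cell_seed(master_seed, table, "amount", row))
    return {
        "row": row,
        "customer_id": rng_id.randrange(1, 1001),
        "amount": round(rng_amount.uniform(1.0, 500.0), 2),
    }


def generate_partition(master_seed: int, table: str, start: int, stop: int) -> list:
    """Generate rows in [start, stop); any worker can call this for its own range."""
    return [generate_row(master_seed, table, r) for r in range(start, stop)]


if __name__ == "__main__":
    # Two "workers" generating halves produce the same data as a single pass.
    whole = generate_partition(42, "orders", 0, 10)
    halves = generate_partition(42, "orders", 0, 5) + generate_partition(42, "orders", 5, 10)
    print(whole == halves)
```

Because a row's content depends only on the master seed and the row's coordinates, the same data set can be regenerated on any number of workers, which is the property an end-to-end benchmark needs in order to be repeatable.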

Copyright information

© 2014 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Rabl, T., Jacobsen, HA. (2014). Big Data Generation. In: Rabl, T., Poess, M., Baru, C., Jacobsen, HA. (eds) Specifying Big Data Benchmarks. WBDB 2012. Lecture Notes in Computer Science, vol 8163. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-53974-9_3

  • DOI: https://doi.org/10.1007/978-3-642-53974-9_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-53973-2

  • Online ISBN: 978-3-642-53974-9

  • eBook Packages: Computer Science, Computer Science (R0)
