BDGS: A Scalable Big Data Generator Suite in Big Data Benchmarking

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8585)

Abstract

Data generation is a key issue in big data benchmarking: it aims to generate application-specific data sets that meet the 4V requirements of big data. Specifically, big data generators need to generate scalable data (Volume) of different types (Variety) under controllable generation rates (Velocity) while keeping the important characteristics of raw data (Veracity). This gives rise to new challenges in designing generators efficiently and successfully. To date, most existing techniques can generate only limited types of data and support only specific big data systems such as Hadoop. Hence we develop a tool, called Big Data Generator Suite (BDGS), to efficiently generate scalable big data while employing data models derived from real data to preserve data veracity. The effectiveness of BDGS is demonstrated by developing six data generators covering three representative data types (structured, semi-structured and unstructured) and three data sources (text, graph, and table data).
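As a rough illustration of the idea of generating scalable data from a model derived from real data, the following Python sketch fits a simple word-frequency model to a small seed text and then samples as many words as requested. It is not the BDGS implementation: the unigram model, the function names `fit_unigram_model` and `generate_text`, and the sample sentence are all assumptions chosen for brevity.

```python
import random
from collections import Counter


def fit_unigram_model(seed_text):
    """Estimate word frequencies from a small real sample (the 'data model').

    Hypothetical helper: a unigram model is a stand-in for a richer model.
    """
    counts = Counter(seed_text.split())
    words = sorted(counts)  # sorted so results are reproducible
    weights = [counts[w] for w in words]
    return words, weights


def generate_text(model, n_words, seed=0):
    """Draw n_words words from the fitted distribution.

    n_words controls the volume; the fitted weights carry over the
    word-frequency characteristics of the seed data.
    """
    words, weights = model
    rng = random.Random(seed)
    return rng.choices(words, weights=weights, k=n_words)


if __name__ == "__main__":
    sample = "big data benchmarks need big data sets and data generators"
    model = fit_unigram_model(sample)
    synthetic = generate_text(model, n_words=1000)
    print(len(synthetic), Counter(synthetic).most_common(3))
```

Increasing `n_words` scales the output without changing the fitted distribution, which is the separation between volume and veracity that the abstract describes; controlling velocity would additionally require pacing or parallelizing the sampling, which this sketch does not attempt.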

Author information

Corresponding author

Correspondence to Jianfeng Zhan.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Ming, Z. et al. (2014). BDGS: A Scalable Big Data Generator Suite in Big Data Benchmarking. In: Rabl, T., Raghunath, N., Poess, M., Bhandarkar, M., Jacobsen, HA., Baru, C. (eds) Advancing Big Data Benchmarks. WBDB 2013. Lecture Notes in Computer Science, vol 8585. Springer, Cham. https://doi.org/10.1007/978-3-319-10596-3_11

  • DOI: https://doi.org/10.1007/978-3-319-10596-3_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-10595-6

  • Online ISBN: 978-3-319-10596-3

  • eBook Packages: Computer Science, Computer Science (R0)