WBDB 2013: Advancing Big Data Benchmarks, pp 3-18

A BigBench Implementation in the Hadoop Ecosystem

  • Badrul Chowdhury
  • Tilmann Rabl
  • Pooya Saadatpanah
  • Jiang Du
  • Hans-Arno Jacobsen
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8585)

Abstract

BigBench is the first proposal for an end-to-end big data analytics benchmark. It features a rich query set with complex, realistic queries. BigBench was developed based on the decision support benchmark TPC-DS. The first proof-of-concept implementation was built for the Teradata Aster parallel database system, and the queries were formulated in the proprietary SQL-MR query language. To test other systems, the queries have to be translated.

In this paper, an alternative implementation of BigBench for the Hadoop ecosystem is presented. All 30 queries of BigBench were realized using Apache Hive, Apache Hadoop, Apache Mahout, and NLTK. We present the design choices we made and show a proof-of-concept evaluation.

Keywords

Natural Language Processing, Sentiment Analysis, Execution Engine, Hadoop Distributed File System, Hive Version
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Badrul Chowdhury (1)
  • Tilmann Rabl (1)
  • Pooya Saadatpanah (2)
  • Jiang Du (2)
  • Hans-Arno Jacobsen (1)

  1. Middleware Systems Research Group, University of Toronto, Toronto, Canada
  2. Database Research Group, University of Toronto, Toronto, Canada