Comparing High Level MapReduce Query Languages

  • Robert J. Stewart
  • Phil W. Trinder
  • Hans-Wolfgang Loidl
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6965)

Abstract

The MapReduce parallel computational model is of increasing importance. A number of High Level Query Languages (HLQLs) have been constructed on top of the Hadoop MapReduce realization, primarily Pig, Hive, and JAQL. This paper makes a systematic performance comparison of these three HLQLs, focusing on scale up, scale out, and runtime metrics. We further make a language comparison of the HLQLs, focusing on conciseness and computational power. The HLQL development communities are engaged in the study, which has revealed the technical bottlenecks and limitations described in this paper and is influencing the development of the languages.

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Robert J. Stewart (1)
  • Phil W. Trinder (1)
  • Hans-Wolfgang Loidl (1)
  1. Mathematical and Computer Sciences, Heriot Watt University, UK
