, Volume 13, Issue 1, pp 5–15 | Cite as

Compilation of Query Languages into MapReduce

  • Caetano Sauer
  • Theo HärderEmail author


The introduction of MapReduce as a tool for Big Data Analytics, combined with the new requirements of emerging application scenarios such as the Web 2.0 and scientific computing, has motivated the development of data processing languages which are more flexible and widely applicable than SQL. Based on the Big Data context, we discuss the points in which SQL is considered too restrictive. Furthermore, we provide a qualitative evaluation of how recent query languages overcome these restrictions. Having established the desired characteristics of a query language, we provide an abstract description of the compilation into the MapReduce programming model, which, up to minor variations, is essentially the same in all approaches. Given the requirements of query processing, we introduce simple generalizations of the model, which allow the reuse of well-established query evaluation techniques, and discuss strategies to generate optimized MapReduce plans.


Query Processing Query Language Task Function Query Optimization Query Plan 
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.


  1. 1.
    Afrati FN, Ullman JD (2011) Optimizing multiway joins in a map-reduce environment. IEEE Trans Knowl Data Eng 23(9):1282–1298 CrossRefGoogle Scholar
  2. 2.
    Bächle S (2012) Separating key concerns in query processing—set orientation, physical data independence, and parallelism. PhD thesis, University of Kaiserslautern, Germany Google Scholar
  3. 3.
    Battré D, Ewen S, Hueske F, Kao O, Markl V, Warneke D (2010) Nephele/PACTs: a programming model and execution framework for web-scale analytical processing. In: SoCC, pp 119–130 CrossRefGoogle Scholar
  4. 4.
    Beyer KS, Ercegovac V, Gemulla R, Balmin A, Eltabakh MY, Kanne CC, Özcan F, Shekita EJ (2011) Jaql: a scripting language for large-scale semistructured data analysis. Proc VLDB Endow 4(12):1272–1283 Google Scholar
  5. 5.
    Dean J, Ghemawat S (2010) MapReduce: a flexible data processing tool. Commun ACM 53(1):72–77 CrossRefGoogle Scholar
  6. 6.
    Dittrich J, Quiané-Ruiz JA, Jindal A, Kargin Y, Setty V, Schad J (2010) Hadoop++: making a yellow elephant run like a cheetah (without it even noticing). Proc VLDB Endow 3(1):518–529 Google Scholar
  7. 7.
    Gates A, Natkovich O, Chopra S, Kamath P, Narayanam S, Olston C, Reed B, Srinivasan S, Srivastava U (2009) Building a high-level dataflow system on top of MapReduce: the pig experience. Proc VLDB Endow 2(2):1414–1425 Google Scholar
  8. 8.
    Graefe G (1993) Query evaluation techniques for large databases. ACM Comput Surv 25(2):73–170 CrossRefGoogle Scholar
  9. 9.
    Herodotou H, Babu S (2011) Profiling, what-if analysis, and cost-based optimization of MapReduce programs. Proc VLDB Endow 4(11):1111–1122 Google Scholar
  10. 10.
    Hueske F, Peters M, Sax M, Rheinländer A, Bergmann R, Krettek A, Tzoumas K (2012) Opening the black boxes in data flow optimization. Proc VLDB Endow 5(11):1256–1267 Google Scholar
  11. 11.
    Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. In: EuroSys, pp 59–72 CrossRefGoogle Scholar
  12. 12.
    Jahani E, Cafarella MJ, Ré C (2011) Automatic optimization for MapReduce programs. Proc VLDB Endow 4(6):385–396 Google Scholar
  13. 13.
    Lämmel R (2008) Google’s MapReduce programming mModel—revisited. Sci Comput Program 70(1):1–30 zbMATHCrossRefGoogle Scholar
  14. 14.
    Okcan A, Riedewald M (2011) Processing theta-joins using MapReduce. In: SIGMOD conference, pp 949–960 Google Scholar
  15. 15.
    Olston C, Reed B, Srivastava U, Kumar R, Tomkins A (2008) Pig Latin: a not-so-foreign language for data processing. In: SIGMOD conference, pp 1099–1110 CrossRefGoogle Scholar
  16. 16.
    Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: parallel analysis with Sawzall. Sci Program 13(4):277–298 Google Scholar
  17. 17.
    Sauer C, Bächle S, Härder T (2012) Versatile query processing in the MapReduce framework based on XQuery (submitted) Google Scholar
  18. 18.
    Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Zhang N, Anthony S, Liu H, Murthy R (2010) Hive—a petabyte scale data warehouse using Hadoop. In: ICDE conference, pp 996–1005 Google Scholar
  19. 19.
    W3C (2011) XQuery 3.0: an XML query language.
  20. 20.
    W3C (2011) XQuery and XPath data model 3.0.
  21. 21.
    White T (2011) Hadoop—the definitive guide: storage and analysis at Internet scale, 2nd edn. O’Reilly, Sebastopol Google Scholar
  22. 22.
    Zhang X, Chen L, Wang M (2012) Efficient multi-way theta-join processing using MapReduce. Proc VLDB Endow 5(11):1184–1195 Google Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  1. 1.University of KaiserslauternKaiserslauternGermany

Personalised recommendations