A survey of large-scale analytical query processing in MapReduce

  • Regular Paper
The VLDB Journal

Abstract

Enterprises today acquire vast volumes of data from different sources and leverage this information by means of data analysis to support effective decision-making and to provide new functionality and services. The key requirement of data analytics is scalability, simply due to the immense volume of data that needs to be extracted, processed, and analyzed in a timely fashion. Arguably the most popular framework for contemporary large-scale data analytics is MapReduce, mainly due to its salient features, which include scalability, fault-tolerance, ease of programming, and flexibility. However, despite its merits, MapReduce has evident performance limitations in miscellaneous analytical tasks, and this has given rise to a significant body of research that aims at improving its efficiency while maintaining its desirable properties. This survey aims to review the state of the art in improving the performance of parallel query processing using MapReduce. A set of the most significant weaknesses and limitations of MapReduce is discussed at a high level, along with techniques for addressing them. A taxonomy is presented for categorizing existing research on MapReduce improvements according to the specific problem they target. Based on the proposed taxonomy, a classification of existing research is provided, focusing on the optimization objective. Finally, we outline interesting directions for future parallel data processing systems.

Notes

  1. The count is based on articles that appear in the proceedings of ICDE, SIGMOD, VLDB, thus includes research papers, demos, keynotes, and tutorials.

  2. http://zookeeper.apache.org/.

Acknowledgments

We would like to thank the editors and the anonymous reviewers for their very helpful comments that have significantly improved this paper. The research of C. Doulkeridis was supported under the Marie-Curie IEF grant number 274063 with partial support from the Norwegian Research Council.

Author information

Corresponding author

Correspondence to Christos Doulkeridis.

Appendix

See Tables 4 and 5.

Table 4 Modifications induced by existing approaches to MapReduce
Table 5 Overview of join processing in MapReduce

About this article

Cite this article

Doulkeridis, C., Nørvåg, K. A survey of large-scale analytical query processing in MapReduce. The VLDB Journal 23, 355–380 (2014). https://doi.org/10.1007/s00778-013-0319-9

  • DOI: https://doi.org/10.1007/s00778-013-0319-9
