A survey of large-scale analytical query processing in MapReduce

Doulkeridis, Christos; Nørvåg, Kjetil

doi:10.1007/s00778-013-0319-9

Regular Paper
Published: 08 June 2013

Volume 23, pages 355–380 (2014)
The VLDB Journal

Christos Doulkeridis¹ &
Kjetil Nørvåg²

Abstract

Enterprises today acquire vast volumes of data from different sources and leverage this information by means of data analysis to support effective decision-making and provide new functionality and services. The key requirement of data analytics is scalability, simply due to the immense volume of data that need to be extracted, processed, and analyzed in a timely fashion. Arguably the most popular framework for contemporary large-scale data analytics is MapReduce, mainly due to its salient features that include scalability, fault-tolerance, ease of programming, and flexibility. However, despite its merits, MapReduce has evident performance limitations in miscellaneous analytical tasks, and this has given rise to a significant body of research that aims at improving its efficiency while maintaining its desirable properties. This survey aims to review the state of the art in improving the performance of parallel query processing using MapReduce. A set of the most significant weaknesses and limitations of MapReduce is discussed at a high level, along with techniques for addressing them. A taxonomy is presented for categorizing existing research on MapReduce improvements according to the specific problem they target. Based on the proposed taxonomy, a classification of existing research is provided focusing on the optimization objective. In conclusion, we outline interesting directions for future parallel data processing systems.

Notes

The count is based on articles that appear in the proceedings of ICDE, SIGMOD, and VLDB, and thus includes research papers, demos, keynotes, and tutorials.
http://zookeeper.apache.org/.

Acknowledgments

We would like to thank the editors and the anonymous reviewers for their very helpful comments that have significantly improved this paper. The research of C. Doulkeridis was supported under the Marie-Curie IEF grant number 274063 with partial support from the Norwegian Research Council.

Author information

Authors and Affiliations

Department of Digital Systems, University of Piraeus, 18534 Piraeus, Greece
Christos Doulkeridis
Department of Computer and Information Science, Norwegian University of Science and Technology, 7491 Trondheim, Norway
Kjetil Nørvåg

Corresponding author

Correspondence to Christos Doulkeridis.

Appendix

See Tables 4 and 5.

Table 4 Modifications induced by existing approaches to MapReduce

Table 5 Overview of join processing in MapReduce

About this article

Cite this article

Doulkeridis, C., Nørvåg, K. A survey of large-scale analytical query processing in MapReduce. The VLDB Journal 23, 355–380 (2014). https://doi.org/10.1007/s00778-013-0319-9

Received: 28 September 2012
Revised: 28 February 2013
Accepted: 13 May 2013
Published: 08 June 2013
Issue Date: June 2014
DOI: https://doi.org/10.1007/s00778-013-0319-9
