Abstract
Enterprises today acquire vast volumes of data from different sources and leverage this information by means of data analysis to support effective decision-making and provide new functionality and services. The key requirement of data analytics is scalability, simply due to the immense volume of data that need to be extracted, processed, and analyzed in a timely fashion. Arguably the most popular framework for contemporary large-scale data analytics is MapReduce, mainly due to its salient features that include scalability, fault-tolerance, ease of programming, and flexibility. However, despite its merits, MapReduce has evident performance limitations in miscellaneous analytical tasks, and this has given rise to a significant body of research that aim at improving its efficiency, while maintaining its desirable properties. This survey aims to review the state of the art in improving the performance of parallel query processing using MapReduce. A set of the most significant weaknesses and limitations of MapReduce is discussed at a high level, along with solving techniques. A taxonomy is presented for categorizing existing research on MapReduce improvements according to the specific problem they target. Based on the proposed taxonomy, a classification of existing research is provided focusing on the optimization objective. Concluding, we outline interesting directions for future parallel data processing systems.
Similar content being viewed by others
Notes
The count is based on articles that appear in the proceedings of ICDE, SIGMOD, VLDB, thus includes research papers, demos, keynotes, and tutorials.
References
Abadi, D.J.: Data management in the cloud: limitations and opportunities. IEEE Data Eng. Bull. 32(1), 3–12 (2009)
Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D.J., Rasin, A., Silberschatz, A.: HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. Proc. VLDB Endow. (PVLDB) 2(1), 922–933 (2009)
Afrati, F.N., Borkar, V.R., Carey, M.J., Polyzotis, N., Ullman, J.D.: Map-reduce extensions and recursive queries. In: Proceedings of International Conference on Extending Database Technology (EDBT), pp. 1–8 (2011)
Afrati, F.N., Sarma, A.D., Menestrina, D., Parameswaran, A.G., Ullman, J.D.: Fuzzy joins using MapReduce. In: Proceedings of International Conference on Data Engineering (ICDE), pp. 498–509 (2012)
Afrati, F.N., Ullman, J.D.: Optimizing joins in a Map-Reduce environment. In: Proceedings of International Conference on Extending Database Technology (EDBT), pp. 99–110 (2010)
Afrati, F.N., Ullman, J.D.: Optimizing multiway joins in a Map-Reduce environment. IEEE Trans. Knowl. Data Eng. (TKDE) 23(9), 1282–1298 (2011)
Agarwal, S., Iyer, A.P., Panda, A., Madden, S., Mozafari, B., Stoica, I.: Blink and it’s done: interactive queries on very large data. Proc. VLDB Endow. (PVLDB) 5(12), 1902–1905 (2012)
Agarwal, S., Kandula, S., Bruno, N., Wu, M.-C., Stoica, I., Zhou, J.: Re-optimizing data-parallel computing. In: Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), pp. 21:1–21:14 (2012)
Agarwal, S., Panda, A., Mozafari, B., Milner, H., Madden, S., Stoica, I.: BlinkDB: queries with bounded errors and bounded response times on very large data. In: Proceedings of European Conference on Computer Systems (EuroSys) (2013)
Agrawal, D., Das, S., Abbadi, A.E.: Big data and cloud computing: current state and future opportunities. In: Proceedings of International Conference on Extending Database Technology (EDBT), pp. 530–533 (2011)
Agrawal, P., Kifer, D., Olston, C.: Scheduling shared scans of large data files. Proc. VLDB Endow. (PVLDB) 1(1), 958–969 (2008)
Ailamaki, A., DeWitt, D.J., Hill, M.D., Skounakis, M.: Weaving relations for cache performance. In: Proceedings of Very Large Databases (VLDB), pp. 169–180 (2001)
Aiyer, A.S., Bautin, M., Chen, G.J., Damania, P., Khemani, P., Muthukkaruppan, K., Ranganathan, K., Spiegelberg, N., Tang, L., Vaidya, M.: Storage infrastructure behind Facebook Messages: using HBase at scale. IEEE Data Eng. Bull. 35(2), 4–13 (2012)
Ananthanarayanan, G., Ghodsi, A., Wang, A., Borthakur, D., Kandula, S., Shenker, S., Stoica, I.: PACMan: coordinated memory caching for parallel jobs. In: Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), pp. 20:1–20:14 (2012)
Babu, S.: Towards automatic optimization of MapReduce programs. In: ACM Symposium on Cloud Computing (SoCC), pp. 137–142 (2010)
Baraglia, R., Morales, G.D.F., Lucchese, C.: Document similarity self-join with MapReduce. In: IEEE International Conference on Data Mining (ICDM), pp. 731–736 (2010)
Bhatotia, P., Wieder, A., Rodrigues, R., Acar, U.A., Pasquin, R.: Incoop: MapReduce for incremental computations. In: ACM Symposium on Cloud Computing (SoCC), pp. 7:1–7:14 (2011)
Blanas, S., Patel, J.M., Ercegovac, V., Rao, J., Shekita, E.J., Tian, Y.: A comparison of join algorithms for log processing in MapReduce. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), pp. 975–986 (2010)
Borkar, V.R., Carey, M.J., Grover, R., Onose, N., Vernica, R.: Hyracks: a flexible and extensible foundation for data-intensive computing. In: Proceedings of International Conference on Data Engineering (ICDE), pp. 1151–1162 (2011)
Borthakur, D., Gray, J., Sarma, J.S., Muthukkaruppan, K., Spiegelberg, N., Kuang, H., Ranganathan, K., Molkov, D., Menon, A., Rash, S., Schmidt, R., Aiyer, A.S.: Apache Hadoop goes realtime at Facebook. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), pp. 1071–1080 (2011)
Bu, Y., Borkar, V.R., Carey, M.J., Rosen, J., Polyzotis, N., Condie, T., Weimer, M., Ramakrishnan, R.: Scaling datalog for machine learning on Big Data. The Computing Research Repository (CoRR), abs/1203.0160 (2012)
Bu, Y., Howe, B., Balazinska, M., Ernst, M.D.: HaLoop: efficient iterative data processing on large clusters. Proc. VLDB Endow. (PVLDB) 3(1), 285–296 (2010)
Bu, Y., Howe, B., Balazinska, M., Ernst, M.D.: The HaLoop approach to large-scale iterative data analysis. VLDB J. 21(2), 169–190 (2012)
Candan, K.S., Kim, J.W., Nagarkar, P., Nagendra, M., Yu, R.: RanKloud: scalable multimedia data processing in server clusters. IEEE Multimed. 18(1), 64–77 (2011)
Cattell, R.: Scalable SQL and NoSQL data stores. SIGMOD Rec. 39(4), 12–27 (2010)
Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T., Fikes, A., Gruber, R.E.: Bigtable: a distributed storage system for structured data. ACM Trans. Comput. Syst. 26(2), 4:1–4:26 (2008)
Chattopadhyay, B., Lin, L., Liu, W., Mittal, S., Aragonda, P., Lychagina, V., Kwon, Y., Wong, M.: Tenzing a SQL implementation on the MapReduce framework. Proc. VLDB Endow. (PVLDB) 4(12), 1318–1327 (2011)
Chen, S.: Cheetah: a high performance, custom data warehouse on top of MapReduce. Proc. VLDB Endow. (PVLDB) 3(2), 1459–1468 (2010)
Chih Yang, H., Dasdan, A., Hsiao, R.-L., Parker, D.S.: Map-reduce-merge: simplified relational data processing on large clusters. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), pp. 1029–1040 (2007)
Condie, T., Conway, N., Alvaro, P., Hellerstein, J.M., Elmeleegy, K., Sears, R.: MapReduce online. In: Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), pp. 313–328 (2010)
Cooper, B.F., Ramakrishnan, R., Srivastava, U., Silberstein, A., Bohannon, P., Jacobsen, H.-A., Puz, N., Weaver, D., Yerneni, R.: PNUTS: Yahoo!’s hosted data serving platform. Proc. VLDB Endow. (PVLDB) 1(2), 1277–1288 (2008)
Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: Proceedings of USENIX Symposium on Operating Systems Design and Implementation (OSDI) (2004)
Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)
Dean, J., Ghemawat, S.: MapReduce: a flexible data processing tool. Commun. ACM 53(1), 72–77 (2010)
Dittrich, J., Quiané-Ruiz, J.-A.: Efficient Big Data processing in Hadoop MapReduce. Proc. VLDB Endow. (PVLDB) 5(12), 2014–2015 (2012)
Dittrich, J., Quiané-Ruiz, J.-A., Jindal, A., Kargin, Y., Setty, V., Schad, J.: Hadoop++: making a yellow elephant run like a cheetah (without it even noticing). Proc. VLDB Endow. (PVLDB) 3(1), 518–529 (2010)
Dittrich, J., Quiané-Ruiz, J.-A., Richter, S., Schuh, S., Jindal, A., Schad, J.: Only aggressive elephants are fast elephants. Proc. VLDB Endow. (PVLDB) 5(11), 1591–1602 (2012)
Doulkeridis, C., Nørvåg, K.: On saying “enough already!” in MapReduce. In: Proceedings of International Workshop on Cloud Intelligence (Cloud-I), pp. 7:1–7:4 (2012)
Ekanayake, J., Li, H., Zhang, B., Gunarathne, T., Bae, S.-H., Qiu, J., Fox, G.: Twister: a runtime for iterative MapReduce. In: International ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC), pp. 810–818 (2010)
Elghandour, I., Aboulnaga, A.: ReStore: reusing results of MapReduce jobs. Proc. VLDB Endow. (PVLDB) 5(6), 586–597 (2012)
Eltabakh, M.Y., Tian, Y., Özcan, F., Gemulla, R., Krettek, A., McPherson, J.: CoHadoop: flexible data placement and its exploitation in Hadoop. Proc. VLDB Endow. (PVLDB) 4(9), 575–585 (2011)
Engle, C., Lupher, A., Xin, R., Zaharia, M., Franklin, M.J., Shenker, S., Stoica, I.: Shark: fast data analysis using coarse-grained distributed memory. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), pp. 689–692 (2012)
Ewen, S., Tzoumas, K., Kaufmann, M., Markl, V.: Spinning fast iterative data flows. Proc. VLDB Endow. (PVLDB) 5(11), 1268–1279 (2012)
Floratou, A., Patel, J.M., Shekita, E.J., Tata, S.: Column-oriented storage techniques for MapReduce. Proc. VLDB Endow. (PVLDB) 4(7), 419–429 (2011)
George, L.: HBase: The Definitive Guide: Random Access to Your Planet-Size Data. O’Reilly, Ireland (2011)
Goodhope, K., Koshy, J., Kreps, J., Narkhede, N., Park, R., Rao, J., Ye, V.Y.: Building LinkedIn’s real-time activity data pipeline. IEEE Data Eng. Bull. 35(2), 33–45 (2012)
Grover, R., Carey, M.J.: Extending map-reduce for efficient predicate-based sampling. In: Proceedings of International Conference on Data Engineering (ICDE), pp. 486–497 (2012)
Gufler, B., Augsten, N., Reiser, A., Kemper, A.: Load balancing in MapReduce based on scalable cardinality estimates. In: Proceedings of International Conference on Data Engineering (ICDE), pp. 522–533 (2012)
Hall, A., Bachmann, O., Büssow, R., Ganceanu, S., Nunkesser, M.: Processing a trillion cells per mouse click. Proc. VLDB Endow. (PVLDB) 5(11), 1436–1446 (2012)
He, Y., Lee, R., Huai, Y., Shao, Z., Jain, N., Zhang, X., Xu, Z.: RCFile: a fast and space-efficient data placement structure in MapReduce-based warehouse systems. In: Proceedings of International Conference on Data Engineering (ICDE), pp. 1199–1208 (2011)
Herodotou, H., Babu, S.: Profiling, what-if analysis, and cost-based optimization of MapReduce programs. Proc. VLDB Endow. (PVLDB) 4(11), 1111–1122 (2011)
Hueske, F., Peters, M., Sax, M., Rheinländer, A., Bergmann, R., Krettek, A., Tzoumas, K.: Opening the black boxes in data flow optimization. Proc. VLDB End. (PVLDB) 5(11), 1256–1267 (2012)
Isard, M., Budiu, M., Yu, Y., Birrell, A., Fetterly, D.: Dryad: distributed data-parallel programs from sequential building blocks. In: Proceedings of European Conference on Computer Systems (EuroSys), pp. 59–72 (2007)
Iu, M.-Y., Zwaenepoel, W.: HadoopToSQL: a MapReduce query optimizer. In: Proceedings of European Conference on Computer systems (EuroSys), pp. 251–264 (2010)
Jacox, E.H., Samet, H.: Metric space similarity joins. ACM Trans. Database Syst. (TODS) 33(2), 7:1–7:38 (2008)
Jahani, E., Cafarella, M.J., Ré, C.: Automatic optimization for MapReduce programs. Proc. VLDB Endow. (PVLDB) 4(6), 385–396 (2011)
Jiang, D., Ooi, B.C., Shi, L., Wu, S.: The performance of MapReduce: an in-depth study. Proc. VLDB Endow. (PVLDB) 3(1), 472–483 (2010)
Jiang, D., Tung, A.K.H., Chen, G.: MAP-JOIN-REDUCE: toward scalable and efficient data analysis on large clusters. IEEE Trans. Knowl. Data Eng. (TKDE) 23(9), 1299–1311 (2011)
Jindal, A., Quiané-Ruiz, J.-A., Dittrich, J.: Trojan data layouts: right shoes for a running elephant. In: ACM Symposium on Cloud Computing (SoCC), pp. 21:1–21:14 (2011)
Kaldewey, T., Shekita, E.J., Tata, S.: Clydesdale: structured data processing on MapReduce. In: Proceedings of International Conference on Extending Database Technology (EDBT), pp. 15–25 (2012)
Kim, Y., Shim, K.: Parallel top-k similarity join algorithms using MapReduce. In: Proceedings of International Conference on Data Engineering (ICDE), pp. 510–521 (2012)
Kolb, L., Thor, A., Rahm, E.: Load balancing for MapReduce-based entity resolution. In: Proceedings of International Conference on Data Engineering (ICDE), pp. 618–629 (2012)
Kornacker, M., Erickson, J.: Cloudera Impala: real-time queries in Apache Hadoop, for real. http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/
Kwon, Y., Balazinska, M., Howe, B., Rolia, J.A.: Skew-resistant parallel processing of feature-extracting scientific user-defined functions. In: ACM Symposium on Cloud Computing (SoCC), pp. 75–86 (2010)
Kwon, Y., Balazinska, M., Howe, B., Rolia, J.A.: SkewTune: mitigating skew in MapReduce applications. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), pp. 25–36 (2012)
Lam, W., Liu, L., Prasad, S., Rajaraman, A., Vacheri, Z., Doan, A.: Muppet: MapReduce-style processing of fast data. Proc. VLDB Endow. (PVLDB) 5(12), 1814–1825 (2012)
Laptev, N., Zeng, K., Zaniolo, C.: Early accurate results for advanced analytics on MapReduce. Proc. VLDB Endow. (PVLDB) 5(10), 1028–1039 (2012)
Lee, K.-H., Lee, Y.-J., Choi, H., Chung, Y.D., Moon, B.: Parallel data processing with MapReduce: a survey. SIGMOD Rec. 40(4), 11–20 (2011)
Lee, R., Luo, T., Huai, Y., Wang, F., He, Y., Zhang, X.: YSmart: yet another SQL-to-MapReduce translator. In: Proceedings of International Conference on Distributed Computing Systems (ICDCS), pp. 25–36 (2011)
Leibiusky, J., Eisbruch, G., Simonassi, D.: Getting Started with Storm. O’Reilly, Ireland (2012)
Li, B., Mazur, E., Diao, Y., McGregor, A., Shenoy, P.J.: A platform for scalable one-pass analytics using MapReduce. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), pp. 985–996 (2011)
Li, B., Mazur, E., Diao, Y., McGregor, A., Shenoy, P.J.: SCALLA: a platform for scalable one-pass analytics using MapReduce. ACM Trans. Database Syst. (TODS) 37(4), 27:1–27:38 (2012)
Lim, H., Herodotou, H., Babu, S.: Stubby: a transformation-based optimizer for MapReduce workflows. Proc. VLDB Endow. (PVLDB) 5(11), 1196–1207 (2012)
Lin, Y., Agrawal, D., Chen, C., Ooi, B.C., Wu, S.: Llama: leveraging columnar storage for scalable join processing in the MapReduce framework. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), pp. 961–972 (2011)
Logothetis, D., Olston, C., Reed, B., Webb, K.C., Yocum, K.: Stateful bulk processing for incremental analytics. In: ACM Symposium on Cloud Computing (SoCC), pp. 51–62 (2010)
Logothetis, D., Yocum, K.: Ad-hoc data processing in the cloud. Proc. VLDB Endow. (PVLDB) 1(2), 1472–1475 (2008)
Lu, W., Shen, Y., Chen, S., Ooi, B.C.: Efficient processing of k nearest neighbor joins using MapReduce. Proc. VLDB Endow. (PVLDB) 5(10), 1016–1027 (2012)
Malewicz, G., Austern, M.H., Bik, A.J.C., Dehnert, J.C., Horn, I., Leiser, N., Czajkowski, G.: Pregel: a system for large-scale graph processing. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), pp. 135–146 (2010)
McSherry, F., Murray, D.G., Isaacs, R., Isard, M.: Differential dataflow. In: Biennial Conference on Innovative Data Systems Research (CIDR) (2013)
Melnik, S., Gubarev, A., Long, J.J., Romer, G., Shivakumar, S., Tolton, M., Vassilakis, T.: Dremel: interactive analysis of web-scale datasets. Proc. VLDB Endow. (PVLDB) 3(1–2), 330–339 (2010)
Metwally, A., Faloutsos, C.: V-SMART-Join: a scalable MapReduce framework for all-pair similarity joins of multisets and vectors. Proc. VLDB Endow. (PVLDB) 5(8), 704–715 (2012)
Mihaylov, S.R., Ives, Z.G., Guha, S.: REX: recursive, delta-based data-centric computation. Proc. VLDB Endow. (PVLDB) 5(11), 1280–1291 (2012)
Nykiel, T., Potamias, M., Mishra, C., Kollios, G., Koudas, N.: MRShare: sharing across multiple queries in MapReduce. Proc. VLDB Endow. (PVLDB) 3(1), 494–505 (2010)
Okcan, A., Riedewald, M.: Processing theta-joins using MapReduce. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), pp. 949–960 (2011)
Olston, C., Chiou, G., Chitnis, L., Liu, F., Han, Y., Larsson, M., Neumann, A., Rao, V.B.N., Sankarasubramanian, V., Seth, S., Tian, C., ZiCornell, T., Wang, X.: Nova: continuous Pig/Hadoop workflows. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), pp. 1081–1090 (2011)
Olston, C., Reed, B., Srivastava, U., Kumar, R., Tomkins, A.: Pig Latin: a not-so-foreign language for data processing. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), pp. 1099–1110 (2008)
Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S., Stonebraker, M.: A comparison of approaches to large-scale data analysis. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), pp. 165–178 (2009)
Ramakrishnan, S.R., Swart, G., Urmanov, A.: Balancing reducer skew in MapReduce workloads using progressive sampling. In: ACM Symposium on Cloud Computing (SoCC), pp. 16:1–16:13 (2012)
Rao, S., Ramakrishnan, R., Silberstein, A., Ovsiannikov, M., Reeves, D.: Sailfish: a framework for large scale data processing. In: ACM Symposium on Cloud Computing (SoCC), pp. 4:1–4:13 (2012)
Rasmussen, A., Lam, V.T., Conley, M., Porter, G., Kapoor, R., Vahdat, A.: Themis: an I/O efficient MapReduce. In: ACM Symposium on Cloud Computing (SoCC), pp. 13:1–13:14 (2012)
Sakr, S., Liu, A., Batista, D.M., Alomari, M.: A survey of large scale data management approaches in cloud environments. IEEE Commun. Surv. Tutor. 13(3), 311–336 (2011)
Schindler, J.: I/O characteristics of NoSQL databases. Proc. VLDB Endow. (PVLDB) 5(12), 2020–2021 (2012)
Shim, K.: MapReduce algorithms for Big Data analysis. Proc. VLDB Endow. (PVLDB) 5(12), 2016–2017 (2012)
Shinnar, A., Cunningham, D., Herta, B., Saraswat, V.A.: M3R: increased performance for in-memory Hadoop jobs. Proc. VLDB End. (PVLDB) 5(12), 1736–1747 (2012)
Silva, Y.N., Larson, P.-A., Zhou, J.: Exploiting common subexpressions for cloud query processing. In: Proceedings of International Conference on Data Engineering (ICDE), pp. 1337–1348 (2012)
Silva, Y.N., Reed, J.M.: Exploiting MapReduce-based similarity joins. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), pp. 693–696 (2012)
Silva, Y.N., Reed, J.M., Tsosie, L.M.: MapReduce-based similarity join for metric spaces. In: Proceedings of International Workshop on Cloud Intelligence (Cloud-I), pp. 3:1–3:8 (2012)
Stonebraker, M., Abadi, D.J., DeWitt, D.J., Madden, S., Paulson, E., Pavlo, A., Rasin, A.: MapReduce and parallel DBMSs: friends or foes? Commun. ACM 53(1), 64–71 (2010)
Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P., Murthy, R.: Hive—a warehousing solution over a Map-Reduce framework. Proc. VLDB Endow. (PVLDB) 2(2), 1626–1629 (2009)
Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Zhang, N., Anthony, S., Liu, H., Murthy, R.: Hive—a petabyte scale data warehouse using Hadoop. In: Proceedings of International Conference on Data Engineering (ICDE), pp. 996–1005 (2010)
Vernica, R., Balmin, A., Beyer, K.S., Ercegovac, V.: Adaptive MapReduce using situation-aware mappers. In: Proceedings of International Conference on Extending Database Technology (EDBT), pp. 420–431 (2012)
Vernica, R., Carey, M.J., Li, C.: Efficient parallel set-similarity joins using MapReduce. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), pp. 495–506 (2010)
Vlachou, A., Doulkeridis, C., Kotidis, Y.: Angle-based space partitioning for efficient parallel skyline computation. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), pp. 227–238 (2008)
White, T.: Hadoop—The Definitive Guide: Storage and Analysis at Internet Scale (3. ed., revised and updated). O’Reilly, Ireland (2012)
Wu, S., Li, F., Mehrotra, S., Ooi, B.C.: Query optimization for massively parallel data processing. In: ACM Symposium on Cloud Computing (SoCC), pp. 12:1–12:13 (2011)
Xiao, C., Wang, W., Lin, X., Yu, J.X.: Efficient similarity joins for near duplicate detection. In: International World Wide Web Conferences (WWW), pp. 131–140 (2008)
Xin, R., Rosen, J., Zaharia, M., Franklin, M.J., Shenker, S., Stoica, I.: Shark: SQL and rich analytics at scale. The Computing Research Repository (CoRR), abs/1211.6176 (2012)
Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M., Shenker, S., Stoica, I.: Fast and interactive analytics over Hadoop data with Spark. USENIX; login 37(4), 45–51 (2012)
Zhang, J., Zhou, H., Chen, R., Fan, X., Guo, Z., Lin, H., Li, J.Y., Lin, W., Zhou, J., Zhou, L.: Optimizing data shuffling in data-parallel computation by understanding user-defined functions. In: Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), pp. 22:1–22:14 (2012)
Zhang, X., Chen, L., Wang, M.: Efficient multiway theta-join processing using MapReduce. Proc. VLDB Endow. (PVLDB) 5(11), 1184–1195 (2012)
Zhang, Y., Gao, Q., Gao, L., Wang, C.: PrIter: a distributed framework for prioritized iterative computations. In: ACM Symposium on Cloud Computing (SoCC), pp. 13:1–13:14 (2011)
Zhou, J., Bruno, N., Wu, M.-C., Larson, P.-Å., Chaiken, R., Shakib, D.: SCOPE: parallel databases meet MapReduce. VLDB J. 21(5), 611–636 (2012)
Acknowledgments
We would like to thank the editors and the anonymous reviewers for their very helpful comments that have significantly improved this paper. The research of C. Doulkeridis was supported under the Marie-Curie IEF grant number 274063 with partial support from the Norwegian Research Council.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Doulkeridis, C., Nørvåg, K. A survey of large-scale analytical query processing in MapReduce. The VLDB Journal 23, 355–380 (2014). https://doi.org/10.1007/s00778-013-0319-9
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00778-013-0319-9