The VLDB Journal, Volume 23, Issue 3, pp 355–380

A survey of large-scale analytical query processing in MapReduce

Regular Paper

Abstract

Enterprises today acquire vast volumes of data from different sources and leverage this information by means of data analysis to support effective decision-making and provide new functionality and services. The key requirement of data analytics is scalability, simply due to the immense volume of data that need to be extracted, processed, and analyzed in a timely fashion. Arguably the most popular framework for contemporary large-scale data analytics is MapReduce, mainly due to its salient features that include scalability, fault-tolerance, ease of programming, and flexibility. However, despite its merits, MapReduce has evident performance limitations in various analytical tasks, and this has given rise to a significant body of research that aims at improving its efficiency while maintaining its desirable properties. This survey reviews the state of the art in improving the performance of parallel query processing using MapReduce. The most significant weaknesses and limitations of MapReduce are discussed at a high level, along with techniques for addressing them. A taxonomy is presented for categorizing existing research on MapReduce improvements according to the specific problem each work targets. Based on the proposed taxonomy, a classification of existing research is provided, focusing on the optimization objective. Finally, we outline interesting directions for future parallel data processing systems.

Keywords

MapReduce, Survey, Data analysis, Query processing, Large-scale, Big Data

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  1. Department of Digital Systems, University of Piraeus, Piraeus, Greece
  2. Department of Computer and Information Science, Norwegian University of Science and Technology, Trondheim, Norway
