Scientometrics, Volume 109, Issue 1, pp 389–422

MapReduce: Review and open challenges

  • Ibrahim Abaker Targio Hashem
  • Nor Badrul Anuar
  • Abdullah Gani
  • Ibrar Yaqoob
  • Feng Xia
  • Samee Ullah Khan

Abstract

The continuous increase in computational capacity over the past years has produced an overwhelming flow of data, or big data, which exceeds the capabilities of conventional processing tools. Big data signify a new era in data exploration and utilization. The MapReduce computational paradigm is a major enabler underlying numerous big data platforms. MapReduce is a popular tool for the distributed and scalable processing of big data. It is increasingly being used in different applications, primarily because of its important features, including scalability, fault tolerance, ease of programming, and flexibility. Thus, a bibliometric analysis and review was conducted to evaluate the trend of MapReduce research assessment publications indexed in Scopus from 2006 to 2015. This trend includes the use of the MapReduce framework for big data processing and its development. The study analyzed the distribution of published articles, countries, authors, keywords, and authorship patterns. For data visualization, the VOSviewer program was used to produce distance- and graph-based maps. The top 10 most cited articles were also identified based on the citation count of publications. The study utilized productivity measures, domain visualization techniques, and co-word analysis to explore papers related to MapReduce in the field of big data. Moreover, the study discussed the most influential articles that contributed to improvements in MapReduce and reviewed the corresponding solutions. Finally, it presented several open challenges in big data processing with MapReduce as future research directions.

Keywords

Big data, MapReduce, Hadoop, Bibliometric

Notes

Acknowledgments

This paper is financially supported by the Malaysian Ministry of Education under the University of Malaya High Impact Research Grant UM.C/625/1/HIR/MoE/FCSIT/03.

Copyright information

© Akadémiai Kiadó, Budapest, Hungary 2016

Authors and Affiliations

  • Ibrahim Abaker Targio Hashem (1)
  • Nor Badrul Anuar (1)
  • Abdullah Gani (1)
  • Ibrar Yaqoob (1)
  • Feng Xia (2)
  • Samee Ullah Khan (3)

  1. Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, Malaysia
  2. School of Software, Dalian University of Technology, Dalian, China
  3. NDSU-CIIT Green Computing and Communications, North Dakota State University, Fargo, USA
