The continuous increase in computational capacity over the past years has produced an overwhelming flow of data, known as big data, which exceeds the capabilities of conventional processing tools. Big data signifies a new era in data exploration and utilization. The MapReduce computational paradigm is a major enabler underlying numerous big data platforms, and it is a popular tool for the distributed and scalable processing of big data. It is increasingly being used in different applications, primarily because of its scalability, fault tolerance, ease of programming, and flexibility. Thus, a bibliometric analysis and review was conducted to evaluate the trend of MapReduce research publications indexed in Scopus from 2006 to 2015. This trend includes the use of the MapReduce framework for big data processing and its development. The study analyzed the distribution of published articles, countries, authors, keywords, and authorship patterns. For data visualization, the VOSviewer program was used to produce distance- and graph-based maps. The top 10 most cited articles were also identified based on citation counts. The study used productivity measures, domain visualization techniques, and co-word analysis to explore papers related to MapReduce in the field of big data. Moreover, the study discussed the most influential articles that contributed to improvements in MapReduce and reviewed the corresponding solutions. Finally, it presented several open challenges in big data processing with MapReduce as future research directions.
Keywords: Big data, MapReduce, Hadoop, Bibliometric
This paper is financially supported by the Malaysian Ministry of Education under the University of Malaya High Impact Research Grant UM.C/625/1/HIR/MoE/FCSIT/03