Encyclopedia of Database Systems

2018 Edition
Editors: Ling Liu, M. Tamer Özsu

MapReduce

Reference work entry
DOI: https://doi.org/10.1007/978-1-4614-8265-9_80802

Scientific Fundamentals

MapReduce refers to both a programming model and the corresponding distributed framework. The model is composed of two phases, map and reduce, which manipulate data formatted as key-value pairs. The map phase splits the input and sorts the resulting pairs by key, whereas the reduce phase applies a user-defined function to the data that share the same key. In this way, MapReduce is a typical divide-and-conquer framework designed to handle embarrassingly parallel problems, namely problems that can be split into sub-tasks with little or no synchronization cost.
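The two phases described above can be illustrated with a minimal, single-process sketch in Python. It is not the API of any real MapReduce engine: the names run_mapreduce, map_words, and reduce_counts, as well as the word-count task and the sample documents, are assumptions introduced only for illustration. A real framework would run the map and reduce calls in parallel across the nodes of a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_words(doc_id, text):
    """Hypothetical user-defined map function: emit (word, 1) for every word."""
    for word in text.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Hypothetical user-defined reduce function: sum the values of one key."""
    return word, sum(counts)


def run_mapreduce(records, map_fn, reduce_fn):
    """Toy sequential driver (an assumption, not a real engine)."""
    # Map phase: every input record is processed independently of the others,
    # which is what makes the job embarrassingly parallel.
    intermediate = chain.from_iterable(map_fn(k, v) for k, v in records)

    # Grouping step: collect all intermediate values that share a key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: visit keys in sorted order and apply the user-defined
    # function to the values of each key.
    return [reduce_fn(key, groups[key]) for key in sorted(groups)]


docs = [("d1", "map reduce map"), ("d2", "reduce data")]
result = run_mapreduce(docs, map_words, reduce_counts)
print(result)  # [('data', 1), ('map', 2), ('reduce', 2)]
```

Because each call to the map function touches only one record and each call to the reduce function touches only one key, neither phase needs coordination between its calls. The only synchronization point in this sketch is the grouping step between the two phases.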

Definition

MapReduce is a programming framework that allows users to process large-scale data by leveraging the parallelism of a cluster of nodes. The term also refers to the distributed engine that splits and disseminates users’ jobs and monitors their processing in the cluster. MapReduce is a typical divide-and-conquer framework, since it transforms the user code into an embarrassingly parallel job, where little or no effort...

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Zhejiang University, Hangzhou, Zhejiang, People’s Republic of China

Section editors and affiliations

  • Ling Liu (1)
  • M. Tamer Özsu (2)
  1. College of Computing, Georgia Institute of Technology, Atlanta, USA
  2. Cheriton School of Computer Science, University of Waterloo, Waterloo, Canada