International Journal of Parallel Programming, Volume 45, Issue 2, pp 362–381

Functional Models of Hadoop MapReduce with Application to Scan

  • Kiminori Matsuzaki


MapReduce, first proposed by Google, is a programming model for processing very large amounts of data. Hadoop, an open-source implementation of MapReduce, is now used to develop a wide range of applications. Although developing a correct and efficient program on MapReduce is much easier than developing one with alternatives such as MPI, it is still nontrivial when the target application relies on the more involved features of Hadoop MapReduce. In such situations, functional models of MapReduce computation play an important role, because we can use them to understand MapReduce programs better, to prove them correct, and even to optimize them. In this paper, we develop two functional models, a low-level one and a high-level one, that capture the semantics of Hadoop MapReduce computation. We discuss the detailed semantics mainly in terms of two computations: the computation of the Mapper and Reducer classes, and the computation in the Shuffle phase with the secondary-sorting technique. In addition, we develop MapReduce algorithms for the scan computational pattern (prefix sums) on the newly proposed models.
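To make the scan (prefix sums) pattern concrete, the following is a minimal sketch of a generic two-pass, block-based scan written in plain Python. It is not the paper's functional model or either of its algorithms; the names `split_blocks` and `block_scan`, the `block_size` parameter, and the mapping of blocks to mappers are assumptions introduced only for illustration. The sketch assumes the operator is associative, which is what allows each block to be processed independently.

```python
from itertools import accumulate
from operator import add


def split_blocks(xs, block_size):
    """Partition the input into contiguous blocks, one per hypothetical mapper."""
    return [xs[i:i + block_size] for i in range(0, len(xs), block_size)]


def block_scan(xs, block_size, op=add):
    """Inclusive scan of xs under an associative operator op (illustrative only)."""
    blocks = split_blocks(xs, block_size)

    # Pass 1: each block is reduced independently to a single total.
    totals = [list(accumulate(b, op))[-1] for b in blocks]

    # Between passes: scanning the (few) block totals yields, for block i,
    # the combined value of all elements in blocks 0..i.
    prefixes = list(accumulate(totals, op))

    # Pass 2: scan each block locally and combine it with the result of all
    # preceding blocks. The offset goes on the left so that non-commutative
    # operators keep the original element order.
    out = []
    for i, block in enumerate(blocks):
        local = list(accumulate(block, op))
        if i > 0:
            local = [op(prefixes[i - 1], v) for v in local]
        out.extend(local)
    return out
```

For instance, `block_scan([1, 2, 3, 4, 5, 6, 7], 3)` splits the input into three blocks and returns the same list as a sequential running sum. In a real Hadoop job, keeping the elements of each block in order is where techniques such as secondary sorting in the Shuffle phase become relevant.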


Keywords: MapReduce, Functional model, Hadoop



The author thanks Yu Liu, Kento Emoto, and Le-Duc Tung for helpful discussion on the implementation of scans. In particular, the BSP-inspired algorithm on MapReduce was first suggested by Le-Duc Tung. Part of this work was conducted as part of the PaPDAS Project supported by ANR (ANR-2010-INTB-0205-02) and JST (10102704).



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. Kochi University of Technology, Kami, Japan
