Functional Models of Hadoop MapReduce with Application to Scan

International Journal of Parallel Programming

Abstract

MapReduce, first proposed by Google, is a programming model for processing very large amounts of data. Hadoop, an open-source implementation of MapReduce, is now used for developing a wide range of applications. Although developing a correct and efficient program on MapReduce is much easier than developing one with, for example, MPI, it is still nontrivial if the target application requires the more involved functionality of Hadoop MapReduce. In such situations, functional models of MapReduce computation play an important role because we can use them to better understand MapReduce programs, to prove their correctness, and even to optimize them. In this paper, we develop two functional models, a low-level one and a high-level one, which capture the semantics of Hadoop MapReduce computation. We discuss the detailed semantics mainly in terms of two computations: the computation of the Mapper and Reducer classes, and the computation in the Shuffle phase with the secondary-sorting technique. In addition, we develop MapReduce algorithms for the scan computational pattern (prefix sums) on the newly proposed models.
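
For readers unfamiliar with the scan pattern named in the abstract, the following Haskell sketch shows what a prefix-sum computation produces and one generic way to split it into independent blocks. It is not the algorithm developed in the paper: the names `chunks` and `blockScan`, the restriction to integer addition, and the block-size parameter are assumptions made only for this illustration.

```haskell
-- Split a list into consecutive blocks of at most n elements.
-- Assumes n > 0 (a non-positive n would not terminate on a non-empty list).
chunks :: Int -> [a] -> [[a]]
chunks _ [] = []
chunks n xs = let (block, rest) = splitAt n xs in block : chunks n rest

-- Sequential reference: inclusive prefix sums.
prefixSums :: [Int] -> [Int]
prefixSums = scanl1 (+)

-- Block-wise prefix sums in three steps:
--   1. scan each block independently,
--   2. scan the block totals to obtain one offset per block,
--   3. add each block's offset to its local scan.
blockScan :: Int -> [Int] -> [Int]
blockScan n xs = concat (zipWith addOffset offsets localScans)
  where
    blocks     = chunks n xs
    localScans = map (scanl1 (+)) blocks
    totals     = map sum blocks
    offsets    = scanl (+) 0 totals   -- exclusive scan of the block totals
    addOffset off = map (+ off)
```

For example, `blockScan 3 [1 .. 7]` and `prefixSums [1 .. 7]` both evaluate to `[1,3,6,10,15,21,28]`. Steps 1 and 3 touch each block independently, which is why decompositions of this shape are a natural fit for parallel frameworks.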

Notes

  1. It is not trivial to give a definition of \( Bag \) that supports its nondeterministic behavior. One way is to define the structure in the same way as a list and to simulate the nondeterministic behavior by permuting its elements.

  2. We will use the term “phase” for the models of computation and “task” for the implementation. In the implementation, MapReduce consists of two sets of tasks: the mapper tasks work from the input to the end of the Map phase, and the reducer tasks work from the Shuffle phase to the end of the output.

  3. The function getPartition (\( hashP \)) can take a value as well as a key. The author found hardly any applications in which the function uses the value. In the high-level model given in Sect. 5, we do not use the value for partitioning.

  4. Triples (b1, s1, l1) and (b2, s2, l2) represent the two values to be compared, each stored in a byte stream.

  5. If we use the definition \(a \equiv b \Longleftarrow |a - b| < 2\), then the list [3, 4, 5, 7] will be grouped as [3, 4, 5] and [7], not [3, 4] and [5] and [7]. In Haskell, there is a function \( Data.List.groupBy \mathbin {::}(\alpha \rightarrow \alpha \rightarrow Bool )\rightarrow [\alpha ]\rightarrow [[\alpha ]]\) that has similar functionality, but it returns [[3, 4], [5], [7]] for this case.

  6. Readers who know Haskell well would write the definition with a state-monadic function.
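
As an illustration of Note 1, a bag can be sketched in Haskell as a list whose order carries no meaning. The type `Bag` and the helpers `orderings` and `orderIndependent` below are hypothetical names chosen for this sketch, not definitions taken from the paper.

```haskell
import Data.List (permutations, sort)

-- A bag represented by a list; the order of the list is not meaningful.
newtype Bag a = Bag [a]

-- Two bags are equal when they hold the same elements with the same
-- multiplicities, regardless of order.
instance Ord a => Eq (Bag a) where
  Bag xs == Bag ys = sort xs == sort ys

-- Every order in which the elements of a bag might be observed.
orderings :: Bag a -> [[a]]
orderings (Bag xs) = permutations xs

-- A function on lists is a well-defined function on bags only if it
-- gives the same result for every possible ordering.
orderIndependent :: Eq b => ([a] -> b) -> Bag a -> Bool
orderIndependent f bag = case map f (orderings bag) of
  []       -> True
  (r : rs) -> all (== r) rs
```

With these definitions, `orderIndependent sum (Bag [1, 2, 3])` is `True`, whereas `orderIndependent (take 1) (Bag [1, 2])` is `False`. Enumerating all permutations is exponential, so this is useful only for reasoning about small examples.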
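
The partitioning function discussed in Note 3 can be pictured with the following Haskell sketch. In Hadoop, `getPartition` receives the key, the value, and the number of partitions; the type synonym `Partitioner` and the helpers `keyOnly` and `partitionPairs` are hypothetical names for this illustration, and the hash function is supplied by the caller rather than taken from any library.

```haskell
-- A partitioner maps a key, a value, and the number of partitions
-- to a partition index in the range [0, n - 1].
type Partitioner k v = k -> v -> Int -> Int

-- A partitioner that ignores the value, the common case described in Note 3.
-- Assumes n > 0; `mod` with a positive divisor is never negative in Haskell.
keyOnly :: (k -> Int) -> Partitioner k v
keyOnly hash key _value n = hash key `mod` n

-- Distribute key-value pairs over n partitions, keeping the input order
-- within each partition.
partitionPairs :: Partitioner k v -> Int -> [(k, v)] -> [[(k, v)]]
partitionPairs p n kvs =
  [ [ (k, v) | (k, v) <- kvs, p k v n == i ] | i <- [0 .. n - 1] ]
```

For example, `partitionPairs (keyOnly id) 2 [(1, "a"), (2, "b"), (3, "c"), (1, "d")]` evaluates to `[[(2,"b")],[(1,"a"),(3,"c"),(1,"d")]]`: pairs with the same key always land in the same partition.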
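
The triples in Note 4 match the shape of Hadoop's `RawComparator.compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2)`, where each triple gives a buffer, a start offset, and a length. A minimal Haskell sketch of such a comparison follows, assuming plain lexicographic order on the raw bytes and using lists of `Word8` in place of byte arrays; `slice` and `compareRaw` are hypothetical names.

```haskell
import Data.Word (Word8)

-- The l bytes of buffer b starting at offset s.
slice :: [Word8] -> Int -> Int -> [Word8]
slice b s l = take l (drop s b)

-- Compare two serialized values without deserializing them.
-- List comparison is lexicographic, and a proper prefix sorts first.
compareRaw :: [Word8] -> Int -> Int -> [Word8] -> Int -> Int -> Ordering
compareRaw b1 s1 l1 b2 s2 l2 = compare (slice b1 s1 l1) (slice b2 s2 l2)
```

For example, `compareRaw [0, 1, 2, 3] 1 2 [9, 1, 2, 9] 1 2` is `EQ`, because both slices are the bytes `[1, 2]`. Real comparators usually interpret the bytes according to the serialized type, so byte-wise order is only the simplest case.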
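
The difference described in Note 5 can be reproduced with a short Haskell sketch. `Data.List.groupBy` compares every element with the first element of the current group, whereas the grouping in the note behaves as if each element were compared with its immediate predecessor. The function `groupAdjacent` is a hypothetical name for one possible reading of that behavior; it is not the paper's definition.

```haskell
import Data.List (groupBy)

-- The relation from Note 5: a and b are equivalent when |a - b| < 2.
close :: Int -> Int -> Bool
close a b = abs (a - b) < 2

-- Group consecutive elements, comparing each element with the one
-- immediately before it rather than with the head of the group.
groupAdjacent :: (a -> a -> Bool) -> [a] -> [[a]]
groupAdjacent _ [] = []
groupAdjacent eq (x : xs) = go [x] x xs
  where
    -- acc holds the current group in reverse; prev is its latest element.
    go acc _ [] = [reverse acc]
    go acc prev (y : ys)
      | eq prev y = go (y : acc) y ys
      | otherwise = reverse acc : go [y] y ys

-- groupAdjacent close [3, 4, 5, 7]  evaluates to [[3,4,5],[7]]
-- groupBy       close [3, 4, 5, 7]  evaluates to [[3,4],[5],[7]]
```

The two functions agree whenever the relation is a true equivalence (in particular, transitive); they differ here because `close` is not transitive: 3 is close to 4 and 4 is close to 5, but 3 is not close to 5.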

Acknowledgments

The author thanks Yu Liu, Kento Emoto, and Le-Duc Tung for helpful discussion on the implementation of scans. In particular, the BSP-inspired algorithm on MapReduce was first suggested by Le-Duc Tung. Part of this work was conducted as part of the PaPDAS Project supported by ANR (ANR-2010-INTB-0205-02) and JST (10102704).

Author information

Corresponding author

Correspondence to Kiminori Matsuzaki.

About this article

Cite this article

Matsuzaki, K. Functional Models of Hadoop MapReduce with Application to Scan. Int J Parallel Prog 45, 362–381 (2017). https://doi.org/10.1007/s10766-016-0414-9

  • DOI: https://doi.org/10.1007/s10766-016-0414-9
