## Abstract

MapReduce, first proposed by Google, is a remarkable programming model for processing very large amounts of data. Hadoop, an open-source implementation of MapReduce, is now used to develop a wide range of applications. Although developing a correct and efficient program on MapReduce is much easier than developing one with MPI and the like, it is still nontrivial when the target application requires the more intricate features of Hadoop MapReduce. In such situations, functional models of MapReduce computation play an important role, because we can use them to better understand MapReduce programs, to prove their correctness, and even to optimize them. In this paper, we develop two functional models, a low-level one and a high-level one, that capture the semantics of Hadoop MapReduce computation. We discuss the detailed semantics mainly in terms of two computations: the computation of the Mapper and Reducer classes, and the computation in the Shuffle phase with the secondary-sorting technique. In addition, we develop MapReduce algorithms for the scan computational pattern (prefix sums) on the newly proposed models.

## Notes

It is not trivial to give a definition of \( Bag \) that supports its nondeterministic behavior. One way is to define the structure in the same way as a list and to simulate the nondeterminism by permuting its elements.
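The permutation-based reading of a bag can be illustrated with a small Python sketch. It stores a bag as a list and treats a function on bags as order-insensitive only if it gives the same result for every ordering of that list. The names `bag_results` and `is_order_insensitive` are hypothetical, not taken from the paper, and enumerating all permutations is feasible only for very small inputs.

```python
from itertools import permutations

def bag_results(f, elements):
    """Apply f to every ordering of the elements and collect the distinct results."""
    results = []
    for ordering in permutations(elements):
        r = f(list(ordering))
        if r not in results:
            results.append(r)
    return results

def is_order_insensitive(f, elements):
    """True if f yields a single result regardless of element order."""
    return len(bag_results(f, elements)) == 1

# Summation ignores order; taking the first element does not.
print(is_order_insensitive(sum, [3, 1, 2]))               # True
print(is_order_insensitive(lambda xs: xs[0], [3, 1, 2]))  # False
```

A function that passes this check behaves deterministically on the bag, whichever order the underlying list happens to have.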

We will use the term “phase” for the models of computation and “task” for the implementation. In the implementation, MapReduce consists of two sets of tasks: the mapper tasks work from the input to the end of the Map phase, and the reducer tasks work from the Shuffle phase to the end of the output.

The function getPartition (\( hashP \)) can take a value as well as a key. The author found hardly any applications in which the function uses the value. In the high-level model given in Sect. 5, we do not use the value for partitioning.
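A key-only partitioning function might look like the following Python sketch. The name `hash_p` mirrors \( hashP \) from the note, while `partition` and the CRC32-based hash are assumptions chosen only to keep the example deterministic; they are not the hash Hadoop uses.

```python
import zlib

def hash_p(key, num_partitions):
    """Choose a partition from the key alone; the value is never inspected."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions

def partition(pairs, num_partitions):
    """Distribute (key, value) pairs into num_partitions lists using hash_p."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        parts[hash_p(key, num_partitions)].append((key, value))
    return parts

pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
parts = partition(pairs, 2)
for index, part in enumerate(parts):
    print(index, part)
```

Because the partition depends only on the key, all pairs sharing a key land in the same partition, and pairs within a partition keep their input order.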

The triples (b1, s1, l1) and (b2, s2, l2) represent the two values to be compared, each stored in a byte stream.
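As a rough illustration, the Python sketch below compares two serialized values in place. It assumes that each triple is a byte buffer, a start offset, and a length, and that values are ordered lexicographically by their bytes. The function name `compare_raw` and the sample buffer are hypothetical.

```python
def compare_raw(b1, s1, l1, b2, s2, l2):
    """Compare two serialized values; returns -1, 0, or 1 like a comparator."""
    a = b1[s1:s1 + l1]  # bytes of the first value
    b = b2[s2:s2 + l2]  # bytes of the second value
    return (a > b) - (a < b)

buf = b"xxapple--banana"
# "apple" starts at offset 2 (length 5); "banana" starts at offset 9 (length 6).
print(compare_raw(buf, 2, 5, buf, 9, 6))  # -1
print(compare_raw(buf, 9, 6, buf, 2, 5))  # 1
print(compare_raw(buf, 2, 5, buf, 2, 5))  # 0
```

Both values may live in the same buffer, as here, or in different ones; only the offsets and lengths determine which bytes are compared.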

If we use the definition \(a \equiv b \Longleftarrow |a - b| < 2\), then the list [3, 4, 5, 7] will be grouped as [3, 4, 5] and [7], not [3, 4] and [5] and [7]. In Haskell, there is a function \( Data.List.groupBy \mathbin {::}(\alpha \rightarrow \alpha \rightarrow Bool )\rightarrow [\alpha ]\rightarrow [[\alpha ]]\) that has similar functionality, but it returns [[3, 4], [5], [7]] for this case.
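The difference between the two groupings can be reproduced with a short Python sketch. Here `group_adjacent` compares each element with its predecessor, which is the reading consistent with the result [3, 4, 5] and [7] above, while `group_by_first` emulates the behavior of \( Data.List.groupBy \) by comparing each element with the first element of the current group. Both function names are hypothetical.

```python
def group_adjacent(equiv, xs):
    """Start a new group whenever an element is not equivalent to its predecessor."""
    groups = []
    for x in xs:
        if groups and equiv(groups[-1][-1], x):
            groups[-1].append(x)
        else:
            groups.append([x])
    return groups

def group_by_first(equiv, xs):
    """Emulate Data.List.groupBy: compare each element with the first of its group."""
    groups = []
    for x in xs:
        if groups and equiv(groups[-1][0], x):
            groups[-1].append(x)
        else:
            groups.append([x])
    return groups

def close(a, b):
    """The relation from the note: |a - b| < 2."""
    return abs(a - b) < 2

print(group_adjacent(close, [3, 4, 5, 7]))  # [[3, 4, 5], [7]]
print(group_by_first(close, [3, 4, 5, 7]))  # [[3, 4], [5], [7]]
```

The two functions agree whenever the relation is a true equivalence (in particular, transitive); they differ here because \( |a - b| < 2 \) is not transitive.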

Readers who know Haskell well would write the definition with a state-monadic function.


## Acknowledgments

The author thanks Yu Liu, Kento Emoto, and Le-Duc Tung for helpful discussions on the implementation of scans. In particular, the BSP-inspired algorithm on MapReduce was first suggested by Le-Duc Tung. This work was conducted in part under the PaPDAS Project, supported by ANR (ANR-2010-INTB-0205-02) and JST (10102704).

## About this article

### Cite this article

Matsuzaki, K. Functional Models of Hadoop MapReduce with Application to Scan. *Int J Parallel Prog* **45**, 362–381 (2017). https://doi.org/10.1007/s10766-016-0414-9
