iMapReduce: A Distributed Computing Framework for Iterative Computation

Journal of Grid Computing

Abstract

Iterative computation is pervasive in applications such as data mining, web ranking, graph analysis, and online social network analysis. These applications typically involve massive data sets containing millions or billions of records, which creates demand for distributed computing frameworks that can process such data sets on a cluster of machines. MapReduce is one such framework. However, MapReduce lacks built-in support for iterative processing, which requires repeated passes over the data set. Besides specifying the MapReduce jobs, users have to write a driver program that submits a series of jobs and performs convergence testing at the client. This paper presents iMapReduce, a distributed framework that supports iterative processing. iMapReduce lets users specify the iterative computation with separate map and reduce functions and processes the iterations automatically within a single job. More importantly, iMapReduce significantly improves the performance of iterative implementations by (1) reducing the overhead of repeatedly creating new MapReduce jobs, (2) eliminating the shuffling of static data, and (3) allowing asynchronous execution of map tasks. We implement an iMapReduce prototype based on Apache Hadoop and show that iMapReduce can achieve up to a 5-fold speedup over Hadoop when implementing iterative algorithms.
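To make the driver-program pattern described in the abstract concrete, the following is a minimal single-process sketch of the conventional approach: a client-side loop that launches one simulated MapReduce job per iteration and tests convergence itself. It is not iMapReduce's API or Hadoop code. PageRank is used only as a familiar iterative algorithm, and the names `map_fn`, `reduce_fn`, `run_job`, and `driver`, the damping factor, and the convergence threshold are all assumptions made for illustration. The sketch also assumes every node listed as a neighbor appears as a key in the graph.

```python
from collections import defaultdict

DAMPING = 0.85  # commonly used PageRank damping factor (assumed here)


def map_fn(node, value):
    """Emit the node's link structure plus a rank share for each out-neighbor."""
    rank, neighbors = value
    # The static adjacency list is re-emitted, and therefore re-shuffled,
    # in every iteration. This is the kind of cost the abstract refers to.
    yield node, ("links", neighbors)
    for dst in neighbors:
        yield dst, ("rank", rank / len(neighbors))


def reduce_fn(node, values, num_nodes):
    """Sum incoming rank shares and re-attach the node's adjacency list."""
    neighbors, total = [], 0.0
    for kind, payload in values:
        if kind == "links":
            neighbors = payload
        else:
            total += payload
    new_rank = (1 - DAMPING) / num_nodes + DAMPING * total
    return node, (new_rank, neighbors)


def run_job(state):
    """Simulate one MapReduce job: map, shuffle by key, reduce."""
    shuffled = defaultdict(list)
    for node, value in state.items():
        for key, out in map_fn(node, value):
            shuffled[key].append(out)
    return dict(reduce_fn(k, vs, len(state)) for k, vs in shuffled.items())


def driver(graph, epsilon=1e-6, max_iters=100):
    """Client-side driver: submit one job per iteration, then test convergence."""
    n = len(graph)
    state = {node: (1.0 / n, nbrs) for node, nbrs in graph.items()}
    for iteration in range(1, max_iters + 1):
        new_state = run_job(state)  # a fresh job for every iteration
        delta = sum(abs(new_state[k][0] - state[k][0]) for k in state)
        state = new_state
        if delta < epsilon:  # convergence test performed at the client
            break
    return {k: v[0] for k, v in state.items()}, iteration


if __name__ == "__main__":
    toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    ranks, iterations = driver(toy_graph)
    print(iterations, ranks)
```

In this sketch each pass through the loop pays for a new job and moves the unchanging adjacency lists through the shuffle again, which corresponds to overheads (1) and (2) named in the abstract. How iMapReduce avoids them within a single job is described in the paper itself and is not reproduced here.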



Author information

Corresponding author

Correspondence to Yanfeng Zhang.

About this article

Cite this article

Zhang, Y., Gao, Q., Gao, L. et al. iMapReduce: A Distributed Computing Framework for Iterative Computation. J Grid Computing 10, 47–68 (2012). https://doi.org/10.1007/s10723-012-9204-9

  • DOI: https://doi.org/10.1007/s10723-012-9204-9
