Journal of Grid Computing, Volume 10, Issue 1, pp 47–68

iMapReduce: A Distributed Computing Framework for Iterative Computation

Article

Abstract

Iterative computation is pervasive in many applications, such as data mining, web ranking, graph analysis, and online social network analysis. These iterative applications typically involve massive data sets containing millions or billions of data records. This creates a demand for distributed computing frameworks that can process massive data sets on a cluster of machines. MapReduce is an example of such a framework. However, MapReduce lacks built-in support for iterative processes that need to parse data sets repeatedly. Besides specifying MapReduce jobs, users have to write a driver program that submits a series of jobs and performs convergence testing at the client. This paper presents iMapReduce, a distributed framework that supports iterative processing. iMapReduce allows users to specify the iterative computation with separate map and reduce functions, and provides support for automatic iterative processing within a single job. More importantly, iMapReduce significantly improves the performance of iterative implementations by (1) reducing the overhead of repeatedly creating new MapReduce jobs, (2) eliminating the shuffling of static data, and (3) allowing asynchronous execution of map tasks. We implement an iMapReduce prototype based on Apache Hadoop and show that iMapReduce can achieve up to a 5x speedup over Hadoop when implementing iterative algorithms.
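The driver-program pattern mentioned in the abstract can be illustrated with a small in-memory sketch. This is not iMapReduce or Hadoop code: it is a plain Python simulation of PageRank in which every iteration is a separate simulated job and the client tests convergence. The names `map_fn`, `reduce_fn`, `run_job`, and `driver`, the damping factor, and the tolerance are all assumptions made for illustration. Note how the static adjacency lists are shuffled again in every job, which is one of the overheads the abstract says iMapReduce removes.

```python
from collections import defaultdict

DAMPING = 0.85  # assumed damping factor, not taken from the paper


def map_fn(node, value):
    """Emit the node's adjacency list (static data) and its rank shares (state)."""
    rank, neighbors = value
    # The static graph structure is re-emitted so the reducer can rebuild the record.
    yield node, ("links", neighbors)
    for dst in neighbors:
        yield dst, ("share", rank / len(neighbors))


def reduce_fn(node, values, num_nodes):
    """Sum incoming rank shares and reattach the adjacency list."""
    neighbors, total = [], 0.0
    for kind, payload in values:
        if kind == "links":
            neighbors = payload
        else:
            total += payload
    return node, ((1 - DAMPING) / num_nodes + DAMPING * total, neighbors)


def run_job(records):
    """One simulated MapReduce job: map, shuffle by key, reduce."""
    shuffled = defaultdict(list)
    for node, value in records.items():
        for key, out in map_fn(node, value):
            shuffled[key].append(out)
    n = len(records)
    return dict(reduce_fn(k, v, n) for k, v in shuffled.items())


def driver(graph, tol=1e-6, max_iters=200):
    """Client-side driver: submit one job per iteration, then test convergence.

    Assumes every node appears as a key in `graph`.
    """
    n = len(graph)
    records = {node: (1.0 / n, nbrs) for node, nbrs in graph.items()}
    for iteration in range(1, max_iters + 1):
        new_records = run_job(records)  # a fresh job for every iteration
        delta = sum(abs(new_records[k][0] - records[k][0]) for k in records)
        records = new_records
        if delta < tol:  # convergence test performed at the client
            break
    return {k: v[0] for k, v in records.items()}, iteration


if __name__ == "__main__":
    toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # hypothetical input
    ranks, iterations = driver(toy_graph)
    print(iterations, ranks)
```

In this sketch the loop, the job submission, and the convergence check all live in the client, which is the arrangement the abstract describes for plain MapReduce; a framework with built-in iteration would move that loop inside a single job.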

Keywords

Iterative computation, iMapReduce, Distributed computing framework, Hadoop

Copyright information

© Springer Science+Business Media B.V. 2012

Authors and Affiliations

  • Yanfeng Zhang (1)
  • Qixin Gao (2)
  • Lixin Gao (3)
  • Cuirong Wang (2)

  1. School of Information Science and Engineering, Northeastern University, Shenyang, China
  2. Department of Electrical and Information Engineering, Northeastern University at Qinhuangdao, Qinhuangdao, China
  3. Department of Electrical and Computer Engineering, University of Massachusetts Amherst, Amherst, USA
