iMapReduce: A Distributed Computing Framework for Iterative Computation

Zhang, Yanfeng; Gao, Qixin; Gao, Lixin; Wang, Cuirong

doi:10.1007/s10723-012-9204-9

Published: 25 March 2012

Volume 10, pages 47–68 (2012)
Journal of Grid Computing

Abstract

Iterative computation is pervasive in many applications, such as data mining, web ranking, graph analysis, and online social network analysis. These iterative applications typically involve massive data sets containing millions or billions of data records, which creates demand for distributed computing frameworks that can process massive data sets on a cluster of machines. MapReduce is an example of such a framework. However, MapReduce lacks built-in support for iterative processes, which must parse the data sets repeatedly. Besides specifying MapReduce jobs, users have to write a driver program that submits a series of jobs and performs convergence testing at the client. This paper presents iMapReduce, a distributed framework that supports iterative processing. iMapReduce allows users to specify the iterative computation with separate map and reduce functions, and provides automatic iterative processing within a single job. More importantly, iMapReduce significantly improves the performance of iterative implementations by (1) reducing the overhead of repeatedly creating new MapReduce jobs, (2) eliminating the shuffling of static data, and (3) allowing asynchronous execution of map tasks. We implement an iMapReduce prototype based on Apache Hadoop and show that iMapReduce can achieve up to a 5-fold speedup over Hadoop for implementing iterative algorithms.
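
The driver-program pattern that the abstract attributes to plain MapReduce can be illustrated with a small sketch. The code below is an in-memory Python simulation, not Hadoop's API and not the iMapReduce implementation: the function names (`map_phase`, `reduce_phase`, `run_driver`), the PageRank-style update, the damping factor, and the toy graph are all assumptions chosen for illustration. It assumes every node has at least one outgoing link. Each pass through the loop stands in for one submitted job, and the convergence test runs in the driver, as the abstract describes.

```python
from collections import defaultdict


def map_phase(graph, ranks):
    """Map step: each node sends an equal share of its rank to its neighbors."""
    for node, neighbors in graph.items():
        share = ranks[node] / len(neighbors)  # assumes len(neighbors) >= 1
        for neighbor in neighbors:
            yield neighbor, share


def reduce_phase(pairs, nodes, damping=0.85):
    """Reduce step: sum the shares per node and apply the damping factor."""
    sums = defaultdict(float)
    for key, value in pairs:
        sums[key] += value
    n = len(nodes)
    return {node: (1 - damping) / n + damping * sums[node] for node in nodes}


def run_driver(graph, tolerance=1e-6, max_iterations=200):
    """Hypothetical driver: one map/reduce 'job' per iteration, plus a
    client-side convergence test on the total change in rank."""
    ranks = {node: 1.0 / len(graph) for node in graph}
    for iteration in range(1, max_iterations + 1):
        # The static graph is passed to every "job" again; this is the
        # repeated work on static data that the abstract says iMapReduce avoids.
        new_ranks = reduce_phase(map_phase(graph, ranks), list(graph))
        delta = sum(abs(new_ranks[node] - ranks[node]) for node in graph)
        ranks = new_ranks
        if delta < tolerance:
            break
    return ranks, iteration


if __name__ == "__main__":
    toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    final_ranks, iterations = run_driver(toy_graph)
    print(iterations, final_ranks)
```

In this sketch the loop, the re-reading of the unchanged graph, and the convergence check all live in user code. Those are the three responsibilities that the abstract says iMapReduce moves into the framework, within a single job.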

Author information

Authors and Affiliations

School of Information Science and Engineering, Northeastern University, 11 Wenhua Road, Shenyang, Liaoning, 110819, China
Yanfeng Zhang
Department of Electrical and Information Engineering, Northeastern University at Qinhuangdao, 143 Taishan Road, Qinhuangdao, Hebei, 066000, China
Qixin Gao & Cuirong Wang
Department of Electrical and Computer Engineering, University of Massachusetts Amherst, 151 Holdsworth Way, Amherst, MA, 01002, USA
Lixin Gao

Corresponding author

Correspondence to Yanfeng Zhang.

About this article

Cite this article

Zhang, Y., Gao, Q., Gao, L. et al. iMapReduce: A Distributed Computing Framework for Iterative Computation. J Grid Computing 10, 47–68 (2012). https://doi.org/10.1007/s10723-012-9204-9

Received: 15 July 2011
Accepted: 01 March 2012
Published: 25 March 2012
Issue Date: March 2012
DOI: https://doi.org/10.1007/s10723-012-9204-9
