MapReduce Algorithms for Big Data Analysis

Shim, Kyuseok

doi:10.1007/978-3-642-37134-9_3

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7813)

Abstract

Applications are increasingly expected to handle big data that does not fit in the main memory of a single machine, which makes big data analysis a challenging problem. For such data-intensive applications, the MapReduce framework has recently attracted considerable attention and is being investigated as a cost-effective option for implementing scalable parallel algorithms that can handle petabytes of data for millions of users. MapReduce is a programming model that allows easy development of scalable parallel applications that process big data on large clusters of commodity machines. Google’s MapReduce and its open-source equivalent, Hadoop, are powerful tools for building such applications.
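To make the programming model concrete, the following is a minimal single-process sketch of the map, shuffle, and reduce phases, using word counting as the example. It is an illustration only, not Hadoop code and not an algorithm from the tutorial: the names `run_mapreduce`, `map_word_count`, and `reduce_word_count`, as well as the sample documents, are hypothetical. A real framework would run the mappers and reducers in parallel across a cluster.

```python
from collections import defaultdict


def map_word_count(doc_id, text):
    """Map phase: emit a (word, 1) pair for every word in one document."""
    for word in text.lower().split():
        yield word, 1


def reduce_word_count(word, counts):
    """Reduce phase: sum all partial counts that share the same word."""
    yield word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Run mapper and reducer sequentially (hypothetical in-memory driver)."""
    # Shuffle: group every intermediate value under its intermediate key.
    groups = defaultdict(list)
    for key, value in records:
        for mid_key, mid_value in mapper(key, value):
            groups[mid_key].append(mid_value)

    # Reduce: call the reducer once per distinct intermediate key.
    output = {}
    for mid_key in sorted(groups):
        for out_key, out_value in reducer(mid_key, groups[mid_key]):
            output[out_key] = out_value
    return output


# Hypothetical sample input: (document id, document text) pairs.
docs = [
    ("d1", "big data needs big clusters"),
    ("d2", "data analysis on clusters"),
]
counts = run_mapreduce(docs, map_word_count, reduce_word_count)
```

The programmer supplies only the two small functions; grouping by key and scheduling are left to the framework, which is what lets the same code scale from one machine to many.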

In this tutorial, we introduce the MapReduce framework based on Hadoop and present the state of the art in MapReduce algorithms for query processing, data analysis, and data mining. The intended audience is professionals who plan to design and develop MapReduce algorithms and researchers who need to be aware of the state of the art in MapReduce algorithms available today for big data analysis.

A full version of this tutorial was previously presented at VLDB 2012.

Author information

Authors and Affiliations

Seoul National University, Seoul, Korea
Kyuseok Shim

Editor information

Editors and Affiliations

Graduate Department of Computer and Information Systems, University of Aizu, Ikki Machi, 965-8580, Aizu-Wakamatsu, Fukushima, Japan
Aastha Madaan, Shinji Kikuchi & Subhash Bhalla

About this paper

Cite this paper

Shim, K. (2013). MapReduce Algorithms for Big Data Analysis. In: Madaan, A., Kikuchi, S., Bhalla, S. (eds) Databases in Networked Information Systems. DNIS 2013. Lecture Notes in Computer Science, vol 7813. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37134-9_3

DOI: https://doi.org/10.1007/978-3-642-37134-9_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37133-2
Online ISBN: 978-3-642-37134-9
eBook Packages: Computer Science, Computer Science (R0)
