Abstract
Applications are increasingly expected to handle big data that does not fit in the main memory of a single machine, which makes big data analysis a challenging problem. For such data-intensive applications, the MapReduce framework has attracted considerable attention as a cost-effective way to implement scalable parallel algorithms that can handle petabytes of data for millions of users. MapReduce is a programming model that allows easy development of scalable parallel applications to process big data on large clusters of commodity machines. Google’s MapReduce and its open-source equivalent, Hadoop, are powerful tools for building such applications.
In this tutorial, we will introduce the MapReduce framework based on Hadoop and present the state of the art in MapReduce algorithms for query processing, data analysis, and data mining. The intended audience is professionals who plan to design and develop MapReduce algorithms and researchers who should be aware of the state of the art in MapReduce algorithms available today for big data analysis.
A full version of this tutorial was previously presented at VLDB 2012.
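As a rough illustration of the programming model described above, the following minimal sketch simulates the map, shuffle, and reduce phases in memory on a single machine, using word counting as the example. It is not Hadoop's API: the names `map_word_count`, `reduce_word_count`, and `run_mapreduce`, as well as the sample documents, are hypothetical and chosen only for illustration. A real framework would distribute these phases across a cluster.

```python
from collections import defaultdict


def map_word_count(doc_id, text):
    """Map phase (hypothetical helper): emit (word, 1) for each word."""
    for word in text.lower().split():
        yield word, 1


def reduce_word_count(word, counts):
    """Reduce phase (hypothetical helper): sum the counts for one word."""
    yield word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Single-machine stand-in for a MapReduce job (assumption, not Hadoop)."""
    # Shuffle: group every value emitted by the mappers under its key.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)

    # Reduce: call the reducer once per distinct key, in sorted key order.
    results = []
    for out_key in sorted(groups):
        results.extend(reducer(out_key, groups[out_key]))
    return results


# Sample input, invented for this sketch: (document id, document text) pairs.
documents = [
    ("d1", "big data needs big clusters"),
    ("d2", "data analysis on clusters"),
]

word_counts = dict(run_mapreduce(documents, map_word_count, reduce_word_count))
print(word_counts)
```

The mapper and reducer are plain functions with no shared state, which is the property that lets a framework run many copies of them in parallel on different machines.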
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Shim, K. (2013). MapReduce Algorithms for Big Data Analysis. In: Madaan, A., Kikuchi, S., Bhalla, S. (eds) Databases in Networked Information Systems. DNIS 2013. Lecture Notes in Computer Science, vol 7813. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37134-9_3
DOI: https://doi.org/10.1007/978-3-642-37134-9_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37133-2
Online ISBN: 978-3-642-37134-9
eBook Packages: Computer Science, Computer Science (R0)