MapReduce Algorithms for Big Data Analysis

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7813)

Abstract

Applications are increasingly expected to handle big data that usually does not fit in the main memory of a single machine, which makes analyzing such data a challenging problem. For these data-intensive applications, the MapReduce framework has recently attracted considerable attention and is being investigated as a cost-effective way to implement scalable parallel algorithms for big data analysis that can handle petabytes of data for millions of users. MapReduce is a programming model that allows easy development of scalable parallel applications to process big data on large clusters of commodity machines. Google’s MapReduce and its open-source equivalent, Hadoop, are powerful tools for building such applications.
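To make the programming model concrete, the following is a minimal single-process sketch of the map, shuffle, and reduce phases applied to word counting. It is an illustration only, not the Hadoop API: the function names `map_words`, `reduce_counts`, and `run_mapreduce`, and the sample documents, are hypothetical, and a real framework would distribute the grouping step across a cluster instead of holding it in one dictionary.

```python
from collections import defaultdict


def map_words(doc_id, text):
    """Map phase: emit a (word, 1) pair for every word in one document."""
    for word in text.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce phase: sum all counts that were grouped under one word."""
    yield word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Hypothetical in-memory driver that mimics map, shuffle, and reduce."""
    # Shuffle: group every intermediate value by its intermediate key.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)

    # Reduce: call the reducer once per key, in sorted key order.
    results = {}
    for out_key in sorted(groups):
        for final_key, final_value in reducer(out_key, groups[out_key]):
            results[final_key] = final_value
    return results


# Assumed sample input: (document id, document text) pairs.
documents = [
    ("d1", "big data needs big clusters"),
    ("d2", "data on clusters"),
]
word_counts = run_mapreduce(documents, map_words, reduce_counts)
```

Because the mapper and reducer only see one record or one key group at a time, the same two functions could in principle run unchanged on many machines; that independence is what lets the model scale.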

In this tutorial, we introduce the MapReduce framework based on Hadoop and present the state of the art in MapReduce algorithms for query processing, data analysis, and data mining. The intended audience is professionals who plan to design and develop MapReduce algorithms and researchers who should be aware of the state of the art in MapReduce algorithms available today for big data analysis.

A full version of this tutorial was previously presented at VLDB 2012.

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Shim, K. (2013). MapReduce Algorithms for Big Data Analysis. In: Madaan, A., Kikuchi, S., Bhalla, S. (eds) Databases in Networked Information Systems. DNIS 2013. Lecture Notes in Computer Science, vol 7813. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37134-9_3

  • DOI: https://doi.org/10.1007/978-3-642-37134-9_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-37133-2

  • Online ISBN: 978-3-642-37134-9

  • eBook Packages: Computer Science (R0)