Skip to main content

Cross-Phase Optimization in MapReduce

  • Chapter
  • First Online:
Cloud Computing for Data-Intensive Applications

Abstract

MapReduce has proven remarkably effective for a wide variety of data-intensive applications, but it was designed to run on large single-site homogeneous clusters. Researchers have begun to explore the extent to which the original MapReduce assumptions can be relaxed including skewed workloads, iterative applications, and heterogeneous computing environments. This chapter continues this exploration by applying MapReduce across widely distributed data over distributed computation resources. This problem arises when datasets are generated and stored at multiple sites as is common in many scientific domains and increasingly e-commerce applications. It also occurs when multi-site resources such as geographically separated data centers are applied to the same application. Using Hadoop, we show that the absence of network and node homogeneity and locality of data lead to poor performance. The problem is that interaction of MapReduce phases becomes pronounced in the presence of heterogeneous network behavior. In this paper, we propose new cross-phase optimization techniques that enable independent MapReduce phases to influence one another. We propose techniques that optimize the push and map phases to enable push-map overlap and to allow map behavior to feed back into push dynamics. Similarly, we propose techniques that optimize the map and reduce phases to enable shuffle cost to feed back and affect map scheduling decisions. We evaluate the benefits of our techniques in both Amazon EC2 and PlanetLab. The experimental results show the potential of these techniques as performance is improved from 7 to 18 % depending on the execution environment and application.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 84.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 179.00
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info
Hardcover Book
USD 109.99
Price excludes VAT (USA)
  • Durable hardcover edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

  1. 1.

    http://newsroom.fb.com.

  2. 2.

    When this is not possible, tasks must read their inputs remotely, and scarce network bandwidth becomes a limiting factor.

  3. 3.

    This and the remaining figures and tables from this chapter are reproduced from [23].

  4. 4.

    To improve fault tolerance, we have also added an option to cache and replicate inputs at the compute nodes. This reduces the need to re-fetch remote data after task failures or for speculative execution.

  5. 5.

    Throughout this chapter, error bars indicate 95 % confidence intervals.

References

  1. Cloudera impala. Http://www.cloudera.com/impala

  2. Scalding. Http://github.com/twitter/scalding

  3. A conversation with Jim Gray. Queue 1(4), 8–17 (2003). DOI 10.1145/864056.864078. URL http://doi.acm.org/10.1145/864056.864078

  4. Ahmad, F., Chakradhar, S., Raghunathan, A., Vijaykumar, T.N.: Tarazu: Optimizing MapReduce on heterogeneous clusters. In: Proceedings of ASPLOS, pp. 61–74 (2012)

    Google Scholar 

  5. Ananthanarayanan, G., Kandula, S., Greenberg, A., Stoica, I., Lu, Y., Saha, B.: Reining in the outliers in map-reduce clusters using mantri. In: Proceedings of OSDI, pp. 265–278 (2010)

    Google Scholar 

  6. Arlitt, M., Jin, T.: Workload characterization of the 1998 World Cup Web Site. Tech. Rep. HPL-1999-35R1, HP Labs (1999)

    Google Scholar 

  7. Armbrust, M., Fox, A., Griffith, R., Joseph, A.D., Katz, R.H., Konwinsky, A., Lee, G., Patterson, D.A., Rabkin, A., Stoica, I., Zaharia, M.: Above the clouds: A Berkeley view of cloud computing. Tech. Rep. UCB/EECS-2009-28, Electrical Engineering and Computer Sciences, University of California at Berkeley (2009)

    Google Scholar 

  8. Babu, S.: Towards automatic optimization of MapReduce programs. In: Proceedings of ACM SoCC, pp. 137–142 (2010)

    Google Scholar 

  9. Baker, J., et al.: Megastore: Providing scalable, highly available storage for interactive services. In: Proceedings of CIDR, pp. 223–234 (2011)

    Google Scholar 

  10. BitTorrent. Http://www.bittorrent.com

  11. Cardosa, M., Wang, C., Nangia, A., Chandra, A., Weissman, J.: Exploring MapReduce efficiency with highly-distributed data. In: Proceedings of MapReduce, pp. 27–33 (2011)

    Google Scholar 

  12. Chen, Q., Zhang, D., Guo, M., Deng, Q., Guo, S., Guo, S.: SAMR: A self-adaptive MapReduce scheduling algorithm in heterogeneous environment. In: Proceedings of IEEE CIT, pp. 2736–2743 (2010)

    Google Scholar 

  13. Condie, T., Conway, N., Alvaro, P., Hellerstein, J.M., Elmeleegy, K., Sears, R.: MapReduce online. In: Proceedings of NSDI, pp. 313–327 (2010)

    Google Scholar 

  14. Corbett, J.C., et al.: Spanner: Google’s globally-distributed database. In: Proceedings of OSDI, pp. 251–264 (2012)

    Google Scholar 

  15. Costa, F., Silva, L., Dahlin, M.: Volunteer cloud computing: MapReduce over the internet. In: Proceedings of IEEE IPDPSW, pp. 1855–1862 (2011)

    Google Scholar 

  16. Dean, J., Ghemawat, S.: MapReduce: Simplified data processing on large clusters. In: Proceedings of OSDI, pp. 137–150 (2004)

    Google Scholar 

  17. Gadre, H., Rodero, I., Parashar, M.: Investigating MapReduce framework extensions for efficient processing of geographically scattered datasets. In: Proceedings of ACM SIGMETRICS, pp. 116–118 (2011)

    Google Scholar 

  18. Gates, A.F., Natkovich, O., Chopra, S., Kamath, P., Narayanamurthy, S.M., Olston, C., Reed, B., Srinivasan, S., Srivastava, U.: Building a high-level dataflow system on top of map-reduce: the pig experience. Proceedings of the VLDB Endowment 2(2), 1414–1425 (2009). URL http://dl.acm.org/citation.cfm?id=1687553.1687568

  19. GridFTP. Http://globus.org/toolkit/docs/3.2/gridftp/

  20. Free eBooks by Project Gutenberg. http://www.gutenberg.org/

  21. Hadoop. http://hadoop.apache.org

  22. Heintz, B., Chandra, A., Sitaraman, R.K.: Optimizing MapReduce for highly distributed environments. Tech. Rep. TR 12-003, Department of Computer Science and Engineering, University of Minnesota (2012)

    Google Scholar 

  23. Heintz, B., Wang, C., Chandra, A., Weissman, J.: Cross-phase optimization in MapReduce. In: IEEE International Conference on Cloud Engineering (IC2E), pp. 338–347 (2013)

    Google Scholar 

  24. Herodotou, H., Lim, H., Luo, G., Borisov, N., Dong, L., Cetin, F.B., Babu, S.: Starfish: A self-tuning system for big data analytics. In: Proceedings of CIDR, pp. 261–272 (2011)

    Google Scholar 

  25. Hey, T., Tansley, S., Tolle, K. (eds.): The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, Redmond, Washington (2009). URL http://research.microsoft.com/en-us/collaboration/fourthparadigm/

  26. Kim, S., Won, J., Han, H., Eom, H., Yeom, H.Y.: Improving Hadoop performance in intercloud environments. Proceedings of ACM SIGMETRICS 39(3), 107–109 (2011)

    Article  Google Scholar 

  27. Lin, H., Ma, X., Archuleta, J., Feng, W.c., Gardner, M., Zhang, Z.: MOON: MapReduce on opportunistic environments. In: Proceedings of ACM HPDC, pp. 95–106 (2010)

    Google Scholar 

  28. Luo, Y., Guo, Z., Sun, Y., Plale, B., Qiu, J., Li, W.W.: A hierarchical framework for cross-domain MapReduce execution. In: Proceedings of ECMLS, pp. 15–22 (2011)

    Google Scholar 

  29. Hadoop. http://mahout.apache.org

  30. Nygren, E., Sitaraman, R., Sun, J.: The Akamai network: A platform for high-performance internet applications. ACM SIGOPS Oper. Syst. Rev. 44(3), 2–19 (2010)

    Article  Google Scholar 

  31. Palanisamy, B., Singh, A., Liu, L., Jain, B.: Purlieus: locality-aware resource allocation for MapReduce in a cloud. In: Proceedings of SC, pp. 58:1–58:11 (2011)

    Google Scholar 

  32. Rabkin, A., Arye, M., Sen, S., Pai, V., Freedman, M.J.: Making every bit count in wide-area analytics. In: Proceedings of the 14th Workshop on Hot Topics in Operating Systems (2013)

    Google Scholar 

  33. Sandholm, T., Lai, K.: MapReduce optimization using dynamic regulated prioritization. In: Proceedings of ACM SIGMETRICS, pp. 299–310 (2009)

    Google Scholar 

  34. Satyanarayanan, M., Bahl, P., Caceres, R., Davies, N.: The case for VM-based cloudlets in mobile computing. IEEE Pervasive Computing 8, 14–23 (2009)

    Article  Google Scholar 

  35. Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P., Murthy, R.: Hive: a warehousing solution over a map-reduce framework. Proceedings of the VLDB Endowment 2, 1626–1629 (2009)

    Article  Google Scholar 

  36. Verma, A., Zea, N., Cho, B., Gupta, I., Campbell, R.H.: Breaking the MapReduce stage barrier. In: Proceedings of IEEE Cluster, pp. 235–244 (2010)

    Google Scholar 

  37. Zaharia, M., Konwinski, A., Joseph, A.D., Katz, R.H., Stoica, I.: Improving MapReduce performance in heterogeneous environments. In: Proceedings of OSDI, pp. 29–42 (2008)

    Google Scholar 

Download references

Acknowledgements

The authors would like to acknowledge Professor Ramesh Sitaraman (ramesh@cs.umass.edu) for his contributions to our model-driven optimization approach (Sect. 2), as well as Chenyu Wang (chwang@cs.umn.edu) for his contributions to the cross-phase optimization techniques. They would further like to acknowledge NSF Grants IIS-0916425 and CNS-0643505, which supported this research.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Benjamin Heintz .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2014 Springer Science+Business Media New York

About this chapter

Cite this chapter

Heintz, B., Chandra, A., Weissman, J. (2014). Cross-Phase Optimization in MapReduce. In: Li, X., Qiu, J. (eds) Cloud Computing for Data-Intensive Applications. Springer, New York, NY. https://doi.org/10.1007/978-1-4939-1905-5_12

Download citation

  • DOI: https://doi.org/10.1007/978-1-4939-1905-5_12

  • Published:

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4939-1904-8

  • Online ISBN: 978-1-4939-1905-5

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics