AEGEUS++: an energy-aware online partition skew mitigation algorithm for mapreduce in cloud

Cluster Computing

Abstract

This paper investigates the partition skew problem in the reduce phase of MapReduce jobs. Our study summarizes existing approaches to the skew problem as either offline or online. The offline approach is heuristic-based: it waits for the map tasks to complete and incurs computation overhead to estimate the partition sizes. In the online approach, overloaded tasks are distributed across other nodes, which requires extra split and merge operations. These extra operations, together with ineffective utilization of resources, hamper the performance of the entire system. In this paper, we propose Aegeus++ to address the skew mitigation and adaptive data sampling problems for MapReduce jobs; it builds an online prediction model with improved accuracy and minimal waiting time. In addition, we propose a near-linear skew detection algorithm and a fine-grained resource allocation algorithm, which identify the skewed partitions and allocate appropriate resources to reducers based on partition size. Finally, our energy-aware opportunistic frequency tuning algorithm improves the performance of the reducer container on the fly, so that it can process skewed data faster with minimal energy consumption. We evaluated Aegeus++ in a cloud setup using benchmark datasets and compared its performance with native Hadoop and other approaches. Based on our observations, Aegeus++ improves overall application performance by 44% over native Hadoop and decreases energy consumption by 37.67% compared with existing approaches.
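The abstract does not spell out the detection or allocation rules, so the following is only a minimal sketch of the general idea of flagging skewed partitions in linear time and sizing reducer resources by partition size. The mean-plus-k-standard-deviations threshold, the function names `detect_skewed` and `allocate_memory`, and the parameters `k`, `base_mb`, and `max_mb` are all illustrative assumptions, not the Aegeus++ algorithms.

```python
from statistics import mean, pstdev


def detect_skewed(sizes, k=1.0):
    """Return indices of partitions larger than mean + k * stddev.

    Assumption: a simple deviation-based threshold stands in for the
    paper's skew detection rule. Runs in time linear in len(sizes).
    """
    threshold = mean(sizes) + k * pstdev(sizes)
    return [i for i, size in enumerate(sizes) if size > threshold]


def allocate_memory(sizes, base_mb=1024, max_mb=4096):
    """Scale each reducer's memory with its partition size.

    Assumption: memory grows proportionally to size / mean size,
    never drops below base_mb, and is capped at max_mb.
    """
    avg = mean(sizes)
    return [
        min(max_mb, max(base_mb, round(base_mb * size / avg)))
        for size in sizes
    ]


if __name__ == "__main__":
    # Hypothetical predicted partition sizes (in MB) for five reducers.
    predicted = [100, 120, 110, 900, 105]
    print("skewed partitions:", detect_skewed(predicted))
    print("memory per reducer (MB):", allocate_memory(predicted))
```

In this sketch only the reducer holding the oversized partition receives extra memory, while the others keep the base allocation; a real system would also have to respect the cluster's container limits.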

Acknowledgements

This work is supported by the Anna Centenary Research Fellowship (CFR/ACRF/2015/15), which is funded by Anna University. Special thanks to Microsoft for providing a Microsoft Azure sponsorship award for conducting our research.

Author information

Corresponding author

Correspondence to Vimalkumar Kumaresan.

About this article

Cite this article

Kumaresan, V., Baskaran, R. & Dhavachelvan, P. AEGEUS++: an energy-aware online partition skew mitigation algorithm for mapreduce in cloud. Cluster Comput 21, 1243–1260 (2018). https://doi.org/10.1007/s10586-017-1044-8

  • DOI: https://doi.org/10.1007/s10586-017-1044-8
