Skip to main content
Log in

OPTIMA: On-Line Partitioning Skew Mitigation for MapReduce with Resource Adjustment

  • Published:
Journal of Network and Systems Management Aims and scope Submit manuscript

Abstract

Partitioning skew has been shown to be a major issue that can significantly prolong the execution time of MapReduce jobs. Most of the existing off-line heuristics for partitioning skew mitigation are inefficient; they have to wait for the completion of all the map tasks. Some solutions can tackle this problem on-line, but will impose an additional overhead by repartitioning the workload of overloaded tasks. In this paper, we present OPTIMA, an on-line partitioning skew mitigation technique for MapReduce. OPTIMA predicts the workload distribution of reduce tasks at run-time, leverages the deviation detection technique to identify the overloaded tasks and pro-actively adjusts resource allocation for these tasks to reduce their execution time. We provide the upper bound of OPTIMA in time complexity, while allowing OPTIMA to perform totally on-line. Through experiments using both real and synthetic workloads running on an 11-node Hadoop cluster, we have observed OPTIMA can effectively mitigate the partitioning skew and improved the job completion time by up to 36.73 % in our experiments.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9

Similar content being viewed by others

Notes

  1. Since the memory requirement is related to the size of data, larger datasets are needed in order to clearly demonstrate the impact of memory allocation.

  2. Using compression in Hadoop to optimize MapReduce performance is prevalent in industry and academia [6, 27, 34].

References

  1. Ananthanarayanan, G., Hung, M.C.C., Ren, X., Stoica, I., Wierman, A., Yu, M.: Grass: trimming stragglers in approximation analytics. In: Proceedings of the 11th USENIX NSDI (2014)

  2. Ananthanarayanan, G., Kandula, S., Greenberg, A.G., Stoica, I., Lu, Y., Saha, B., Harris, E.: Reining in the outliers in map-reduce clusters using mantri. In: OSDI, vol. 10, p. 24. (2010)

  3. Arning, A., Agrawal, R., Raghavan, P.: A linear method for deviation detection in large databases. In: KDD, pp. 164–169. (1996)

  4. Bates, D.M., Watts, D.G.: Nonlinear Regression: Iterative Estimation and Linear Approximations. Wiley, New Jersey (1988)

    Google Scholar 

  5. Borthakur, D.: The hadoop distributed file system: architecture and design. Hadoop Proj. Website 11, 21 (2007)

    Google Scholar 

  6. Chen, Y., Ganapathi, A., Katz, R.H.: To compress or not to compress-compute vs. io tradeoffs for mapreduce energy efficiency. In: Proceedings of the First ACM SIGCOMM Workshop on Green Networking, pp. 23–28. ACM (2010)

  7. Chowdhury, M., Zaharia, M., Ma, J., Jordan, M.I., Stoica, I.: Managing data transfers in computer clusters with orchestra. In: ACM SIGCOMM Computer Communication Review, vol. 41, pp. 98–109. ACM (2011)

  8. Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)

    Article  Google Scholar 

  9. Finch, T.: Incremental calculation of weighted mean and variance. University of Cambridge, Cambridge (2009)

    Google Scholar 

  10. Ghodsi, A., Zaharia, M., Hindman, B., Konwinski, A., Shenker, S., Stoica, I.: Dominant resource fairness: fair allocation of multiple resource types. In: NSDI, vol. 11, pp. 24–24 (2011)

  11. Gufler, B., Augsten, N., Reiser, A., Kemper, A.: Handing data skew in mapreduce. In: Proceedings of the 1st International Conference on Cloud Computing and Services Science, vol. 146, pp. 574–583 (2011)

  12. Gufler, B., Augsten, N., Reiser, A., Kemper, A.: Load balancing in mapreduce based on scalable cardinality estimates. In: Data Engineering (ICDE), 2012 IEEE 28th International Conference on, pp. 522–533. IEEE (2012)

  13. Hadoop mapreduce distribution http://hadoop.apache.org/docs/r1.2.1/

  14. Hadoop: Fair scheduler http://hadoop.apache.org/docs/r2.4.0/hadoop-yarn/hadoop-yarn-site/FairScheduler.html

  15. Hammoud, M., Rehman, M.S., Sakr, M.F.: Center-of-gravity reduce task scheduling to lower mapreduce network traffic. In: Cloud Computing (CLOUD), 2012 IEEE 5th International Conference on, pp. 49–58. IEEE (2012)

  16. Ibrahim, S., Jin, H., Lu, L., He, B., Antoniu, G., Wu, S.: Handling partitioning skew in mapreduce using leen. Peer Peer Netw. Appl. 6(4), 409–424 (2013)

    Article  Google Scholar 

  17. Jain, R., Chiu, D.M., Hawe, W.R.: A quantitative measure of fairness and discrimination for resource allocation in shared computer system (1984)

  18. Jalaparti, V., Ballani, H., Costa, P., Karagiannis, T., Rowstron, A.: Bridging the tenant-provider gap in cloud services. In: Proceedings of the Third ACM Symposium on Cloud Computing, p. 10. ACM (2012)

  19. Kang, J.M., Bannazadeh, H., Leon-Garcia, A.: Savi testbed: Control and management of converged virtual ict resources. In: IFIP/IEEE International Symposium on Integrated Network Management (IM 2013), 2013 pp. 664–667. IEEE (2013)

  20. Kirby, G.: Zipf’s law. UK J. Nav. Sci. 10(3), 180–185 (1985)

    Google Scholar 

  21. Kwon, Y., Balazinska, M., Howe, B., Rolia, J.: Skewtune: mitigating skew in mapreduce applications. In: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pp. 25–36. ACM (2012)

  22. Le, Y., Liu, J., Ergun, F., Wang, D.: Online load balancing for mapreduce with skewed data input. In: INFOCOM, 2014 Proceedings IEEE, pp. 2004–2012. IEEE (2014)

  23. Lin, J.: Cloud 9: A mapreduce library for hadoop (2010a). https://github.com/lintool/Cloud9

  24. Lin, J., Dyer, C.: Data-intensive text processing with mapreduce. Synth. Lect. Hum. Lang. Technol. 3(1), 1–177 (2010b)

    Article  Google Scholar 

  25. Liu, Z., Zhang, Q., Zhani, M.F., Boutaba, R., Liu, Y., Gong, Z.: Dreams: Dynamic resource allocation for mapreduce with data skew. In: IFIP/IEEE International Symposium on Integrated Network Management (IM 2015), 2015. Ottawa (2015)

  26. Papadimitriou, C.H.: Computational Complexity. Wiley, New Jersey (2003)

    MATH  Google Scholar 

  27. Papers, M.I.W.: Compression in hadoop. http://technet.microsoft.com/en-us/library/dn247618.aspx

  28. Polo, J., Carrera, D., Becerra, Y., Torres, J., Ayguadé, E., Steinder, M., Whalley, I.: Performance-driven task co-scheduling for mapreduce environments. In: Network Operations and Management Symposium (NOMS), 2010 IEEE, pp. 373–380. IEEE (2010)

  29. Ramakrishnan, S.R., Swart, G., Urmanov, A.: Balancing reducer skew in mapreduce workloads using progressive sampling. In: Proceedings of the Third ACM Symposium on Cloud Computing, p. 16. ACM (2012)

  30. Sharma, B., Prabhakar, R., Lim, S., Kandemir, M.T., Das, C.R.: Mrorchestrator: A fine-grained resource orchestration framework for mapreduce clusters. In: IEEE 5th International Conference on Cloud Computing (CLOUD), 2012, pp. 1–8. IEEE (2012)

  31. Tan, P.N., Steinbach, M., Kumar, V., et al.: Introduction to Data Mining, vol. 1. Pearson Addison Wesley, Boston (2006)

    Google Scholar 

  32. Vavilapalli, V.K., Murthy, A.C., Douglas, C., Agarwal, S., Konar, M., Evans, R., Graves, T., Lowe, J., Shah, H., Seth, S., et al.: Apache hadoop yarn: Yet another resource negotiator. In: Proceedings of the 4th annual Symposium on Cloud Computing, p. 5. ACM (2013)

  33. Verma, A., Cherkasova, L., Campbell, R.H.: Aria: automatic resource inference and allocation for mapreduce environments. In: Proceedings of the 8th ACM International Conference on Autonomic Computing, pp. 235–244. ACM (2011)

  34. White, T.: Hadoop: The definitive guide. O’Reilly Media Inc, California (2012)

    Google Scholar 

  35. Wolf, J., Rajan, D., Hildrum, K., Khandekar, R., Kumar, V., Parekh, S., Wu, K.L., Balmin, A.: Flex: A slot allocation scheduling optimizer for mapreduce workloads. In: Middleware 2010, pp. 1–20. Springer (2010)

  36. Yadwadkar, N.J., Ananthanarayanan, G., Katz, R.: Wrangler: Predictable and faster jobs using fewer resources. In: Proceedings of the ACM Symposium on Cloud Computing, pp. 1–14. ACM (2014)

  37. Zacheilas, N., Kalogeraki, V.: Real-time scheduling of skewed mapreduce jobs in heterogeneous environments. In: Proceedings of 11th International Conference on Autonomic Computing, pp. 189–200. USENIX (2014)

  38. Zaharia, M., Konwinski, A., Joseph, A.D., Katz, R.H., Stoica, I.: Improving mapreduce performance in heterogeneous environments. In: OSDI, vol. 8, p. 7 (2008)

  39. Zhang, Z., Cherkasova, L., Loo, B.T.: Autotune: Optimizing execution concurrency and resource usage in mapreduce workflows. In: ICAC, pp. 175–181 (2013)

Download references

Acknowledgments

This work is supported in part by the National Natural Science Foundation of China (No. 61472438), and in part by the Smart Applications on Virtual Infrastructure (SAVI) project funded under the National Sciences and Engineering Research Council of Canada (NSERC) Strategic Networks Grant Number NETGP394424-10.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Raouf Boutaba.

Additional information

Currently, Zhihong Liu is with University of Waterloo as a visiting student.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Liu, Z., Zhang, Q., Boutaba, R. et al. OPTIMA: On-Line Partitioning Skew Mitigation for MapReduce with Resource Adjustment. J Netw Syst Manage 24, 859–883 (2016). https://doi.org/10.1007/s10922-015-9362-8

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10922-015-9362-8

Keywords

Navigation