Improving MapReduce Performance with Partial Speculative Execution

Journal of Grid Computing

Abstract

The MapReduce framework has become the de facto standard for big data processing, in part because it automatically parallelizes a job into multiple tasks and transparently handles task execution on a large cluster of commodity machines. The increasing heterogeneity of distributed environments can produce a few straggling tasks, which prolong job completion. Speculative execution was proposed to mitigate stragglers. However, the existing speculative execution mechanism does not work efficiently, because many speculative tasks are still slower than their original tasks. In this paper, we explore an approach that increases the efficiency of speculative execution and thereby further improves MapReduce performance. We propose the Partial Speculative Execution (PSE) strategy, in which speculative tasks start from a checkpoint of the original task. By leveraging this checkpoint, PSE eliminates the costs of re-reading, re-copying, and re-computing the already processed data. We implement PSE in Hadoop and evaluate its job completion time and the efficiency of its speculative execution under several classical workloads. Experimental results show that, in heterogeneous environments with stragglers, PSE completes jobs 56% faster than execution without speculation and 12% faster than LATE, an improved speculative execution algorithm. In addition, PSE improves the efficiency of speculative execution by 24% on average compared to LATE.
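The checkpoint idea in the abstract can be illustrated with a small, self-contained sketch. This is not the authors' Hadoop implementation and uses no Hadoop API: the `Checkpoint` class, the `run_task` function, and the toy "sum the records" workload are all hypothetical names chosen for illustration. It only shows why a speculative attempt that resumes from the original task's checkpoint has less work left than one that restarts from the first record.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Checkpoint:
    """Hypothetical checkpoint: progress and partial output of a task attempt."""
    next_index: int   # index of the first record not yet processed
    partial_sum: int  # partial result accumulated so far


def run_task(records: List[int],
             start: Optional[Checkpoint] = None,
             stop_after: Optional[int] = None) -> Tuple[Checkpoint, int]:
    """Sum the records, optionally resuming from a checkpoint.

    stop_after simulates a straggler that stalls after processing that many
    records. Returns the resulting checkpoint and the number of records this
    attempt processed itself.
    """
    index = start.next_index if start else 0
    total = start.partial_sum if start else 0
    processed = 0
    while index < len(records):
        if stop_after is not None and processed >= stop_after:
            break
        total += records[index]
        index += 1
        processed += 1
    return Checkpoint(index, total), processed


records = list(range(1, 11))  # toy input split with 10 records

# The original (straggling) attempt stalls after 7 records and leaves a checkpoint.
straggler_ckpt, _ = run_task(records, stop_after=7)

# Conventional speculation: the backup attempt starts from the first record.
full_result, full_work = run_task(records)

# Checkpoint-based speculation: the backup attempt resumes where the original stopped.
partial_result, partial_work = run_task(records, start=straggler_ckpt)

print(full_result.partial_sum, full_work)        # same answer, 10 records processed
print(partial_result.partial_sum, partial_work)  # same answer, 3 records processed
```

Both backup attempts produce the same result, but the resumed attempt only processes the records the straggler had not reached. A real system would also have to persist and transfer the checkpoint, which this sketch ignores.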

Author information

Corresponding author

Correspondence to Weiming Lu.

About this article

Cite this article

Wang, Y., Lu, W., Lou, R. et al. Improving MapReduce Performance with Partial Speculative Execution. J Grid Computing 13, 587–604 (2015). https://doi.org/10.1007/s10723-015-9350-y

  • DOI: https://doi.org/10.1007/s10723-015-9350-y
