Cluster Computing

, Volume 16, Issue 1, pp 191–206 | Cite as

Breaking the MapReduce stage barrier

  • Abhishek Verma
  • Brian Cho
  • Nicolas Zea
  • Indranil Gupta
  • Roy H. Campbell
Article

Abstract

The MapReduce model uses a barrier between the Map and Reduce stages. This provides simplicity in both programming and implementation. However, in many situations, this barrier hurts performance because it is overly restrictive. Hence, we develop a method to break the barrier in MapReduce in a way that improves efficiency. Careful design of our barrier-less MapReduce framework results in equivalent generality and retains ease of programming. We motivate our case with, and experimentally study our barrier-less techniques in, a wide variety of MapReduce applications divided into seven classes. Our experiments show that our approach can achieve better job completion times than a traditional MapReduce framework. This is due primarily to the interleaving of I/O and computation, and forgoing disk-intensive work. We achieve a reduction in job completion times that is 25% on average and 87% in the best case.

Keywords

MapReduce Data-intensive computing 

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. 1.
    Apache hadoop. http://hadoop.apache.org (2011)
  2. 2.
    Ananthanarayanan, G., Kandula, S., Greenberg, A., Stoica, I., Lu, Y., Saha, B., Harris, E.: Reining in the outliers in map-reduce clusters using mantri. In: Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, OSDI’10, pp. 1–16. USENIX Association, Berkeley (2010) Google Scholar
  3. 3.
    Black, F., Scholes, M.S.: The pricing of options and corporate liabilities. J. Polit. Econ. 81(3), 637–654 (1973) CrossRefGoogle Scholar
  4. 4.
    Brants, T., Popat, A.C., Xu, P., Och, F.J., Dean, J.: Large language models in machine translation. In: Proc. of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 858–867 (2007) Google Scholar
  5. 5.
    Chu, C., Kim, S., Lin, Y., Yu, Y., Bradski, G., Ng, A., Olukotun, K.: Map-reduce for machine learning on multicore. In: Advances in Neural Information Processing Systems (NIPS), pp. 281–288 (2007) Google Scholar
  6. 6.
    Condie, T., Conway, N., Alvaro, P., Hellerstein, Elmeleegy, K., Sears, R.: Online aggregation and continuous query support in mapreduce. In: NSDI’10: Seventh USENIX Symposium on Networked Systems Design and Implementation. USENIX Association, Berkeley (2010) Google Scholar
  7. 7.
    Dean, J., Ghemawat, S.: Mapreduce: Simplified data processing on large clusters. In: Proc. of the USENIX Symposium on Operating Systems Design and Implementation (OSDI), pp. 137–149 (2004) Google Scholar
  8. 8.
    DeWitt, D., Gerber, R.: Multiprocessor hash-based join algorithms. In: Proc. of VLDB Conference. Citeseer, Princeton (1985) Google Scholar
  9. 9.
    Dyer, C., Cordova, A., Mont, A., Lin, J.: Fast, easy, and cheap: construction of statistical machine translation models with mapreduce. In: Proc. of the Third Workshop on Statistical Machine Translation (StatMT), pp. 199–207. Association for Computational Linguistics, Stroudsburg (2008) CrossRefGoogle Scholar
  10. 10.
    Elsayed, T., Lin, J., Oard, D.W.: Pairwise document similarity in large collections with mapreduce. In: Proc. of the Annual Meeting of the Association for Computational Linguistics on Human Language Technologies (HLT), pp. 265–268. Association for Computational Linguistics, Stroudsburg (2008) Google Scholar
  11. 11.
    Farivar, R., Verma, A., Chan, E., Campbell, R.: Mithra: multiple data independent tasks on a heterogeneous resource architecture. In: Proc. of IEEE International Conference on Cluster Computing and Workshops (CLUSTER), pp. 1–10 (2009). doi:10.1109/CLUSTR.2009.5289201 Google Scholar
  12. 12.
    Fix, E., Hodges, J.: Discriminatory analysis, nonparametric discrimination: Consistency properties. Tech. rep., USAF School of Aviation Medicine, Randolph Field, Texas (1951) Google Scholar
  13. 13.
    Guibas, L.J., Sedgewick, R.: A dichromatic framework for balanced trees. In: Proc. of Symposium on Foundations of Computer Science, pp. 8–21 (1978). doi:10.1109/SFCS.1978.3 Google Scholar
  14. 14.
    Gupta, A., Mumick, I.: Maintenance of materialized views: problems, techniques, and applications. Bull. Tech. Commun. 51, 3 (2010) Google Scholar
  15. 15.
    Illinois Cloud Computing Testbed. http://cloud.cs.illinois.edu/ (2011)
  16. 16.
    Isard, M., Budiu, M., Yu, Y., Birrell, A., Fetterly, D.: Dryad: distributed data-parallel programs from sequential building blocks. In: Proc. of European Conference on Computer Systems (EuroSys), pp. 59–72 (2007) Google Scholar
  17. 17.
    Mongodb. http://www.mongodb.org (2011)
  18. 18.
    Olson, M.A., Bostic, K., Seltzer, M.: Berkeley db. In: Proc. of the USENIX Annual Technical Conference (ATEC), p. 43. USENIX Association, Berkeley (1999) Google Scholar
  19. 19.
    O’Malley, O., Murthy, A.C.: Winning a 60 second dash with a yellow elephant. http://sortbenchmark.org/Yahoo2009.pdf (2009)
  20. 20.
    Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., Dewitt, D.J., Madden, S., Stonebraker, M.: A comparison of approaches to large-scale data analysis. In: Proc. of the ACM SIGMOD International Conference. ACM (2009). URL http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf Google Scholar
  21. 21.
    Popa, L., Budiu, M., Yu, Y., Isard, M.: DryadInc: Reusing work in large-scale computations. In: Proc. of Workshop on Hot Topics in Cloud Computing (HotCloud) (2009) Google Scholar
  22. 22.
    Stonebraker, M., Abadi, D., DeWitt, D., Madden, S., Paulson, E., Pavlo, A., Rasin, A.: MapReduce and parallel DBMSs: friends or foes? Commun. ACM 53(1), 64–71 (2010) CrossRefGoogle Scholar
  23. 23.
    Tokyo cabinet. http://1978th.net/tokyocabinet/ (2011)
  24. 24.
    Verma, A., Llora, X., Goldberg, D.E., Campbell, R.H.: Scaling genetic algorithms using mapreduce. In: Proc. of International Conference on Intelligent Systems Design and Applications (2009) Google Scholar
  25. 25.
    Verma, A., Zea, N., Cho, B., Gupta, I., Campbell, R.: Breaking the MapReduce stage barrier. In: Proc. of IEEE International Conference on Cluster Computing (CLUSTER), pp. 235–244 (2010) Google Scholar
  26. 26.
    White, T.: Hadoop: The Definitive Guide. O’Reilly Media, Sebastopol (2009) Google Scholar
  27. 27.
    Yang, H., Dasdan, A., Hsiao, R., Parker, D.: Map-reduce-merge: simplified relational data processing on large clusters. In: Proc. of the 2007 ACM SIGMOD International Conf. on Management of data (2007) Google Scholar
  28. 28.
    Zaharia, M., Konwinski, A., Joseph, A.D., Katz, R., Stoica, I.: Improving mapreduce performance in heterogeneous environments. In: Proc. of the USENIX Symposium on Operating Systems Design and Implementation (OSDI) (2008) Google Scholar

Copyright information

© Springer Science+Business Media, LLC 2011

Authors and Affiliations

  • Abhishek Verma
    • 1
  • Brian Cho
    • 1
  • Nicolas Zea
    • 1
  • Indranil Gupta
    • 1
  • Roy H. Campbell
    • 1
  1. 1.Department of Computer ScienceUniversity of Illinois at Urbana-ChampaignUrbanaUSA

Personalised recommendations