
mpCache: Accelerating MapReduce with Hybrid Storage System on Many-Core Clusters

  • Bo Wang
  • Jinlei Jiang
  • Guangwen Yang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8707)

Abstract

As a widely used programming model and implementation for processing large data sets, MapReduce does not scale well on many-core clusters, which, unfortunately, are common in current data centers. To address this problem, this paper: 1) analyzes the causes of MapReduce's poor scalability on many-core clusters and identifies the key one: the underlying low-speed storage (hard disk) cannot meet the demands of frequent I/O operations; and 2) proposes mpCache, an SSD-based hybrid storage system that caches both Input Data and Localized Data and dynamically tunes the cache space allocation between them to make full use of the space. mpCache has been incorporated into Hadoop and evaluated on a 7-node cluster with 13 benchmarks. The experimental results show that mpCache achieves an average speedup of 2.09 over the original Hadoop and an average speedup of 1.79 over PACMan, the latest in-memory optimization of MapReduce.

Keywords

Cache Size, Average Speedup, Reduce Phase, Data Node, Solid State Drive
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© IFIP International Federation for Information Processing 2014

Authors and Affiliations

  • Bo Wang (1)
  • Jinlei Jiang (1, 2)
  • Guangwen Yang (1)
  1. Department of Computer Science and Technology, Tsinghua National Laboratory for Information Science and Technology (TNLIST), Tsinghua University, Beijing, China
  2. Technology Innovation Center at Yinzhou, Yangtze Delta Region Institute of Tsinghua University, Zhejiang, China
