Cluster Computing, Volume 17, Issue 2, pp 293–301

Accelerating MapReduce framework on multi-GPU systems

  • Hai Jiang
  • Yi Chen
  • Zhi Qiao
  • Kuan-Ching Li
  • WonWoo Ro
  • Jean-Luc Gaudiot


Graphics processors are evolving rapidly and promise power-efficient, cost-effective, and scalable high-performance computing with differentiated price-performance. MapReduce is a well-known distributed programming model that eases the development of applications for large-scale data processing on large numbers of commodity CPUs. Compared to CPUs, GPUs offer an order of magnitude more computational power and memory bandwidth, but they are harder to program. Although several studies have implemented the MapReduce model on GPUs, most are based on a single-GPU model, are bounded by the memory of that GPU, and rely on inefficient atomic operations. This paper focuses on the development of MGMR, a standalone MapReduce system that uses multiple GPUs to process large-scale data beyond the GPU memory limit and to eliminate serial atomic operations. Experimental results demonstrate the effectiveness of MGMR in handling large data sets.
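The abstract does not describe MGMR's internals, so the following is only a CPU-side Python sketch of the generic MapReduce pattern it refers to, not the authors' implementation. It assumes a word-count workload, processes the input in fixed-size chunks (loosely standing in for data that exceeds one device's memory), and groups intermediate pairs by sorting rather than by updating shared counters, which is one common way to avoid atomic operations. All function names and the `chunk_size` parameter are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def map_chunk(chunk):
    """Map phase: emit a (word, 1) pair for every word in one input chunk."""
    return [(word, 1) for line in chunk for word in line.split()]


def reduce_sorted(pairs):
    """Reduce phase: sum the values of each run of equal keys.

    Assumes `pairs` is already sorted by key, so equal keys are adjacent.
    """
    return {
        key: sum(value for _, value in group)
        for key, group in groupby(pairs, key=itemgetter(0))
    }


def word_count(lines, chunk_size=2):
    """Map the input chunk by chunk, group by sorting, then reduce."""
    pairs = []
    for start in range(0, len(lines), chunk_size):
        # Each chunk is mapped independently; no state is shared between chunks.
        pairs.extend(map_chunk(lines[start:start + chunk_size]))
    # Sorting brings equal keys together, so no shared counter is ever updated.
    pairs.sort(key=itemgetter(0))
    return reduce_sorted(pairs)


if __name__ == "__main__":
    print(word_count(["gpu map reduce", "map reduce", "gpu"]))
```

Because each chunk is mapped independently, the result does not depend on `chunk_size`; in a multi-device setting the chunks could in principle be assigned to different devices, although how MGMR partitions its input is not stated in the abstract.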


Keywords: GPU, MapReduce, Large scale data processing, Multi-GPUs



Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. This research is based upon work partially supported by the National Science Foundation, USA (Awards No. 0918970 and CCF-1065448), the National Science Council (NSC), Taiwan, under grants NSC101-2221-E-126-002 and NSC101-2915-I-126-001, and NVIDIA.



Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • Hai Jiang (1)
  • Yi Chen (1)
  • Zhi Qiao (1)
  • Kuan-Ching Li (2)
  • WonWoo Ro (3)
  • Jean-Luc Gaudiot (4)

  1. Dept. of Computer Science, Arkansas State University, Jonesboro, USA
  2. Dept. of Computer Science and Information Engr., Providence University, Taichung, Taiwan
  3. School of Electrical and Electronic Engineering, Yonsei University, Seoul, Korea
  4. Dept. of Electrical Engr. and Computer Science, University of California, Irvine, Irvine, USA
