Cluster Computing, Volume 18, Issue 1, pp 369–383

Scaling up MapReduce-based Big Data Processing on Multi-GPU systems

  • Hai Jiang
  • Yi Chen
  • Zhi Qiao
  • Tien-Hsiung Weng
  • Kuan-Ching Li


MapReduce is a popular data-parallel processing model that has benefited from recent advances in computing technology and has been widely exploited for large-scale data analysis. The high demand for MapReduce has stimulated the investigation of MapReduce implementations on different architectural models and computing paradigms, such as multi-core clusters, Clouds, Cubieboards, and GPUs. In particular, current GPU-based MapReduce approaches mainly focus on single-GPU algorithms and cannot handle large data sets due to the limited GPU memory capacity. Building on the previous multi-GPU MapReduce version MGMR, this paper proposes an upgraded version, MGMR++, which eliminates the GPU memory limitation, and a pipelined version, PMGMR, which addresses the Big Data challenge through both CPU memory and hard disks. MGMR++ extends MGMR with flexible C++ templates and CPU memory utilization, while PMGMR fine-tunes performance through hard disk utilization and the latest GPU features such as streams and Hyper-Q. Compared to MGMR (Jiang et al., Cluster Computing 2013), the proposed schemes achieve about a 2.5-fold performance improvement, increase system scalability, and allow programmers to write straightforward MapReduce code for Big Data.
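The out-of-core idea the abstract describes — splitting input that exceeds device memory into chunks, processing each chunk, and merging partial results in host memory — can be sketched as follows. This is a hedged CPU-only illustration, not the paper's actual API: the function name `chunked_map_reduce`, its signature, and the sequential chunk loop (which in PMGMR would be overlapped on CUDA streams) are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch of chunked, out-of-core MapReduce: input larger
// than device memory is split into chunks of at most chunk_size items,
// each chunk is mapped (on the GPU in PMGMR; here on the CPU), and the
// partial key/value results are merged in host memory.
template <typename In, typename K, typename V>
std::map<K, V> chunked_map_reduce(
    const std::vector<In>& input,
    std::size_t chunk_size,  // stands in for the GPU memory capacity
    const std::function<std::pair<K, V>(const In&)>& map_fn,
    const std::function<V(V, V)>& reduce_fn) {
  std::map<K, V> result;  // host-side merged output
  for (std::size_t begin = 0; begin < input.size(); begin += chunk_size) {
    std::size_t end = std::min(begin + chunk_size, input.size());
    // In the pipelined scheme this chunk would be staged to a GPU on its
    // own CUDA stream, overlapping transfer with compute; here it is a
    // plain sequential loop for illustration.
    for (std::size_t i = begin; i < end; ++i) {
      auto [k, v] = map_fn(input[i]);
      auto it = result.find(k);
      if (it == result.end()) {
        result.emplace(k, v);
      } else {
        it->second = reduce_fn(it->second, v);
      }
    }
  }
  return result;
}
```

For example, a word count over four items with a chunk size of two processes two chunks and merges their counts into the same host-side map, mimicking how per-chunk GPU results would be combined.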


Keywords: GPU · Multi-GPU · MapReduce · Pipeline · Big Data · Parallel processing



Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies or institutions. This research is based upon work partially supported by the National Science Foundation, USA under grant No. 0959124, the Ministry of Science and Technology (MOST), Taiwan under grant MOST 103-2221-E-126-010-, the Providence University research project under grant PU102-11100-A12, and NVIDIA through CUDA Center Awards.


  1. Jiang, H., Chen, Y., Qiao, Z., Li, K.-C., Ro, W., Gaudiot, J.-L.: Accelerating MapReduce framework on multi-GPU systems. Cluster Computing, pp. 1–9. Springer, Berlin (2013)
  2. Cubieboards: an Open ARM Mini PC (2014)
  3. CUDA Programming Guide 6.0, NVIDIA (2014)
  4. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)
  5. Chen, Y., Qiao, Z., Jiang, H., Li, K.-C., Ro, W.W.: MGMR: multi-GPU based MapReduce. Grid and Pervasive Computing. Lecture Notes in Computer Science, vol. 7861, pp. 433–442. Springer, Berlin (2013)
  6. Bollier, D., Firestone, C.M.: The Promise and Peril of Big Data. Communications and Society Program, Aspen Institute, Washington, DC (2010)
  7. Jinno, R., Seki, K., Uehara, K.: Parallel distributed trajectory pattern mining using MapReduce. In: Proceedings of the IEEE 4th International Conference on Cloud Computing Technology and Science, pp. 269–273 (2012)
  8. Lee, D., Dinov, I., Dong, B., Gutman, B., Yanovsky, I., Toga, A.W.: CUDA optimization strategies for compute- and memory-bound neuroimaging algorithms. Comput. Methods Programs Biomed. 106, 175 (2012)
  9. Raina, R., Madhavan, A., Ng, A.Y.: Large-scale deep unsupervised learning using graphics processors. In: Proceedings of the 26th International Conference on Machine Learning, Canada (2009)
  10. Fadika, Z., Dede, E., Hartog, J., Govindaraju, M.: MARLA: MapReduce for heterogeneous clusters. In: Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp. 49–56 (2012)
  11. Stuart, J.A., Owens, J.D.: Multi-GPU MapReduce on GPU clusters. In: Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium, pp. 1068–1079 (2011)
  12. Foster, I., Kesselman, C.: The Grid 2: Blueprint for a New Computing Infrastructure. Morgan Kaufmann (2003)
  13. Czajkowski, K., Fitzgerald, S., Foster, I., Kesselman, C.: Grid information services for distributed resource sharing. In: Proceedings of the 10th IEEE International Symposium on High Performance Distributed Computing, pp. 181–194 (2001)
  14. White, T.: Hadoop: The Definitive Guide. O’Reilly Media, Sebastopol (2012)
  15. Chen, L., Huo, X., Agrawal, G.: Accelerating MapReduce on a coupled CPU-GPU architecture. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (2012)
  16. Nakada, H., Ogawa, H., Kudoh, T.: Stream processing with big data: SSS-MapReduce. In: Proceedings of the 2012 IEEE 4th International Conference on Cloud Computing Technology and Science, pp. 618–621 (2012)
  17. Ji, F., Ma, X.: Using shared memory to accelerate MapReduce on graphics processing units. In: Proceedings of the IEEE International Parallel & Distributed Processing Symposium, pp. 805–816 (2011)
  18. Chen, L., Agrawal, G.: Optimizing MapReduce for GPUs with effective shared memory usage. In: Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing, pp. 199–210 (2012)
  19. Shainer, G., Ayoub, A., Lui, P., Liu, T., Kagan, M., Trott, C.R., Scantlen, G., Crozier, P.S.: The development of Mellanox/NVIDIA GPU Direct over InfiniBand: a new model for GPU to GPU communications. Computer Science - Research and Development, pp. 267–273. Springer, Berlin (2011)
  20. Fang, W., He, B., Luo, Q., Govindaraju, N.K.: Mars: accelerating MapReduce with graphics processors. IEEE Trans. Parallel Distrib. Syst. 22(4), 608–620 (2011)
  21. Elteir, M., Lin, H., Feng, W.-C., Scogland, T.R.W.: StreamMR: an optimized MapReduce framework for AMD GPUs. In: IEEE 17th International Conference on Parallel and Distributed Systems, pp. 364–371 (2011)
  22. Tuning CUDA Applications for Kepler, NVIDIA
  23. Bell, N., Hoberock, J.: Thrust: a productivity-oriented library for CUDA. In: GPU Computing Gems: Jade Edition, pp. 359–371. Morgan Kaufmann (2011)
  24. Li, X., Lu, P., Schaeffer, J., Shillington, J., Wong, P.S., Shi, H.: On the versatility of parallel sorting by regular sampling. Parallel Comput. 19(10), 1079–1103 (1993)
  25. Przydatek, B.: A fast approximation algorithm for the subset-sum problem. Int. Trans. Oper. Res. 9(4), 437–459 (2002)
  26. FERMI Compute Architecture White Paper, NVIDIA
  27. Yu, S., Tranchevent, L.-C., De Moor, B., Moreau, Y.: Optimized data fusion for kernel k-means clustering. IEEE Trans. Pattern Anal. Mach. Intell. 34(5), 1031–1039 (2012)

Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Hai Jiang (1)
  • Yi Chen (1)
  • Zhi Qiao (1)
  • Tien-Hsiung Weng (2)
  • Kuan-Ching Li (2)

  1. Department of Computer Science, Arkansas State University, Jonesboro, USA
  2. Department of Computer Science and Information Engineering, Providence University, Taichung, Taiwan
