A Middleware Framework for Programmable Multi-GPU-Based Big Data Applications

  • Ettikan K. Karuppiah
  • Yong Keh Kok
  • Keeratpal Singh


Current applications of GPU processors to parallel computing tasks show excellent speedups compared to CPU processors. However, no existing middleware framework automatically distributes data and processing across heterogeneous computing resources for structured and unstructured Big Data applications. We therefore propose a middleware framework for "Big Data" analytics that provides mechanisms for automatic data segmentation, distribution, execution, and information retrieval across multiple cards (CPU and GPU) and machines; a modular design for easy addition of new GPU kernels at both the analytic and processing layers; and information presentation. We describe the architecture and components of the framework, including multi-card data distribution and execution, data structures for efficient memory access, and algorithms for parallel GPU computation, and we report results for various test configurations. The results show that the proposed middleware framework offers users an alternative, cheaper HPC solution. Data cleansing algorithms on the GPU achieve a speedup of over two orders of magnitude compared to the same operation in MySQL on a multi-core machine, and the framework can process more than 120 million health records within 11 s.
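The automatic data segmentation and multi-card distribution described above can be sketched as follows. This is an illustrative toy model, not the authors' implementation: the chunk size, the round-robin policy, and the device names (`cpu0`, `gpu0`, `gpu1`) are all assumptions made for the example.

```python
# Sketch of a middleware-style scheduler: split a dataset into contiguous
# chunks, then assign chunks round-robin across heterogeneous devices.

def segment(data, chunk_size):
    """Split `data` into contiguous chunks of at most `chunk_size` items."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def distribute(chunks, devices):
    """Assign chunks to devices round-robin; returns {device: [chunks]}."""
    plan = {dev: [] for dev in devices}
    for i, chunk in enumerate(chunks):
        plan[devices[i % len(devices)]].append(chunk)
    return plan

records = list(range(10))  # stand-in for parsed input records
plan = distribute(segment(records, 3), ["cpu0", "gpu0", "gpu1"])
```

In a real multi-GPU setting, each device's chunk list would be copied to that card and processed by the appropriate kernel; a production scheduler would also weight the assignment by device throughput rather than using a plain round-robin.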


Keywords: GPGPU · CUDA · GPU architecture · Big Data · High-performance computing · Middleware framework



This research was done under the joint "NVIDIA-HP-MIMOS GPU R&D and Solution Center" lab, the first GPU solution center in Southeast Asia, established in October 2012. Funding for the work came from MOSTI, Malaysia. The authors would like to thank Prof. Simon See and Pradeep Gupta from NVIDIA for their support.



Copyright information

© Springer Science+Business Media Singapore 2015

Authors and Affiliations

  1. MIMOS Berhad, Kuala Lumpur, Malaysia
