Smart Containers and Skeleton Programming for GPU-Based Systems


Abstract

In this paper, we discuss the role, design, and implementation of smart containers in the SkePU skeleton library for GPU-based systems. These containers provide an interface similar to that of C++ STL containers but internally perform runtime optimization of data transfers and runtime memory management for their operand data across the different memory units. We discuss how these containers help achieve asynchronous execution of skeleton calls while providing implicit synchronization in a data-consistent manner. Furthermore, we discuss the limitations of the original, already optimizing memory management mechanism implemented in SkePU containers, and propose and implement a new mechanism that provides stronger data consistency and improves performance by reducing communication and memory allocations. Using several applications, we show that the new mechanism can achieve significantly better performance (up to 33.4 times) than the initial mechanism for page-locked memory on a multi-GPU system.

Keywords

SkePU · Smart containers · Skeleton programming · Memory management · Runtime optimizations · GPU-based systems


Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

PELAB, Department of Computer and Information Science, Linköping University, Linköping, Sweden
