
A comparative study of GPU programming models and architectures using neural networks


Recently, General-Purpose Graphics Processing Units (GP-GPUs) have emerged as an attractive technology for accelerating data-parallel algorithms. Several GPU architectures and programming models have established their niche in the High-Performance Computing (HPC) community. Massively parallel architectures such as Nvidia's Fermi and AMD/ATI's Radeon pack tremendous computing power into a large number of multiprocessors. Their performance is unleashed using one of two GP-GPU programming models: Compute Unified Device Architecture (CUDA) and Open Computing Language (OpenCL). Both offer constructs and features that have a direct bearing on application runtime performance. In this paper, we compare the two GP-GPU architectures and the two programming models using a two-level character recognition network. The two-level network is developed using four different Spiking Neural Network (SNN) models, each with a different ratio of computation-to-communication requirements. To compare the architectures, we chose the two extremes of the SNN models for implementation of the aforementioned two-level network. An architectural performance comparison of the SNN application running on Nvidia's Fermi and AMD/ATI's Radeon is performed using the OpenCL programming model, exploring all optimization strategies applicable to the two architectures. To compare the programming models, we implement the two-level network on Nvidia's Tesla C2050, which is based on the Fermi architecture. We present a hierarchy of implementations in which we successively add optimization techniques associated with the two programming models. We then compare the two programming models at these different levels of implementation and also examine the effect of network size (problem size) on performance.
We report significant application speed-up, as high as 1095× for the most computation-intensive SNN neuron model, over a serial implementation on an Intel Core 2 Quad host. The comprehensive study presented in this paper establishes connections between programming models, architectures, and applications.
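The SNN models compared in the paper range from the compute-light Izhikevich model to the compute-heavy Hodgkin-Huxley model, and this computation-to-communication ratio drives the observed speed-ups. As a minimal illustration of what "computation per neuron per time step" means at the light end of that range, the sketch below Euler-integrates a single Izhikevich neuron in Python. This is not the authors' GPU code: the parameter values are the standard "regular spiking" settings, and the constant input current `I = 10` is an assumption chosen simply to make the neuron fire.

```python
def simulate_izhikevich(a=0.02, b=0.2, c=-65.0, d=8.0, I=10.0,
                        t_ms=1000.0, dt=0.5):
    """Euler-integrate one Izhikevich neuron; return spike times in ms.

    The model needs only two coupled first-order ODEs per time step:
        dv/dt = 0.04 v^2 + 5 v + 140 - u + I
        du/dt = a (b v - u)
    with a reset (v -> c, u -> u + d) whenever v crosses 30 mV.
    """
    v = -65.0          # membrane potential (mV)
    u = b * v          # membrane recovery variable
    spikes = []
    for step in range(int(t_ms / dt)):
        v += dt * (0.04 * v * v + 5.0 * v + 140.0 - u + I)
        u += dt * a * (b * v - u)
        if v >= 30.0:  # spike: record time, then reset
            spikes.append(step * dt)
            v = c
            u += d
    return spikes

spikes = simulate_izhikevich()
print(f"{len(spikes)} spikes in 1 s of simulated time")
```

Because each neuron's update is this cheap, a GPU implementation of the Izhikevich model spends proportionally more time on memory traffic and spike communication than on arithmetic, whereas Hodgkin-Huxley-style models amortize that traffic over far more floating-point work per neuron.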




Author information

Correspondence to Melissa C. Smith.


About this article

Cite this article

Pallipuram, V.K., Bhuiyan, M. & Smith, M.C. A comparative study of GPU programming models and architectures using neural networks. J Supercomput 61, 673–718 (2012). https://doi.org/10.1007/s11227-011-0631-3



Keywords

  • CUDA
  • OpenCL
  • Fermi
  • Radeon
  • Spiking neural network
  • Programming models
  • Architectures
  • Speed-up
  • Profiler counters