Static GPU Threads and an Improved Scan Algorithm

  • Jens Breitbart
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6586)


Current GPU programming systems automatically distribute the work across all GPU processors based on a set of fixed assumptions, e.g. that all tasks are independent of each other. We show that automatic distribution limits algorithmic design, and demonstrate that manual work distribution adds hardly any overhead. Our Scan+ algorithm is an improved scan that relies on manual work distribution. It uses global barriers and task interleaving to provide almost twice the performance of Apple’s reference implementation [1].
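
To make the idea of manual work distribution with a global barrier concrete, the following is a minimal CPU-side sketch in Python. It is not the Scan+ algorithm from the paper and does not model task interleaving. It only assumes a fixed pool of worker threads (standing in for static GPU threads), each of which owns one contiguous chunk of the input and synchronizes once with all other workers. The names static_thread_scan and num_workers are hypothetical and chosen for illustration.

```python
import threading


def static_thread_scan(data, num_workers=4):
    """Exclusive prefix sum with a fixed set of workers and one global barrier.

    Illustrative sketch only: work is assigned to workers by hand
    instead of being distributed automatically by a runtime.
    """
    n = len(data)
    out = [0] * n
    chunk_sums = [0] * num_workers
    barrier = threading.Barrier(num_workers)
    chunk = (n + num_workers - 1) // num_workers  # ceiling division

    def worker(wid):
        # Manual work distribution: worker `wid` owns data[lo:hi].
        lo = min(wid * chunk, n)
        hi = min(lo + chunk, n)

        # Phase 1: exclusive scan of the local chunk.
        running = 0
        for i in range(lo, hi):
            out[i] = running
            running += data[i]
        chunk_sums[wid] = running

        # Global barrier: every chunk total must be written
        # before any worker reads the totals of the others.
        barrier.wait()

        # Phase 2: add the combined total of all preceding chunks.
        offset = sum(chunk_sums[:wid])
        for i in range(lo, hi):
            out[i] += offset

    threads = [threading.Thread(target=worker, args=(w,))
               for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out
```

Because both phases run inside the same long-lived workers, the barrier takes the place of what would otherwise be a second launch of short-lived tasks. On a GPU, the analogous step would be a global barrier inside a single kernel instead of a second kernel call.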


Global Synchronization, Distribute Processing Symposium, Cell Broadband Engine, Kernel Call, Global Barrier
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.




  1. Apple Inc. OpenCL Parallel Prefix Sum (aka Scan) Example Version 1.5. (2010)
  2. Blelloch, G.E.: Scans as primitive parallel operations. IEEE Trans. Computers 38(11), 1526–1538 (1989)
  3. Breitbart, J., Fohry, C.: OpenCL – an effective programming model for data parallel computations at the Cell Broadband Engine. In: IEEE Int. Parallel and Distributed Processing Symposium (2010) (to appear)
  4. Kirk, D., Hwu, W.-m.: Programming Massively Parallel Processors: A Hands-on Approach, 1st edn. Morgan Kaufmann, San Francisco (February 2010)
  5. Harris, M., Sengupta, S., Owens, J.D.: Parallel prefix sum (scan) with CUDA. In: Nguyen, H. (ed.) GPU Gems 3, ch. 39, pp. 851–876. Addison Wesley, Reading (August 2007)
  6. NVIDIA Corporation. PTX: Parallel Thread Execution ISA Version 2.0 (2010)
  7. Stuart, J.A., Owens, J.D.: Message passing on data-parallel architectures. In: Stuart, J.A., Owens, J.D. (eds.) Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium (May 2009)
  8. Xiao, S., Feng, W.-c.: Inter-Block GPU Communication via Fast Barrier Synchronization. In: Proceedings of the 24th IEEE International Parallel and Distributed Processing Symposium (IPDPS), Atlanta, Georgia, USA (April 2010)

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Jens Breitbart
    Research Group Programming Languages / Methodologies, Universität Kassel, Kassel, Germany
