Sorting and Permuting without Bank Conflicts on GPUs

  • Peyman Afshani
  • Nodari Sitchinava
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9294)


In this paper, we look at the complexity of designing algorithms without any bank conflicts in the shared memory of Graphics Processing Units (GPUs). Given an input of size n, w processors and w memory banks, we study three fundamental problems: sorting, permuting and w-way partitioning (defined as sorting an input containing exactly n/w copies of every integer in [w]).
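To make the setting concrete, the following is a minimal sketch of a banked memory and of a w-way partitioning input. It is not taken from the paper: the interleaved layout (address a lives in bank a mod w), the rule that several processors touching the same address do not conflict, the choice of keys 0..w-1 for [w], and the helper names `conflict_degree` and `make_partition_input` are all assumptions made for illustration.

```python
from collections import Counter


def conflict_degree(addresses, w):
    """Largest number of distinct addresses from one step that share a bank.

    Assumption: address a is stored in bank a % w (interleaved layout),
    and repeated accesses to the same address count once.
    A degree of 1 means the step is free of bank conflicts.
    """
    per_bank = Counter(a % w for a in set(addresses))
    return max(per_bank.values())


def make_partition_input(n, w):
    """Hypothetical w-way partitioning input: n/w copies of each key 0..w-1."""
    assert n % w == 0, "n must be a multiple of w"
    return [i % w for i in range(n)]


w = 4
# A w x w matrix stored row-major: element (r, c) sits at address r * w + c.
row_access = [0 * w + c for c in range(w)]  # w processors read one row
col_access = [r * w + 0 for r in range(w)]  # w processors read one column

print(conflict_degree(row_access, w))  # each address falls in its own bank
print(conflict_degree(col_access, w))  # all addresses fall in the same bank
```

Under these assumptions, reading a row touches every bank once, while reading a column sends all w requests to a single bank. This is the kind of access pattern an algorithm without bank conflicts has to avoid.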

We solve sorting in optimal \(O(\frac{n}{w} \log n)\) time. When \(n \ge w^2\), we solve the partitioning problem optimally in \(O(\frac{n}{w})\) time. We also present a general solution for the partitioning problem which takes \(O(\frac{n}{w} \log^3_{n/w} w)\) time. Finally, we solve the permutation problem using a randomized algorithm in \(O(\frac{n}{w} \log\log\log_{n/w} n)\) time. Our results show evidence that, when working with banked memory architectures, there is a separation between these problems, and that the permutation and partitioning problems are not as easy as simple parallel scanning.
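As a rough aid for comparing these bounds, the sketch below evaluates the expressions inside the O(.) notation for given n and w. It ignores hidden constants, so the numbers indicate growth only, not running times. The function name `growth_terms` and the use of base-2 logarithms for the sorting term are assumptions. The permutation term is left out because the iterated logarithm is not meaningful for small arguments.

```python
import math


def growth_terms(n, w):
    """Evaluate the terms inside the stated O(.) bounds (constants ignored)."""
    assert w >= 2 and n > w
    base = n / w
    log_w = math.log(w) / math.log(base)  # log_{n/w} w via change of base
    return {
        "sorting": base * math.log2(n),          # (n/w) log n, base 2 assumed
        "partition_large_n": base,               # n/w, stated for n >= w^2
        "partition_general": base * log_w ** 3,  # (n/w) (log_{n/w} w)^3
    }


terms = growth_terms(n=1024, w=32)  # here n equals w^2
for name, value in terms.items():
    print(name, value)
```

When n is at least w^2, n/w is at least w, so the factor log_{n/w} w is at most 1. The general partitioning expression is then no larger than n/w, which agrees with the O(n/w) bound stated for that range.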


Keywords: Hash Function · Shared Memory · Global Memory · Partition Problem · Memory Bank





Copyright information

© Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  • Peyman Afshani (1)
  • Nodari Sitchinava (2)
  1. MADALGO, Aarhus University, Aarhus, Denmark
  2. University of Hawaii, Manoa, USA
