Engineering a Multi-core Radix Sort

  • Jan Wassenberg
  • Peter Sanders
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6853)


We present a fast radix sorting algorithm that builds upon a microarchitecture-aware variant of counting sort. Taking advantage of virtual memory and making use of write-combining yields a per-pass throughput corresponding to at least 89% of the system’s peak memory bandwidth. Our implementation outperforms Intel’s recently published radix sort by a factor of 1.64. It also compares favorably to the reported performance of an algorithm for Fermi GPUs when data-transfer overhead is included. These results indicate that scalar, bandwidth-sensitive sorting algorithms remain competitive on current architectures. Various other memory-intensive applications can benefit from the techniques described herein.
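As a point of reference for the abstract, the basic relationship between radix sort and counting sort can be sketched as follows. This is a textbook least-significant-digit radix sort, not the authors' implementation: the virtual-memory and write-combining techniques mentioned above are not modeled, and the function names, the 8-bit digit width, the 32-bit key width, and the choice of least-significant-digit order are all assumptions made for illustration.

```python
def counting_sort_pass(keys, shift, bits=8):
    """Stable counting sort of non-negative integers by one digit.

    Hypothetical helper: the digit is the `bits`-wide field of each key
    starting at bit position `shift`.
    """
    radix = 1 << bits
    mask = radix - 1

    # 1) Histogram: count how many keys fall into each digit bucket.
    counts = [0] * radix
    for k in keys:
        counts[(k >> shift) & mask] += 1

    # 2) Exclusive prefix sum: first output position of each bucket.
    offsets = [0] * radix
    total = 0
    for d in range(radix):
        offsets[d] = total
        total += counts[d]

    # 3) Scatter: copy each key to its bucket's next free output position.
    #    Scanning the input in order keeps the pass stable.
    out = [0] * len(keys)
    for k in keys:
        d = (k >> shift) & mask
        out[offsets[d]] = k
        offsets[d] += 1
    return out


def radix_sort(keys, key_bits=32, bits=8):
    """Sort non-negative integers below 2**key_bits, one digit per pass."""
    for shift in range(0, key_bits, bits):
        keys = counting_sort_pass(keys, shift, bits)
    return keys
```

Each pass reads the input twice and writes it once in a scattered pattern, which suggests why the per-pass throughput of such an algorithm is naturally measured against memory bandwidth.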


Keywords: Cache Line, Load Imbalance, Output Position, Virtual Memory, Radix Sort
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Jan Wassenberg (1)
  • Peter Sanders (2)
  1. Fraunhofer IOSB, Ettlingen, Germany
  2. Karlsruhe Institute of Technology, Karlsruhe, Germany
