Super Scalar Sample Sort

  • Peter Sanders
  • Sebastian Winkel
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3221)


Sample sort, a generalization of quicksort that partitions the input into many pieces, is regarded as the best practical comparison-based sorting algorithm for distributed-memory parallel computers. We show that sample sort is also useful on a single processor. The main algorithmic insight is that element comparisons can be decoupled from expensive conditional branching by using predicated instructions. This transformation facilitates optimizations such as loop unrolling and software pipelining. The final implementation, although cache-efficient, is limited by a linear number of memory accesses rather than by the \(\mathcal{O}\!\left(n\log n\right)\) comparisons. On an Itanium 2 machine, we obtain a speedup of up to 2 over std::sort from the GCC STL, which is known as one of the fastest available quicksort implementations.
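To illustrate how comparisons can be decoupled from conditional branching, here is a minimal C++ sketch. It assumes the splitters are stored as an implicit binary search tree and that the compiler turns the comparison result into arithmetic (or a predicated instruction) instead of a jump. The helper names `fill_tree`, `make_splitter_tree`, and `bucket_of` are hypothetical and chosen for this example only; the authors' actual implementation may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: place the sorted splitters in the range [lo, hi)
// into an implicit binary search tree (root at index 1, children of
// node j at 2j and 2j+1).
inline void fill_tree(const std::vector<int>& sorted, std::vector<int>& tree,
                      std::size_t node, std::size_t lo, std::size_t hi) {
    if (lo >= hi) return;
    const std::size_t mid = lo + (hi - lo) / 2;
    tree[node] = sorted[mid];
    fill_tree(sorted, tree, 2 * node, lo, mid);
    fill_tree(sorted, tree, 2 * node + 1, mid + 1, hi);
}

// Assumption of this sketch: there are k-1 sorted splitters and k is a
// power of two, so the tree is perfectly balanced. Index 0 is unused.
inline std::vector<int> make_splitter_tree(const std::vector<int>& sorted) {
    const std::size_t k = sorted.size() + 1;
    assert(k >= 2 && (k & (k - 1)) == 0);
    std::vector<int> tree(k, 0);
    fill_tree(sorted, tree, 1, 0, sorted.size());
    return tree;
}

// Find the bucket of x without an if/else on the comparison outcome:
// the result of (x > tree[j]) is used as a 0/1 value in index arithmetic.
// The loop runs exactly log2(k) times regardless of the data, so its
// exit condition does not depend on the element values.
inline std::size_t bucket_of(int x, const std::vector<int>& tree) {
    const std::size_t k = tree.size();
    std::size_t j = 1;
    while (j < k) {
        j = 2 * j + static_cast<std::size_t>(x > tree[j]);
    }
    return j - k;  // leaves are numbered k .. 2k-1
}
```

With the splitters 10, 20, 30, calling `bucket_of(25, make_splitter_tree({10, 20, 30}))` walks right at the root (25 > 20) and left at the next node (25 is not greater than 30), ending in bucket 2. Because the trip count is fixed and the body contains no data-dependent jump, a loop of this shape is a natural candidate for the unrolling and software pipelining mentioned in the abstract.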


Sorting Algorithm, Memory Hierarchy, Loop Body, Software Pipeline, Conditional Branch
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Peter Sanders (1)
  • Sebastian Winkel (2)
  1. Max Planck Institut für Informatik, Saarbrücken, Germany
  2. Chair for Prog. Lang. and Compiler Construction, Saarland University, Saarbrücken, Germany
