Engineering Parallel String Sorting

Abstract

We discuss how string sorting algorithms can be parallelized on modern multi-core shared-memory machines. As a synthesis of the best sequential string sorting algorithms and successful parallel sorting algorithms for atomic objects, we first propose string sample sort. The algorithm makes effective use of the memory hierarchy, uses additional word-level parallelism, and largely avoids branch mispredictions. Then we focus on NUMA architectures, and develop parallel multiway LCP-merge and -mergesort to reduce the number of random memory accesses to remote nodes. Additionally, we parallelize variants of multikey quicksort and radix sort that are also useful in certain situations. As a base-case sorter for LCP-aware string sorting, we describe sequential LCP-insertion sort, which calculates the LCP array and uses it to accelerate its insertions. Comprehensive experiments on five current multi-core platforms are then reported and discussed. The experiments show that our parallel string sorting implementations scale very well on real-world inputs and modern machines.

Notes

  1. We are currently working on this for a final version of this paper.

  2. See http://panthema.net/2013/pmbw/ for parallel memory bandwidth experiments.

  3. The entropy \(\frac{1}{n}\sum _i\log \frac{n}{|b_i|}\) can be used to define the amount of information gained by a set of splitters. The bucket sizes \(b_i\) can be estimated using their size within the sample.

  4. http://panthema.net/2013/malloc_count/, by one of the authors.
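
The entropy in Note 3 can be illustrated with a short sketch. It assumes that the sum runs over all n strings, with b_i the bucket containing string i, so that grouping equal terms yields the Shannon entropy of the bucket-size distribution. It further assumes base-2 logarithms and a simplified classifier with k + 1 buckets for k splitters. The names splitter_entropy and estimate_entropy are hypothetical and are not taken from the paper's implementation.

```python
import bisect
import math


def splitter_entropy(bucket_sizes):
    """Entropy in bits of a bucket-size distribution.

    Assumption: (1/n) * sum_i log(n/|b_i|) is summed over the n strings,
    which equals the sum over buckets of (|b|/n) * log2(n/|b|).
    """
    n = sum(bucket_sizes)
    if n == 0:
        return 0.0
    return sum((b / n) * math.log2(n / b) for b in bucket_sizes if b > 0)


def estimate_entropy(sample, splitters):
    """Estimate the entropy from bucket sizes observed within the sample.

    Assumption: `splitters` is sorted, and a string s goes to bucket j when
    exactly j splitters are smaller than s (k splitters give k + 1 buckets).
    """
    sizes = [0] * (len(splitters) + 1)
    for s in sample:
        sizes[bisect.bisect_left(splitters, s)] += 1
    return splitter_entropy(sizes)
```

Evenly sized buckets maximize this value: four equal buckets give 2 bits, whereas a single bucket holding every string gives 0 bits.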

References

  1. Abouelhoda, M.I., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. J. Discrete Algorithms (JDA) 2(1), 53–86 (2004)

  2. Akiba, T.: Parallel string radix sort in C++. http://github.com/iwiwi/parallel-string-radix-sort (2011). Git repository accessed November 2012

  3. Akl, S.G., Santoro, N.: Optimal parallel merging and sorting without memory conflicts. IEEE Trans. Comput. C-36(11), 1367–1369 (1987)

  4. Amir, A., Franceschini, G., Grossi, R., Kopelowitz, T., Lewenstein, M., Lewenstein, N.: Managing unbounded-length keys in comparison-driven data structures with applications to online indexing. SIAM J. Comput. 43(4), 1396–1416 (2014)

  5. Andersson, A., Nilsson, S.: Implementing radixsort. J. Exp. Algorithmics (JEA) 3, 7 (1998)

  6. Bentley, J.L., Sedgewick, R.: Fast algorithms for sorting and searching strings. In: ACM (ed.) 8th Symposium on Discrete Algorithms (SODA), pp. 360–369 (1997)

  7. Bingmann, T., Sanders, P.: Parallel string sample sort. In: 21st European Symposium on Algorithms (ESA), no. 8125 in LNCS. Springer-Verlag (2013)

  8. Blelloch, G.E., Leiserson, C.E., Maggs, B.M., Plaxton, C.G., Smith, S.J., Zagha, M.: A comparison of sorting algorithms for the connection machine CM-2. In: 3rd Symposium on Parallel Algorithms and Architectures (SPAA), pp. 3–16. ACM (1991)

  9. Brent, R.P.: The parallel evaluation of general arithmetic expressions. J. ACM (JACM) 21(2), 201–206 (1974)

  10. Cole, R.: Parallel merge sort. SIAM J. Comput. 17(4), 770–785 (1988)

  11. Dementiev, R., Kettner, L., Mehnert, J., Sanders, P.: Engineering a sorted list data structure for 32 bit keys. In: 6th Workshop on Algorithm Engineering & Experiments (ALENEX), pp. 142–151. SIAM (2004)

  12. Eberle, A.: Parallel multiway LCP-mergesort. Bachelor Thesis, Karlsruhe Institute of Technology, to appear (2014)

  13. Frazer, W.D., McKellar, A.C.: Samplesort: a sampling approach to minimal storage tree sorting. J. ACM (JACM) 17(3), 496–507 (1970)

  14. Hagerup, T.: Optimal parallel string algorithms: sorting, merging and computing the minimum. In: 26th ACM Symposium on Theory of Computing (STOC), pp. 382–391 (1994)

  15. Hoare, C.A.R.: Quicksort. Comput. J. 5(1), 10–16 (1962)

  16. Kent, C., Lewenstein, M., Sheinwald, D.: On demand string sorting over unbounded alphabets. Theor. Comput. Sci. 426, 66–74 (2012)

  17. Knöpfle, S.D.: String samplesort. Bachelor Thesis, Karlsruhe Institute of Technology, in German (2012)

  18. Knuth, D.E.: The Art of Computer Programming, Volume 3: Sorting And Searching, 2nd edn. Addison Wesley Longman Publishing Co., Inc, Redwood (1998)

  19. Kogge, P.M., Stone, H.S.: A parallel algorithm for the efficient solution of a general class of recurrence equations. IEEE Trans. Comput. C-22(8), 786–793 (1973)

  20. Kärkkäinen, J., Rantala, T.: Engineering radix sort for strings. In: 16th International Conference on String Processing and Information Retrieval (SPIRE), no. 5280 in LNCS, pp. 3–14. Springer-Verlag (2009)

  21. McIlroy, P.M., Bostic, K., McIlroy, M.D.: Engineering radix sort. Comput. Syst. 6(1), 5–27 (1993)

  22. Mehlhorn, K., Sanders, P.: Scanning multiple sequences via cache memory. Algorithmica 35(1), 75–93 (2003)

  23. Ng, W., Kakehi, K.: Cache efficient radix sort for string sorting. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. E90–A(2), 457–466 (2007)

  24. Ng, W., Kakehi, K.: Merging string sequences by longest common prefixes. IPSJ Digit. Cour. 4, 69–78 (2008)

  25. Rantala, T.: Library of string sorting algorithms in C++. http://github.com/rantala/string-sorting (2007). Git repository accessed November 2012

  26. Sanders, P.: Fast priority queues for cached memory. J. Exp. Algorithmics (JEA) 5, 7 (2000)

  27. Sanders, P., Winkel, S.: Super scalar sample sort. In: 12th European Symposium on Algorithms (ESA), LNCS, vol. 3221, pp. 784–796. Springer-Verlag (2004)

  28. Shamsundar, N.: A fast, stable implementation of mergesort for sorting text files. http://code.google.com/p/lcp-merge-string-sort (2009). Source downloaded November 2012

  29. Singler, J., Sanders, P., Putze, F.: MCSTL: The multi-core standard template library. In: Euro-Par 2007 Parallel Processing, no. 4641 in LNCS, pp. 682–694. Springer-Verlag (2007)

  30. Sinha, R., Wirth, A.: Engineering burstsort: toward fast in-place string sorting. J. Exp. Algorithmics (JEA) 15, 1–24 (2010)

  31. Sinha, R., Zobel, J.: Cache-conscious sorting of large sets of strings with dynamic tries. J. Exp. Algorithmics (JEA) 9, 1–31 (2004)

  32. Sinha, R., Zobel, J., Ring, D.: Cache-efficient string sorting using copying. J. Exp. Algorithmics (JEA) 11, 1–32 (2007)

  33. Tsigas, P., Zhang, Y.: A simple, fast parallel implementation of quicksort and its performance evaluation on SUN enterprise 10000. In: 11th Euromicro Workshop on Parallel, Distributed and Network-Based Processing (PDP), pp. 372–381. IEEE Computer Society (2003)

  34. Wassenberg, J., Sanders, P.: Engineering a multi-core radix sort. In: Euro-Par 2011 Parallel Processing, no. 6853 in LNCS, pp. 160–169. Springer-Verlag (2011)

  35. Yang, M.C.K., Huang, J.S., Chow, Y.C.: Optimal parallel sorting scheme by order statistics. SIAM J. Comput. 16(6), 990–1003 (1987)

Acknowledgments

We would like to thank the anonymous reviewer for extraordinarily thorough checking of our algorithms and proofs, and for kind suggestions on how to improve the paper.

Author information

Corresponding author

Correspondence to Timo Bingmann.

Appendix: Performance of Sequential Algorithms

We collected many sequential string sorting algorithms in our test framework. We believe it to contain virtually every string sorting implementation publicly available.

Table 4 Description of selected sequential string sorting algorithms

The algorithm library by Rantala [25] contains 37 versions of radix sort (in-place, out-of-place, and one-pass with various dynamic memory allocation schemes), 26 variants of multikey quicksort (with caching, block-based, with different dynamic memory allocation schemes, and with SIMD instructions), 10 different funnelsorts, 38 implementations of burstsort (again with different dynamic memory management schemes), and 29 mergesorts (with losertree and LCP caching variants). In total these are 140 original implementation variants, all of high quality.

The other main source of string sorting implementations are the publications of Sinha. We included the original burstsort implementations (one with dynamically growing arrays and one with linked lists), and 9 versions of copy-burstsort. The original copy-burstsort code was written for 32-bit machines, and we modified it to work with 64-bit pointers.

We also incorporated the implementations of CRadix sort and LCP-Mergesort by Ng, and the original multikey quicksort code by Bentley and Sedgewick.

Of the 203 different sequential string sorting variants, we selected the thirteen implementations listed in Table 4 to represent both the fastest ones in a preliminary test and each of the basic algorithms from Sect. 3. The thirteen algorithms were run on all our five test platforms on small portions of the test instances described in Sect. 7. Tables 5 and 6 show the results, with the fastest algorithm’s time highlighted with bold text.

Table 5 Run time of sequential algorithms on IntelE5 and AMD48 in seconds, and peak memory usage of algorithms on IntelE5
Table 6 Run time of sequential algorithms on AMD16, Inteli7, and IntelX5 in seconds

Cells without a value in the tables indicate a program error, an out-of-memory exception, or an extremely long run time. This was always the case for the copy-burstsort variants on the GOV2 and Wikipedia inputs, because they perform excessive caching of characters. On Inteli7, some implementations required more memory than the available 12 GiB to sort the 4 GiB prefixes of Random and URLs.

Table 7 Absolute run time of parallel and best sequential algorithms on IntelE5 in seconds, median of 1–3 runs

Over all run instances and platforms, multikey quicksort with caching of eight characters was fastest on 18 instance-platform pairs, winning the most tests. It was fastest on all platforms for both the URL list and the GOV2 prefixes, except URLs on IntelX5, and on all large instances on AMD48 and AMD16. However, for the NoDup input, which consists of short strings over a large alphabet, the highly tuned radix sort radixR_CE7 consistently outperformed mkqs_cache8 on all platforms by a small margin. The copy-burstsort variant fbC_burstsort was most efficient on all platforms for DNA, which consists of short strings over a small alphabet. For Random strings and Wikipedia suffixes, either mkqs_cache8 or radixR_CE7 was fastest, depending on the platform's memory bandwidth and sequential processing speed. Our own sequential implementations of \(\mathrm{S}^5\) were never the fastest, but they consistently fell in the middle of the field, without any outliers. This is expected, since \(\mathrm{S}^5\) is mainly designed to be used as an efficient top-level parallel algorithm and to be conservative with memory bandwidth, which is the limiting factor for data-intensive multi-core applications.
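
The idea behind caching eight characters can be sketched as follows. This is a minimal illustration and not the code of mkqs_cache8: it assumes byte strings without embedded zero bytes, and the names cache_key and sort_by_cached_prefix are hypothetical. The eight characters starting at the current depth are packed into one 64-bit integer, so that a single integer comparison replaces up to eight character comparisons.

```python
def cache_key(s: bytes, depth: int) -> int:
    """Pack the eight characters s[depth:depth+8] into one 64-bit integer.

    Big-endian packing puts the first character in the most significant
    byte, so comparing two keys as integers agrees with comparing the
    underlying characters lexicographically. Positions past the end of the
    string are padded with zero bytes (assumption: no embedded zero bytes).
    """
    chunk = s[depth:depth + 8]
    return int.from_bytes(chunk.ljust(8, b"\0"), "big")


def sort_by_cached_prefix(strings, depth=0):
    """Order strings by their cached key at `depth`; ties need deeper sorting."""
    return sorted(strings, key=lambda s: cache_key(s, depth))
```

Strings with equal keys share all eight cached characters, so only those groups have to be examined again at depth + 8. The cached keys are stored next to the string pointers, which is consistent with the higher peak memory usage reported for caching variants.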

We also measured the peak memory usage of the sequential implementations using a heap and stack profiling tool (see Note 4) for the selected sequential test instances. The bottom of Table 5 shows the results in MiB, excluding the string data array and the string pointer array (we only have 64-bit systems, so pointers are eight bytes). We must note that the profiler considers allocated virtual memory, which may not be identical to the amount of physical memory actually used. The table plainly shows that the more caching an implementation does, the higher its peak memory allocation. However, the memory usage of fbC_burstsort is extreme, even if one considers that the implementation can deallocate and recreate the string data from the burst trie. The lower memory usage of fbC_burstsort for Random is due to the high percentage of characters stored implicitly in the trie structure. The sCPL_burstsort and burstsortA variants bring the memory requirement down somewhat, but it remains high. Some radix sort variants and, most notably, mkqs_cache8 are also not particularly memory conservative, again due to caching. Our sequential \(\mathrm{S}^5\) implementation fares well in this comparison because it does no caching and permutes the string pointers in-place. (Note that radix sort is used for small string subsets in sequential \(\mathrm{S}^5\). This is due to the development history: we finished sequential \(\mathrm{S}^5\) before focusing on caching multikey quicksort.) For sorting with little extra memory, plain multikey quicksort is still a good choice (Tables 7–13).
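
To illustrate why plain multikey quicksort needs little extra memory, the following is a minimal sketch in the style of Bentley and Sedgewick's ternary partitioning. It is not the benchmarked implementation. It assumes byte strings and treats the end of a string as a character smaller than all others. It only swaps entries of the list in place and caches no characters.

```python
def char_at(s: bytes, depth: int) -> int:
    """Character of s at position depth, or -1 past the end of the string."""
    return s[depth] if depth < len(s) else -1


def multikey_quicksort(strings, lo=0, hi=None, depth=0):
    """Sort strings[lo:hi] in place; all of them share a prefix of length depth."""
    if hi is None:
        hi = len(strings)
    while hi - lo > 1:
        pivot = char_at(strings[(lo + hi) // 2], depth)
        # Three-way partition by the character at `depth`:
        # [lo, lt) is smaller, [lt, i) is equal, (gt, hi) is larger.
        lt, i, gt = lo, lo, hi - 1
        while i <= gt:
            c = char_at(strings[i], depth)
            if c < pivot:
                strings[lt], strings[i] = strings[i], strings[lt]
                lt += 1
                i += 1
            elif c > pivot:
                strings[i], strings[gt] = strings[gt], strings[i]
                gt -= 1
            else:
                i += 1
        # Smaller and larger parts are sorted at the same depth.
        multikey_quicksort(strings, lo, lt, depth)
        multikey_quicksort(strings, gt + 1, hi, depth)
        if pivot < 0:
            return  # the equal part consists of identical, fully read strings
        # The equal part continues with the next character.
        lo, hi, depth = lt, gt + 1, depth + 1
```

Calling multikey_quicksort(words) on a list of byte strings reorders the list itself. Apart from the recursion stack, no auxiliary arrays are allocated.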

Table 8 Absolute run time of parallel and best sequential algorithms on AMD48 in seconds, median of 1–3 runs
Table 9 Absolute run time of parallel and best sequential algorithms on AMD16 in seconds, median of 1–3 runs
Table 10 Absolute run time of parallel and best sequential algorithms on Inteli7 in seconds, median of fifteen runs, larger test instances
Table 11 Absolute run time of parallel and best sequential algorithms on Inteli7 in seconds, median of fifteen runs, smaller test instances
Table 12 Absolute run time of parallel and best sequential algorithms on IntelX5 in seconds, median of fifteen runs, larger test instances
Table 13 Absolute run time of parallel and best sequential algorithms on IntelX5 in seconds, median of fifteen runs, smaller test instances

About this article

Cite this article

Bingmann, T., Eberle, A. & Sanders, P. Engineering Parallel String Sorting. Algorithmica 77, 235–286 (2017). https://doi.org/10.1007/s00453-015-0071-1

Keywords

  • Parallel string sorting
  • String sorting
  • Sample sort
  • Merge sort
  • LCP-merge sort
  • LCP-insertion sort
  • Super scalar string sample sort