Abstract
We discuss how string sorting algorithms can be parallelized on modern multi-core shared memory machines. As a synthesis of the best sequential string sorting algorithms and successful parallel sorting algorithms for atomic objects, we first propose string sample sort. The algorithm makes effective use of the memory hierarchy, uses additional word level parallelism, and largely avoids branch mispredictions. Then we focus on NUMA architectures, and develop parallel multiway LCP-merge and -mergesort to reduce the number of random memory accesses to remote nodes. Additionally, we parallelize variants of multikey quicksort and radix sort that are also useful in certain situations. As base-case sorter for LCP-aware string sorting we describe sequential LCP-insertion sort which calculates the LCP array and accelerates its insertions using it. Comprehensive experiments on five current multi-core platforms are then reported and discussed. The experiments show that our parallel string sorting implementations scale very well on real-world inputs and modern machines.
Notes
We are currently working on this for a final version of this paper.
See http://panthema.net/2013/pmbw/ for parallel memory bandwidth experiments.
The entropy \(\frac{1}{n}\sum _i\log \frac{n}{|b_i|}\) can be used to define the amount of information gained by a set of splitters. The bucket sizes \(b_i\) can be estimated using their size within the sample.
http://panthema.net/2013/malloc_count/, by one of the authors.
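As a rough illustration of the splitter entropy in the third note, the following Python sketch evaluates it on a sample. It assumes that the sum runs over the strings \(i\) and that \(b_i\) denotes the bucket containing string \(i\), so that the expression equals the Shannon entropy of the bucket-size distribution. This reading, the function name splitter_entropy, and the simplified classification without equality buckets are assumptions made for the example, not details taken from the paper.

```python
import math
from bisect import bisect_left
from collections import Counter


def splitter_entropy(sample, splitters):
    """Estimate the information (in bits per string) gained by `splitters`.

    `sample` is a list of strings; `splitters` is a sorted list of strings.
    Hypothetical helper: bucket sizes are estimated from the sample only.
    """
    n = len(sample)
    if n == 0:
        return 0.0
    # Assumption: a string's bucket is the number of splitters strictly
    # smaller than it, so strings equal to a splitter share the bucket
    # directly below that splitter (no separate equality buckets).
    bucket_sizes = Counter(bisect_left(splitters, s) for s in sample)
    # (1/n) * sum over strings of log(n / |b_i|), grouped by bucket.
    return sum(size * math.log2(n / size) for size in bucket_sizes.values()) / n
```

Under these assumptions, a splitter set that divides the sample into two equally sized buckets yields one bit, while an empty splitter set yields zero bits.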
References
Abouelhoda, M.I., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. J. Discrete Algorithms (JDA) 2(1), 53–86 (2004)
Akiba, T.: Parallel string radix sort in C++. http://github.com/iwiwi/parallel-string-radix-sort (2011). Git repository accessed November 2012
Akl, S.G., Santoro, N.: Optimal parallel merging and sorting without memory conflicts. IEEE Trans. Comput. 100(11), 1367–1369 (1987)
Amir, A., Franceschini, G., Grossi, R., Kopelowitz, T., Lewenstein, M., Lewenstein, N.: Managing unbounded-length keys in comparison-driven data structures with applications to online indexing. SIAM J. Comput. 43(4), 1396–1416 (2014)
Andersson, A., Nilsson, S.: Implementing radixsort. J. Exp. Algorithmics (JEA) 3, 7 (1998)
Bentley, J.L., Sedgewick, R.: Fast algorithms for sorting and searching strings. In: ACM (ed.) 8th Symposium on Discrete Algorithms (SODA), pp. 360–369 (1997)
Bingmann, T., Sanders, P.: Parallel string sample sort. In: 21st European Symposium on Algorithms (ESA), no. 8125 in LNCS. Springer-Verlag (2013)
Blelloch, G.E., Leiserson, C.E., Maggs, B.M., Plaxton, C.G., Smith, S.J., Zagha, M.: A comparison of sorting algorithms for the connection machine CM-2. In: 3rd Symposium on Parallel Algorithms and Architectures (SPAA), pp. 3–16. ACM (1991)
Brent, R.P.: The parallel evaluation of general arithmetic expressions. J. ACM (JACM) 21(2), 201–206 (1974)
Cole, R.: Parallel merge sort. SIAM J. Comput. 17(4), 770–785 (1988)
Dementiev, R., Kettner, L., Mehnert, J., Sanders, P.: Engineering a sorted list data structure for 32 bit keys. In: 6th Workshop on Algorithm Engineering & Experiments (ALENEX), pp. 142–151. SIAM (2004)
Eberle, A.: Parallel multiway LCP-mergesort. Bachelor Thesis, Karlsruhe Institute of Technology, to appear (2014)
Frazer, W.D., McKellar, A.C.: Samplesort: a sampling approach to minimal storage tree sorting. J. ACM (JACM) 17(3), 496–507 (1970)
Hagerup, T.: Optimal parallel string algorithms: sorting, merging and computing the minimum. In: 16th ACM Symposium on Theory of Computing (STOC), pp. 382–391 (1994)
Hoare, C.A.R.: Quicksort. Comput. J. 5(1), 10–16 (1962)
Kent, C., Lewenstein, M., Sheinwald, D.: On demand string sorting over unbounded alphabets. Theor. Comput. Sci. 426, 66–74 (2012)
Knöpfle, S.D.: String samplesort. Bachelor Thesis, Karlsruhe Institute of Technology, in German (2012)
Knuth, D.E.: The Art of Computer Programming, Volume 3: Sorting And Searching, 2nd edn. Addison Wesley Longman Publishing Co., Inc, Redwood (1998)
Kogge, P.M., Stone, H.S.: A parallel algorithm for the efficient solution of a general class of recurrence equations. IEEE Trans. Comput. 100(8), 786–793 (1973)
Kärkkäinen, J., Rantala, T.: Engineering radix sort for strings. In: 16th International Conference on String Processing and Information Retrieval (SPIRE), no. 5280 in LNCS, pp. 3–14. Springer-Verlag (2009)
McIlroy, P.M., Bostic, K., McIlroy, M.D.: Engineering radix sort. Comput. Syst. 6(1), 5–27 (1993)
Mehlhorn, K., Sanders, P.: Scanning multiple sequences via cache memory. Algorithmica 35(1), 75–93 (2003)
Ng, W., Kakehi, K.: Cache efficient radix sort for string sorting. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. E90–A(2), 457–466 (2007)
Ng, W., Kakehi, K.: Merging string sequences by longest common prefixes. IPSJ Digit. Cour. 4, 69–78 (2008)
Rantala, T.: Library of string sorting algorithms in C++. http://github.com/rantala/string-sorting (2007). Git repository accessed November 2012
Sanders, P.: Fast priority queues for cached memory. J. Exp. Algorithmics (JEA) 5, 7 (2000)
Sanders, P., Winkel, S.: Super scalar sample sort. In: 12th European Symposium on Algorithms (ESA), LNCS, vol. 3221, pp. 784–796. Springer-Verlag (2004)
Shamsundar, N.: A fast, stable implementation of mergesort for sorting text files. http://code.google.com/p/lcp-merge-string-sort (2009). Source downloaded November 2012
Singler, J., Sanders, P., Putze, F.: MCSTL: The multi-core standard template library. In: Euro-Par 2007 Parallel Processing, no. 4641 in LNCS, pp. 682–694. Springer-Verlag (2007)
Sinha, R., Wirth, A.: Engineering burstsort: toward fast in-place string sorting. J. Exp. Algorithmics (JEA) 15, 1–24 (2010)
Sinha, R., Zobel, J.: Cache-conscious sorting of large sets of strings with dynamic tries. J. Exp. Algorithmics (JEA) 9, 1–31 (2004)
Sinha, R., Zobel, J., Ring, D.: Cache-efficient string sorting using copying. J. Exp. Algorithmics (JEA) 11, 1–32 (2007)
Tsigas, P., Zhang, Y.: A simple, fast parallel implementation of quicksort and its performance evaluation on SUN enterprise 10000. In: 11th Euromicro Workshop on Parallel, Distributed and Network-Based Processing (PDP), pp. 372–381. IEEE Computer Society (2003)
Wassenberg, J., Sanders, P.: Engineering a multi-core radix sort. In: Euro-Par 2011 Parallel Processing, no. 6853 in LNCS, pp. 160–169. Springer-Verlag (2011)
Yang, M.C.K., Huang, J.S., Chow, Y.C.: Optimal parallel sorting scheme by order statistics. SIAM J. Comput. 16(6), 990–1003 (1987)
Acknowledgments
We would like to thank the anonymous reviewer for extraordinarily thorough checking of our algorithms and proofs, and for kind suggestions on how to improve the paper.
Appendix: Performance of Sequential Algorithms
We collected many sequential string sorting algorithms in our test framework. We believe it to contain virtually every string sorting implementation publicly available.
The algorithm library by Rantala [25] contains 37 versions of radix sort (in-place, out-of-place, and one-pass with various dynamic memory allocation schemes), 26 variants of multikey quicksort (with caching, block-based, with different dynamic memory allocation schemes, and with SIMD instructions), 10 different funnelsorts, 38 implementations of burstsort (again with different dynamic memory management schemes), and 29 mergesorts (with losertree and LCP caching variants). In total these are 140 original implementation variants, all of high quality.
The other main source of string sorting implementations is the publications of Sinha. We included the original burstsort implementations (one with dynamically growing arrays and one with linked lists), and 9 versions of copy-burstsort. The original copy-burstsort code was written for 32-bit machines, and we modified it to work with 64-bit pointers.
We also incorporated the implementations of CRadix sort and LCP-Mergesort by Ng, and the original multikey quicksort code by Bentley and Sedgewick.
Of the 203 different sequential string sorting variants, we selected the thirteen implementations listed in Table 4 to represent both the fastest ones in a preliminary test and each of the basic algorithms from Sect. 3. The thirteen algorithms were run on all our five test platforms on small portions of the test instances described in Sect. 7. Tables 5 and 6 show the results, with the fastest algorithm’s time highlighted with bold text.
Empty cells in the tables indicate a program error, an out-of-memory exception, or an extremely long runtime. This was always the case for the copy-burstsort variants on the GOV2 and Wikipedia inputs, because they perform excessive caching of characters. On Inteli7, some implementations required more memory than the available 12 GiB to sort the 4 GiB prefixes of Random and URLs.
Across all instance-platform pairs, multikey quicksort with caching of eight characters (mkqs_cache8) was fastest on 18 pairs, more than any other algorithm. It was fastest on all platforms for both URL list and GOV2 prefixes, except URL on IntelX5, and on all large instances on AMD48 and AMD16. However, for the NoDup input, which consists of short strings over a large alphabet, the highly tuned radix sort radixR_CE7 consistently outperformed mkqs_cache8 on all platforms by a small margin. The copy-burstsort variant fbC_burstsort was most efficient on all platforms for DNA, which consists of short strings over a small alphabet. For Random strings and Wikipedia suffixes, either mkqs_cache8 or radixR_CE7 was fastest, depending on the platform's memory bandwidth and sequential processing speed. Our own sequential implementations of \(\mathrm{S}^5\) were never the fastest, but they consistently fell in the middle of the field, without any outliers. This is expected, since \(\mathrm{S}^5\) is mainly designed to be used as an efficient top-level parallel algorithm, and to be conservative with memory bandwidth, which is the limiting factor for data-intensive multi-core applications.
We also measured the peak memory usage of the sequential implementations using a heap and stack profiling tool (footnote 4) for the selected sequential test instances. The bottom of Table 5 shows the results in MiB, excluding the string data array and the string pointer array (all our systems are 64-bit, so pointers are eight bytes). We must note that the profiler considers allocated virtual memory, which may not be identical to the amount of physical memory actually used. The table plainly shows that the more caching an implementation does, the higher its peak memory allocation. However, the memory usage of fbC_burstsort is extreme, even if one considers that the implementation can deallocate and recreate the string data from the burst trie. The lower memory usage of fbC_burstsort for Random is due to the high percentage of characters stored implicitly in the trie structure. The sCPL_burstsort and burstsortA variants bring the memory requirement down somewhat, but it is still high. Some radixsort variants and, most notably, mkqs_cache8 are also not particularly memory conservative, again due to caching. Our sequential \(\mathrm{S}^5\) implementation fares well in this comparison because it does no caching and permutes the string pointers in-place (Note that radixsort is used for small string subsets in sequential \(\mathrm{S}^5\). This is due to the development history: we finished sequential \(\mathrm{S}^5\) before focusing on caching multikey quicksort). For sorting with little extra memory, plain multikey quicksort is still a good choice (Tables 7, 8, 9, 10, 11, 12, 13).
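To make the remark about plain multikey quicksort more concrete, here is a minimal Python sketch of the ternary-split idea. It is neither the original Bentley and Sedgewick code nor the cached mkqs_cache8 variant measured above; the function name multikey_quicksort, the middle-element pivot choice, and the explicit work stack are assumptions made for this illustration. Apart from that stack, it only swaps entries of the input list, which is one way to see why the plain variant needs little extra memory.

```python
def multikey_quicksort(strings):
    """Sort a list of str in place by ternary partitioning on one character."""

    def char_at(s, depth):
        # -1 is an end-of-string sentinel that sorts before every character.
        return ord(s[depth]) if depth < len(s) else -1

    # Each work item is a half-open range [lo, hi) whose strings share
    # a common prefix of length `depth`.
    stack = [(0, len(strings), 0)]
    while stack:
        lo, hi, depth = stack.pop()
        if hi - lo < 2:
            continue
        pivot = char_at(strings[(lo + hi) // 2], depth)
        # Three-way partition: [lo, lt) < pivot, [lt, gt) == pivot, [gt, hi) > pivot.
        lt, i, gt = lo, lo, hi
        while i < gt:
            c = char_at(strings[i], depth)
            if c < pivot:
                strings[lt], strings[i] = strings[i], strings[lt]
                lt += 1
                i += 1
            elif c > pivot:
                gt -= 1
                strings[gt], strings[i] = strings[i], strings[gt]
            else:
                i += 1
        stack.append((lo, lt, depth))
        stack.append((gt, hi, depth))
        # Strings equal at this character continue with the next one,
        # unless they have all ended here and are therefore identical.
        if pivot != -1:
            stack.append((lt, gt, depth + 1))
```

Calling multikey_quicksort(words) on a list of Python strings reorders it in place by code point, which is the same order that the built-in sorted produces for str values.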
Cite this article
Bingmann, T., Eberle, A. & Sanders, P. Engineering Parallel String Sorting. Algorithmica 77, 235–286 (2017). https://doi.org/10.1007/s00453-015-0071-1
DOI: https://doi.org/10.1007/s00453-015-0071-1