Engineering Parallel String Sorting

Bingmann, Timo; Eberle, Andreas; Sanders, Peter

doi:10.1007/s00453-015-0071-1

Published: 18 September 2015

Algorithmica, Volume 77, pages 235–286 (2017)

Abstract

We discuss how string sorting algorithms can be parallelized on modern multi-core shared memory machines. As a synthesis of the best sequential string sorting algorithms and successful parallel sorting algorithms for atomic objects, we first propose string sample sort. The algorithm makes effective use of the memory hierarchy, uses additional word level parallelism, and largely avoids branch mispredictions. Then we focus on NUMA architectures, and develop parallel multiway LCP-merge and -mergesort to reduce the number of random memory accesses to remote nodes. Additionally, we parallelize variants of multikey quicksort and radix sort that are also useful in certain situations. As a base-case sorter for LCP-aware string sorting we describe sequential LCP-insertion sort, which calculates the LCP array and uses it to accelerate its insertions. Comprehensive experiments on five current multi-core platforms are then reported and discussed. The experiments show that our parallel string sorting implementations scale very well on real-world inputs and modern machines.

Notes

We are currently working on this for a final version of this paper.
See http://panthema.net/2013/pmbw/ for parallel memory bandwidth experiments.
The entropy \(\frac{1}{n}\sum _i\log \frac{n}{|b_i|}\) can be used to define the amount of information gained by a set of splitters. The bucket sizes \(b_i\) can be estimated using their size within the sample.
http://panthema.net/2013/malloc_count/, by one of the authors.

References

1. Abouelhoda, M.I., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. J. Discrete Algorithms (JDA) 2(1), 53–86 (2004)
2. Akiba, T.: Parallel string radix sort in C++. http://github.com/iwiwi/parallel-string-radix-sort (2011). Git repository accessed November 2012
3. Akl, S.G., Santoro, N.: Optimal parallel merging and sorting without memory conflicts. IEEE Trans. Comput. C-36(11), 1367–1369 (1987)
4. Amir, A., Franceschini, G., Grossi, R., Kopelowitz, T., Lewenstein, M., Lewenstein, N.: Managing unbounded-length keys in comparison-driven data structures with applications to online indexing. SIAM J. Comput. 43(4), 1396–1416 (2014)
5. Andersson, A., Nilsson, S.: Implementing radixsort. J. Exp. Algorithmics (JEA) 3, 7 (1998)
6. Bentley, J.L., Sedgewick, R.: Fast algorithms for sorting and searching strings. In: ACM (ed.) 8th Symposium on Discrete Algorithms (SODA), pp. 360–369 (1997)
7. Bingmann, T., Sanders, P.: Parallel string sample sort. In: 21st European Symposium on Algorithms (ESA), no. 8125 in LNCS. Springer-Verlag (2013)
8. Blelloch, G.E., Leiserson, C.E., Maggs, B.M., Plaxton, C.G., Smith, S.J., Zagha, M.: A comparison of sorting algorithms for the connection machine CM-2. In: 3rd Symposium on Parallel Algorithms and Architectures (SPAA), pp. 3–16. ACM (1991)
9. Brent, R.P.: The parallel evaluation of general arithmetic expressions. J. ACM (JACM) 21(2), 201–206 (1974)
10. Cole, R.: Parallel merge sort. SIAM J. Comput. 17(4), 770–785 (1988)
11. Dementiev, R., Kettner, L., Mehnert, J., Sanders, P.: Engineering a sorted list data structure for 32 bit keys. In: 6th Workshop on Algorithm Engineering & Experiments (ALENEX), pp. 142–151. SIAM (2004)
12. Eberle, A.: Parallel multiway LCP-mergesort. Bachelor Thesis, Karlsruhe Institute of Technology, to appear (2014)
13. Frazer, W.D., McKellar, A.C.: Samplesort: a sampling approach to minimal storage tree sorting. J. ACM (JACM) 17(3), 496–507 (1970)
14. Hagerup, T.: Optimal parallel string algorithms: sorting, merging and computing the minimum. In: 16th ACM Symposium on Theory of Computing (STOC), pp. 382–391 (1994)
15. Hoare, C.A.R.: Quicksort. Comput. J. 5(1), 10–16 (1962)
16. Kent, C., Lewenstein, M., Sheinwald, D.: On demand string sorting over unbounded alphabets. Theor. Comput. Sci. 426, 66–74 (2012)
17. Knöpfle, S.D.: String samplesort. Bachelor Thesis, Karlsruhe Institute of Technology, in German (2012)
18. Knuth, D.E.: The Art of Computer Programming, Volume 3: Sorting And Searching, 2nd edn. Addison Wesley Longman Publishing Co., Inc, Redwood (1998)
19. Kogge, P.M., Stone, H.S.: A parallel algorithm for the efficient solution of a general class of recurrence equations. IEEE Trans. Comput. C-22(8), 786–793 (1973)
20. Kärkkäinen, J., Rantala, T.: Engineering radix sort for strings. In: 16th International Conference on String Processing and Information Retrieval (SPIRE), no. 5280 in LNCS, pp. 3–14. Springer-Verlag (2009)
21. McIlroy, P.M., Bostic, K., McIlroy, M.D.: Engineering radix sort. Comput. Syst. 6(1), 5–27 (1993)
22. Mehlhorn, K., Sanders, P.: Scanning multiple sequences via cache memory. Algorithmica 35(1), 75–93 (2003)
23. Ng, W., Kakehi, K.: Cache efficient radix sort for string sorting. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. E90-A(2), 457–466 (2007)
24. Ng, W., Kakehi, K.: Merging string sequences by longest common prefixes. IPSJ Digit. Cour. 4, 69–78 (2008)
25. Rantala, T.: Library of string sorting algorithms in C++. http://github.com/rantala/string-sorting (2007). Git repository accessed November 2012
26. Sanders, P.: Fast priority queues for cached memory. J. Exp. Algorithmics (JEA) 5, 7 (2000)
27. Sanders, P., Winkel, S.: Super scalar sample sort. In: 12th European Symposium on Algorithms (ESA), LNCS, vol. 3221, pp. 784–796. Springer-Verlag (2004)
28. Shamsundar, N.: A fast, stable implementation of mergesort for sorting text files. http://code.google.com/p/lcp-merge-string-sort (2009). Source downloaded November 2012
29. Singler, J., Sanders, P., Putze, F.: MCSTL: The multi-core standard template library. In: Euro-Par 2007 Parallel Processing, no. 4641 in LNCS, pp. 682–694. Springer-Verlag (2007)
30. Sinha, R., Wirth, A.: Engineering burstsort: toward fast in-place string sorting. J. Exp. Algorithmics (JEA) 15, 1–24 (2010)
31. Sinha, R., Zobel, J.: Cache-conscious sorting of large sets of strings with dynamic tries. J. Exp. Algorithmics (JEA) 9, 1–31 (2004)
32. Sinha, R., Zobel, J., Ring, D.: Cache-efficient string sorting using copying. J. Exp. Algorithmics (JEA) 11, 1–32 (2007)
33. Tsigas, P., Zhang, Y.: A simple, fast parallel implementation of quicksort and its performance evaluation on SUN enterprise 10000. In: 11th Euromicro Workshop on Parallel, Distributed and Network-Based Processing (PDP), pp. 372–381. IEEE Computer Society (2003)
34. Wassenberg, J., Sanders, P.: Engineering a multi-core radix sort. In: Euro-Par 2011 Parallel Processing, no. 6853 in LNCS, pp. 160–169. Springer-Verlag (2011)
35. Yang, M.C.K., Huang, J.S., Chow, Y.C.: Optimal parallel sorting scheme by order statistics. SIAM J. Comput. 16(6), 990–1003 (1987)

Acknowledgments

We would like to thank the anonymous reviewer for extraordinarily thorough checking of our algorithms and proofs, and for kind suggestions on how to improve the paper.

Author information

Authors and Affiliations

Karlsruhe Institute of Technology, Kaiserstraße 12, 76131, Karlsruhe, Germany
Timo Bingmann, Andreas Eberle & Peter Sanders

Corresponding author

Correspondence to Timo Bingmann.

Appendix: Performance of Sequential Algorithms

We collected many sequential string sorting algorithms in our test framework. We believe it to contain virtually every string sorting implementation publicly available.

Table 4 Description of selected sequential string sorting algorithms

The algorithm library by Rantala [25] contains 37 versions of radix sort (in-place, out-of-place, and one-pass with various dynamic memory allocation schemes), 26 variants of multikey quicksort (with caching, block-based, different dynamic memory allocation schemes, and SIMD instructions), 10 different funnelsorts, 38 implementations of burstsort (again with different dynamic memory management schemes), and 29 mergesorts (with losertree and LCP caching variants). In total these are 140 original implementation variants, all of high quality.

The other main source of string sorting implementations is the publications of Sinha. We included the original burstsort implementations (one with dynamically growing arrays and one with linked lists), and 9 versions of copy-burstsort. The original copy-burstsort code was written for 32-bit machines, and we modified it to work with 64-bit pointers.

We also incorporated the implementations of CRadix sort and LCP-Mergesort by Ng, and the original multikey quicksort code by Bentley and Sedgewick.

Of the 203 different sequential string sorting variants, we selected the thirteen implementations listed in Table 4 to represent both the fastest ones in a preliminary test and each of the basic algorithms from Sect. 3. The thirteen algorithms were run on all our five test platforms on small portions of the test instances described in Sect. 7. Tables 5 and 6 show the results, with the fastest algorithm’s time highlighted with bold text.

Table 5 Run time of sequential algorithms on IntelE5 and AMD48 in seconds, and peak memory usage of algorithms on IntelE5

Table 6 Run time of sequential algorithms on AMD16, Inteli7, and IntelX5 in seconds

Empty cells in the tables indicate a program error, an out-of-memory exception, or an extremely long run time. This was always the case for the copy-burstsort variants on the GOV2 and Wikipedia inputs, because they perform excessive caching of characters. On Inteli7, some implementations required more memory than the available 12 GiB to sort the 4 GiB prefixes of Random and URLs.

Table 7 Absolute run time of parallel and best sequential algorithms on IntelE5 in seconds, median of 1–3 runs

Over all run instances and platforms, multikey quicksort with caching of eight characters was fastest on 18 pairs, winning the most tests. It was fastest on all platforms for both the URL list and GOV2 prefixes, except URL on IntelX5, and on all large instances on AMD48 and AMD16. However, for the NoDup input, which consists of short strings over a large alphabet, the highly tuned radix sort radixR_CE7 consistently outperformed mkqs_cache8 on all platforms by a small margin. The copy-burstsort variant fbC_burstsort was most efficient on all platforms for DNA, which consists of short strings over a small alphabet. For Random strings and Wikipedia suffixes, either mkqs_cache8 or radixR_CE7 was fastest, depending on the platform's memory bandwidth and sequential processing speed. Our own sequential implementations of \(\mathrm{S}^5\) were never the fastest, but they consistently fell in the middle of the field, without any outliers. This is expected, since \(\mathrm{S}^5\) is mainly designed to be used as an efficient top-level parallel algorithm and to be conservative with memory bandwidth, which is the limiting factor for data-intensive multi-core applications.
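For readers who want a concrete picture of the algorithm family that won most of these tests, the following is a minimal Python sketch of plain multikey quicksort in the style of Bentley and Sedgewick. It is an illustration only: it does not include the eight-character cache of mkqs_cache8, it is not taken from any of the benchmarked libraries, and the names multikey_quicksort and char_at are hypothetical.

```python
def multikey_quicksort(strings):
    """Return a sorted copy of a list of str using three-way multikey quicksort."""
    a = list(strings)

    def char_at(s, depth):
        # -1 is an end-of-string sentinel that sorts before every real character
        return ord(s[depth]) if depth < len(s) else -1

    # Explicit stack of (lo, hi, depth) ranges; hi is exclusive.
    stack = [(0, len(a), 0)]
    while stack:
        lo, hi, depth = stack.pop()
        if hi - lo < 2:
            continue
        pivot = char_at(a[(lo + hi) // 2], depth)
        # Three-way partition on the character at position `depth`:
        # a[lo:lt] < pivot, a[lt:i] == pivot, a[gt+1:hi] > pivot
        lt, i, gt = lo, lo, hi - 1
        while i <= gt:
            c = char_at(a[i], depth)
            if c < pivot:
                a[lt], a[i] = a[i], a[lt]
                lt += 1
                i += 1
            elif c > pivot:
                a[i], a[gt] = a[gt], a[i]
                gt -= 1
            else:
                i += 1
        stack.append((lo, lt, depth))          # smaller character: same depth
        stack.append((gt + 1, hi, depth))      # larger character: same depth
        if pivot >= 0:
            # equal character: these strings share one more prefix character
            stack.append((lt, gt + 1, depth + 1))
    return a


if __name__ == "__main__":
    print(multikey_quicksort(["band", "banana", "apple", "ban"]))
```

The sketch only swaps references to the strings and inspects one character per comparison, which is why the plain variant needs little extra memory. Caching variants trade additional memory for fewer accesses to the string data.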

We also measured the peak memory usage of the sequential implementations on the selected sequential test instances using a heap and stack profiling tool (footnote 4, listed under Notes). The bottom of Table 5 shows the results in MiB, excluding the string data array and the string pointer array (we only have 64-bit systems, so pointers are eight bytes). Note that the profiler counts allocated virtual memory, which may differ from the amount of physical memory actually used. The table plainly shows that the more caching an implementation does, the higher its peak memory allocation. However, the memory usage of fbC_burstsort is extreme, even if one considers that the implementation can deallocate and recreate the string data from the burst trie. The lower memory usage of fbC_burstsort for Random is due to the high percentage of characters stored implicitly in the trie structure. The sCPL_burstsort and burstsortA variants bring the memory requirement down somewhat, but it remains high. Some radix sort variants and, most notably, mkqs_cache8 are also not particularly memory conservative, again due to caching. Our sequential \(\mathrm{S}^5\) implementation fares well in this comparison because it does no caching and permutes the string pointers in place. (Note that radix sort is used for small string subsets in sequential \(\mathrm{S}^5\). This is due to the development history: we finished sequential \(\mathrm{S}^5\) before focusing on caching multikey quicksort.) For sorting with little extra memory, plain multikey quicksort is still a good choice (Tables 7, 8, 9, 10, 11, 12, 13).
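The link between character caching and memory usage can be made concrete with a small Python sketch. It assumes, for illustration only, that a cache entry is a fixed-width integer holding the next eight characters and that one such entry is stored next to each eight-byte string pointer. The names cache_key and cache_overhead_bytes are hypothetical and do not come from any of the benchmarked implementations.

```python
def cache_key(s: bytes, depth: int, width: int = 8) -> int:
    """Pack up to `width` bytes of s, starting at `depth`, into one integer.

    Big-endian packing makes integer comparison agree with byte-wise
    lexicographic comparison of the cached characters. Short strings are
    padded with zero bytes, so this sketch assumes strings without
    embedded zero bytes (as with C strings).
    """
    chunk = s[depth:depth + width]
    return int.from_bytes(chunk.ljust(width, b"\0"), "big")


def cache_overhead_bytes(num_strings: int, width: int = 8) -> int:
    """Extra memory for one cache entry per string (assumed layout)."""
    return num_strings * width


if __name__ == "__main__":
    words = [b"banana", b"band", b"ban", b"apple"]
    # Sorting by the cached key needs no access to the string data itself.
    print(sorted(words, key=lambda w: cache_key(w, 0)))
    print(cache_overhead_bytes(len(words)))
```

Under these assumptions an eight-character cache adds as many bytes per string as the pointer itself, which is consistent with the observation that caching variants allocate more memory than implementations that only permute pointers.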

Table 8 Absolute run time of parallel and best sequential algorithms on AMD48 in seconds, median of 1–3 runs

Table 9 Absolute run time of parallel and best sequential algorithms on AMD16 in seconds, median of 1–3 runs

Table 10 Absolute run time of parallel and best sequential algorithms on Inteli7 in seconds, median of fifteen runs, larger test instances

Table 11 Absolute run time of parallel and best sequential algorithms on Inteli7 in seconds, median of fifteen runs, smaller test instances

Table 12 Absolute run time of parallel and best sequential algorithms on IntelX5 in seconds, median of fifteen runs, larger test instances

Table 13 Absolute run time of parallel and best sequential algorithms on IntelX5 in seconds, median of fifteen runs, smaller test instances

About this article

Cite this article

Bingmann, T., Eberle, A. & Sanders, P. Engineering Parallel String Sorting. Algorithmica 77, 235–286 (2017). https://doi.org/10.1007/s00453-015-0071-1

Received: 09 March 2014
Accepted: 05 September 2015
Published: 18 September 2015
Issue Date: January 2017
DOI: https://doi.org/10.1007/s00453-015-0071-1
