Engineering Burstsort: Towards Fast In-Place String Sorting

  • Ranjan Sinha
  • Anthony Wirth
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5038)

Abstract

Burstsort is a trie-based string sorting algorithm that distributes strings into small buckets whose contents are then sorted in cache. This approach has previously been shown to be efficient on modern cache-based processors [Sinha & Zobel, JEA 2004]. In this paper, we introduce improvements that significantly reduce the memory requirements of burstsort. Excess memory has been reduced by an order of magnitude, so that burstsort now uses less than 1% more memory than an in-place algorithm. These techniques can be applied to existing variants of burstsort, as well as to other string algorithms.
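The basic idea of distributing strings through a trie into buckets can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Node`, `_insert`, `_collect`, `burstsort` and the tiny `BURST_THRESHOLD` are assumptions made for illustration, and Python lists stand in for the cache-sized pointer buckets of a real implementation.

```python
BURST_THRESHOLD = 4  # hypothetical value, kept tiny so the example actually bursts


class Node:
    """Trie node: one child per leading character, plus strings that end here."""

    def __init__(self):
        self.children = {}   # char -> Node (internal) or list (bucket)
        self.exhausted = []  # strings fully consumed at this node


def _insert(node, s, depth):
    """Route s down the trie from `node`, consuming one character per level."""
    while True:
        if depth == len(s):
            node.exhausted.append(s)
            return
        c = s[depth]
        child = node.children.get(c)
        if child is None:
            node.children[c] = [s]          # start a new bucket
            return
        if isinstance(child, Node):
            node, depth = child, depth + 1  # descend one level
            continue
        child.append(s)
        if len(child) > BURST_THRESHOLD:
            # Burst: replace the full bucket with a node and redistribute
            # its strings on their next character.
            new_node = Node()
            node.children[c] = new_node
            for t in child:
                _insert(new_node, t, depth + 1)
        return


def _collect(node, depth, out):
    """In-order traversal; each small bucket is sorted on its unshared suffix."""
    out.extend(node.exhausted)
    for c in sorted(node.children):
        child = node.children[c]
        if isinstance(child, Node):
            _collect(child, depth + 1, out)
        else:
            out.extend(sorted(child, key=lambda t: t[depth + 1:]))


def burstsort(strings):
    root = Node()
    for s in strings:
        _insert(root, s, 0)
    out = []
    _collect(root, 0, out)
    return out
```

Strings in one bucket share a common prefix, so only their suffixes need comparing, and each bucket is small enough to be sorted on its own.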

We redesigned the buckets, introducing sub-buckets and an index structure for them, which resulted in an order-of-magnitude space reduction. We also show the practicality of moving some fields from the trie nodes to the insertion point (for the next string pointer) in the bucket; this technique reduces the memory usage of the trie nodes by one-third. Significantly, combining these memory-usage improvements does not adversely affect the overall speed of burstsort on real-world string collections. In addition, during the bucket-sorting phase, the string suffixes are copied to a small buffer to improve their spatial locality, lowering the running time of burstsort by up to 30%.
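One way to see why sub-buckets can save space is the sketch below. It is an illustration under stated assumptions, not the layout used in the paper: the class `SubBucketedBucket`, its fixed `sub_capacity`, and the comparison function `doubling_allocated` are all hypothetical. A bucket that grows by doubling may leave up to half of its slots unused, whereas a bucket built from small sub-buckets located through an index wastes at most one sub-bucket.

```python
class SubBucketedBucket:
    """Bucket stored as fixed-size sub-buckets located through an index."""

    def __init__(self, sub_capacity=4):
        self.sub_capacity = sub_capacity  # hypothetical sub-bucket size
        self.index = []                   # index structure: one entry per sub-bucket
        self.count = 0                    # number of items stored

    def append(self, item):
        slot = self.count % self.sub_capacity
        if slot == 0:
            # Last sub-bucket is full (or none exists): allocate one more.
            self.index.append([None] * self.sub_capacity)
        self.index[-1][slot] = item
        self.count += 1

    def allocated(self):
        """Total slots allocated, used or not."""
        return len(self.index) * self.sub_capacity

    def __iter__(self):
        for i in range(self.count):
            yield self.index[i // self.sub_capacity][i % self.sub_capacity]


def doubling_allocated(n, initial=4):
    """Slots allocated by a bucket that doubles its capacity when full."""
    cap = initial
    while cap < n:
        cap *= 2
    return cap


bucket = SubBucketedBucket(sub_capacity=4)
items = ["s%d" % i for i in range(9)]
for item in items:
    bucket.append(item)

slack_sub = bucket.allocated() - bucket.count        # unused slots with sub-buckets
slack_doubling = doubling_allocated(len(items)) - 9  # unused slots with doubling
```

In this sketch the unused space is bounded by the sub-bucket size rather than by the bucket size, at the cost of one index entry per sub-bucket.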

References

  1. Aho, A., Hopcroft, J.E., Ullman, J.D.: The Design and Analysis of Computer Algorithms. Addison-Wesley, Reading (1974)
  2. Andersson, A., Nilsson, S.: Implementing radixsort. ACM Jour. of Experimental Algorithmics 3(7) (1998)
  3. Arge, L., Ferragina, P., Grossi, R., Vitter, J.S.: On sorting strings in external memory. In: Leighton, F.T., Shor, P. (eds.) Proc. ACM Symp. on Theory of Computation, El Paso, pp. 540–548. ACM Press, New York (1997)
  4. Bender, M.A., Colton, M.F., Kuszmaul, B.C.: Cache-oblivious string b-trees. In: PODS 2006: Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, New York, NY, USA, pp. 233–242. ACM Press, New York (2006)
  5. Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., Wheeler, D.L.: Genbank. Nucleic Acids Research 31(1), 23–27 (2003)
  6. Bentley, J., Sedgewick, R.: Fast algorithms for sorting and searching strings. In: Saks, M. (ed.) Proc. Annual ACM-SIAM Symp. on Discrete Algorithms, New Orleans, LA, USA. Society for Industrial and Applied Mathematics, pp. 360–369 (1997)
  7. Bentley, J.L., McIlroy, M.D.: Engineering a sort function. Software: Practice and Experience 23(11), 1249–1265 (1993)
  8. Brodal, G.S., Fagerberg, R., Vinther, K.: Engineering a cache-oblivious sorting algorithm. ACM Jour. of Experimental Algorithmics 12(2.2), 23 (2007)
  9. Demaine, E.D.: Cache-oblivious algorithms and data structures. In: Lecture Notes from the EEF Summer School on Massive Data Sets, BRICS, University of Aarhus, Denmark, June 2002. LNCS (2002)
  10. Frigo, M., Leiserson, C.E., Prokop, H., Ramachandran, S.: Cache-oblivious algorithms. In: Beame, P. (ed.) FOCS 1999: Proceedings of the 40th Annual Symposium on Foundations of Computer Science, Washington, DC, USA, pp. 285–298. IEEE Computer Society Press, Los Alamitos (1999)
  11. Graefe, G.: Implementing sorting in database systems. Computing Surveys 38(3), 1–37 (2006)
  12. Harman, D.: Overview of the second text retrieval conference (TREC-2). Information Processing and Management 31(3), 271–289 (1995)
  13. Heinz, S., Zobel, J., Williams, H.E.: Burst tries: A fast, efficient data structure for string keys. ACM Transactions on Information Systems 20(2), 192–223 (2002)
  14. Knuth, D.E.: The Art of Computer Programming: Sorting and Searching, 2nd edn., vol. 3. Addison-Wesley, Reading (1998)
  15. Levitin, A.V.: Introduction to the Design and Analysis of Algorithms, 2nd edn. Pearson, London (2007)
  16. McIlroy, P.M., Bostic, K., McIlroy, M.D.: Engineering radix sort. Computing Systems 6(1), 5–27 (1993)
  17. Moffat, A., Eddy, G., Petersson, O.: Splaysort: Fast, versatile, practical. Software: Practice and Experience 26(7), 781–797 (1996)
  18. Sedgewick, R.: Algorithms in C, 3rd edn. Addison-Wesley Longman Publishing Co., Inc., Boston (1998)
  19. Seward, J.: Valgrind: memory and cache profiler (2001), http://developer.kde.org/~sewardj/docs-1.9.5/cg_techdocs.html
  20. Sinha, R., Ring, D., Zobel, J.: Cache-efficient string sorting using copying. ACM Jour. of Experimental Algorithmics 11(1.2) (2006)
  21. Sinha, R., Zobel, J.: Cache-conscious sorting of large sets of strings with dynamic tries. ACM Jour. of Experimental Algorithmics 9(1.5) (2004)
  22. Sinha, R., Zobel, J.: Using random sampling to build approximate tries for efficient string sorting. ACM Jour. of Experimental Algorithmics 10 (2005)

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Ranjan Sinha (1)
  • Anthony Wirth (1)

  1. Department of Computer Science and Software Engineering, The University of Melbourne, Australia
