Engineering a Lightweight External Memory Suffix Array Construction Algorithm

Mathematics in Computer Science

Abstract

We describe an external memory suffix array construction algorithm based on constructing suffix arrays for blocks of text and merging them into the full suffix array. The basic idea goes back over 20 years and there have been a couple of later improvements, but we describe several further improvements that make the algorithm much faster. In particular, we reduce the I/O volume of the algorithm by a factor of \(\mathcal{O}(\log_\sigma n)\). Our experiments show that the algorithm is the fastest suffix array construction algorithm when the size of the text is within a factor of about five from the size of the RAM in either direction, which is a common situation in practice.
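
The block-and-merge idea mentioned in the abstract can be illustrated with a toy, fully in-memory sketch. This is not the authors' algorithm: the function names (`block_suffix_array`, `merge_blocks`, `blockwise_suffix_array`) are hypothetical, suffixes are compared naively by slicing the whole text, and no disk I/O is modeled. It only shows how per-block suffix arrays, each sorted with respect to the full text, can be merged into the complete suffix array.

```python
import heapq


def block_suffix_array(text, start, end):
    """Sort the suffixes that START in text[start:end].

    Assumption for illustration: each suffix is compared as a full
    suffix of `text` (by slicing), which is simple but slow.
    """
    return sorted(range(start, end), key=lambda i: text[i:])


def merge_blocks(text, partial_arrays):
    """Merge the sorted per-block position lists into one suffix array."""
    return list(heapq.merge(*partial_arrays, key=lambda i: text[i:]))


def blockwise_suffix_array(text, block_size):
    """Build the suffix array of `text` block by block, then merge."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    partial_arrays = [
        block_suffix_array(text, s, min(s + block_size, len(text)))
        for s in range(0, len(text), block_size)
    ]
    return merge_blocks(text, partial_arrays)


if __name__ == "__main__":
    print(blockwise_suffix_array("banana", 2))
```

In this sketch the block size stands in for the amount of text that fits in RAM at once; a real external memory implementation would keep the partial arrays on disk and avoid whole-suffix comparisons.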



Author information

Corresponding author

Correspondence to Juha Kärkkäinen.

Additional information

This research was supported by the Academy of Finland Grant 118653 (ALGODAN).

Cite this article

Kärkkäinen, J., Kempa, D. Engineering a Lightweight External Memory Suffix Array Construction Algorithm. Math.Comput.Sci. 11, 137–149 (2017). https://doi.org/10.1007/s11786-016-0281-1

  • DOI: https://doi.org/10.1007/s11786-016-0281-1
