Abstract
We describe an external memory suffix array construction algorithm based on constructing suffix arrays for blocks of text and merging them into the full suffix array. The basic idea goes back over 20 years, and there have been a couple of later improvements, but we describe several further improvements that make the algorithm much faster. In particular, we reduce the I/O volume of the algorithm by a factor \(\mathcal {O}\!\left( {\log _\sigma n} \right) \). Our experiments show that the algorithm is the fastest suffix array construction algorithm when the size of the text is within a factor of about five from the size of the RAM in either direction, which is a common situation in practice.
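The block-and-merge idea can be illustrated with a minimal in-memory sketch. It is not the algorithm of the paper: it sorts and merges by comparing whole suffixes directly, uses no external memory, and makes no attempt to reproduce the I/O reductions described above. The function names `block_suffix_array` and `blockwise_suffix_array` and the `block_size` parameter are hypothetical, chosen only for this illustration.

```python
import heapq


def block_suffix_array(text, start, end):
    """Return the starting positions in [start, end), sorted by the
    suffix of the whole text that begins at each position.

    Illustrative only: suffixes are compared naively as Python strings.
    """
    return sorted(range(start, end), key=lambda i: text[i:])


def blockwise_suffix_array(text, block_size):
    """Build the suffix array of `text` from per-block partial arrays.

    Hypothetical toy version of "build per block, then merge":
    each block's positions are sorted separately, and the sorted
    lists are combined with a multiway merge.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    partial = [
        block_suffix_array(text, s, min(s + block_size, len(text)))
        for s in range(0, len(text), block_size)
    ]
    # All suffixes of a text are distinct, so the merge order is unambiguous.
    return list(heapq.merge(*partial, key=lambda i: text[i:]))


if __name__ == "__main__":
    print(blockwise_suffix_array("banana", 2))
```

Whatever block size is chosen, the result should match a direct sort of all suffixes. A practical implementation would avoid materialising suffixes as strings and would keep only one block plus small auxiliary structures in RAM at a time.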
Additional information
This research was supported by the Academy of Finland Grant 118653 (ALGODAN).
Cite this article
Kärkkäinen, J., Kempa, D. Engineering a Lightweight External Memory Suffix Array Construction Algorithm. Math.Comput.Sci. 11, 137–149 (2017). https://doi.org/10.1007/s11786-016-0281-1
DOI: https://doi.org/10.1007/s11786-016-0281-1