Abstract
A suffix array is a data structure that, together with the LCP array, makes it possible to solve many string processing problems very efficiently. In this article we introduce eGSA, the first external memory algorithm to construct both generalized suffix and LCP arrays for sets of strings. Our algorithm relies on a combination of buffers, induced sorting and a heap. Performance tests with real DNA sequence sets of size up to 8.5 GB showed that eGSA can indeed be applied to sets of large sequences with efficient running time on a low-cost machine. Compared with eSAIS, the algorithm whose purpose most closely resembles that of eGSA, eGSA reduced the time spent constructing the arrays by a factor of 2.5 to 4.8.
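To make the two structures named above concrete, the following is a minimal in-memory sketch of a generalized suffix array and its LCP array for a small set of strings. It is not eGSA: it sorts all suffixes naively and uses no external memory, buffers, induced sorting or heap. The function names, the (string index, offset) representation, and the rule that equal suffixes are ordered by string index (as if each string ended in its own distinct terminator) are assumptions made for illustration only.

```python
def generalized_suffix_array(strings):
    """Return all non-empty suffixes of all strings, sorted, as (string_index, offset) pairs.

    Assumption: each string is implicitly followed by a distinct terminator that is
    smaller than every character, with the terminator of string i smaller than that
    of string i + 1. Sorting by (suffix text, string index) reproduces that order.
    """
    pairs = [(i, j) for i, s in enumerate(strings) for j in range(len(s))]
    # Naive comparison sort; fine for a toy example, far too slow for real data.
    pairs.sort(key=lambda p: (strings[p[0]][p[1]:], p[0]))
    return pairs


def lcp_array(strings, gsa):
    """lcp[k] is the length of the longest common prefix of suffixes gsa[k-1] and gsa[k]."""
    lcp = [0] * len(gsa)
    for k in range(1, len(gsa)):
        (i1, j1), (i2, j2) = gsa[k - 1], gsa[k]
        a, b = strings[i1][j1:], strings[i2][j2:]
        n = 0
        # Count matching characters; the implicit terminators never match each other.
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        lcp[k] = n
    return lcp


if __name__ == "__main__":
    texts = ["banana", "ana"]
    gsa = generalized_suffix_array(texts)
    lcp = lcp_array(texts, gsa)
    for (i, j), l in zip(gsa, lcp):
        print(i, j, l, texts[i][j:])
```

In this toy run, adjacent entries that come from different strings and share a long prefix (here the two suffixes "ana") show the kind of cross-string information that the generalized arrays expose.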
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Louza, F.A., Telles, G.P., Ciferri, C.D.D.A. (2013). External Memory Generalized Suffix and LCP Arrays Construction. In: Fischer, J., Sanders, P. (eds) Combinatorial Pattern Matching. CPM 2013. Lecture Notes in Computer Science, vol 7922. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38905-4_20
DOI: https://doi.org/10.1007/978-3-642-38905-4_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-38904-7
Online ISBN: 978-3-642-38905-4
eBook Packages: Computer Science (R0)