Abstract
A suffix array is a data structure that, together with the LCP array, allows many string processing problems to be solved efficiently. In this article we introduce eGSA, the first external memory algorithm to construct both generalized suffix and LCP arrays for sets of strings. Our algorithm relies on a combination of buffers, induced sorting and a heap. Performance tests with real DNA sequence sets of up to 8.5 GB showed that eGSA can be applied to sets of large sequences with efficient running time on a low-cost machine. Compared to eSAIS, the algorithm whose purpose most closely resembles that of eGSA, eGSA reduced the time spent constructing the arrays by a factor of 2.5 to 4.8.
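To make the objects named in the abstract concrete, the following is a minimal in-memory sketch of a generalized suffix array and its LCP array for a small set of strings. It is not eGSA: it uses naive sorting and direct string comparison instead of buffers and induced sorting, and it keeps everything in RAM. It only illustrates the idea of merging per-string suffix arrays with a heap. The function names (`suffix_array`, `common_prefix_len`, `generalized_sa_lcp`) and the tie-breaking rule (equal suffixes ordered by string index) are assumptions made for this illustration.

```python
import heapq


def suffix_array(s):
    """Naive suffix array of one string: suffix start positions in sorted order."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def generalized_sa_lcp(strings):
    """Merge the per-string suffix arrays with a heap.

    Returns (gsa, lcp): gsa[i] is a (string_index, suffix_start) pair and
    lcp[i] is the common prefix length of suffixes gsa[i-1] and gsa[i]
    (lcp[0] is 0). Equal suffixes are ordered by string index (assumption).
    """
    arrays = [suffix_array(s) for s in strings]

    # One heap entry per string: (suffix text, string index, rank in its array).
    heap = [(strings[k][sa[0]:], k, 0) for k, sa in enumerate(arrays) if sa]
    heapq.heapify(heap)

    gsa, lcp = [], []
    prev = None
    while heap:
        suffix, k, rank = heapq.heappop(heap)
        gsa.append((k, arrays[k][rank]))
        lcp.append(0 if prev is None else common_prefix_len(prev, suffix))
        prev = suffix
        # Refill the heap with the next suffix of the same string, if any.
        if rank + 1 < len(arrays[k]):
            nxt = arrays[k][rank + 1]
            heapq.heappush(heap, (strings[k][nxt:], k, rank + 1))
    return gsa, lcp


if __name__ == "__main__":
    gsa, lcp = generalized_sa_lcp(["banana", "ana"])
    for (k, pos), l in zip(gsa, lcp):
        print(k, pos, l, ["banana", "ana"][k][pos:])
```

The heap holds only one candidate suffix per input string at a time, which is the property that makes a merge of this kind attractive when the per-string arrays live on disk. An external memory algorithm would additionally need to avoid materializing whole suffixes as this sketch does.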
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Louza, F.A., Telles, G.P., Ciferri, C.D.D.A. (2013). External Memory Generalized Suffix and LCP Arrays Construction. In: Fischer, J., Sanders, P. (eds) Combinatorial Pattern Matching. CPM 2013. Lecture Notes in Computer Science, vol 7922. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38905-4_20
DOI: https://doi.org/10.1007/978-3-642-38905-4_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-38904-7
Online ISBN: 978-3-642-38905-4
eBook Packages: Computer Science (R0)