Abstract
In large scale applications as computational genome analysis, the space requirement of the suffix tree is a severe drawback. In this paper, we present a uniform framework that enables us to systematically replace every string processing algorithm that is based on a bottomup traversal of a suffix tree by a corresponding algorithm based on an enhanced suffix array (a suffix array enhanced with the lcp-table). In this framework, we will show how maximal, supermaximal, and tandem repeats, as well as maximal unique matches can be efficiently computed. Because enhanced suffix arrays require much less space than suffix trees, very large genomes can now be indexed and analyzed, a task which was not feasible before. Experimental results demonstrate that our programs require not only less space but also much less time than other programs developed for the same tasks.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
M.I. Abouelhoda, E. Ohlebusch, and S. Kurtz. Optimal Exact String Matching Based on Suffix Arrays. In Proceedings of the Ninth International Symposium on String Processing and Information Retrieval. Springer-Verlag, Lecture Notes in Computer Science, 2002.
A. Apostolico. The Myriad Virtues of Subword Trees. In Combinatorial Algorithms on Words, Springer-Verlag, pages 85–96, 1985.
J. Bentley and R. Sedgewick. Fast Algorithms for Sorting and Searching Strings. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, pages 360–369, 1997.
M. Burrows and D.J. Wheeler. A Block-Sorting Lossless Data Compression Algorithm. Research Report 124, Digital Systems Research Center, 1994.
A.L. Delcher, S. Kasif, R.D. Fleischmann, J. Peterson, O. White, and S.L. Salzberg. Alignment of Whole Genomes. Nucleic Acids Res., 27:2369–2376, 1999.
J. A. Eisen, J. F. Heidelberg, O. White, and S.L. Salzberg. Evidence for Symmetric Chromosomal Inversions Around the Replication Origin in Bacteria. Genome Biology, 1(6):1–9, 2000.
D. Gusfield. Algorithms on Strings, Trees, and Sequences. Cambridge University Press, New York, 1997.
D. Gusfield and J. Stoye. Linear Time Algorithms for Finding and Representing all the Tandem Repeats in a String. Report CSE-98-4, Computer Science Division, University of California, Davis, 1998.
T. Kasai, G. Lee, H. Arimura, S. Arikawa, and K. Park. Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and its Applications. In Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching, pages 181–192. Lecture Notes in Computer Science 2089, Springer-Verlag, 2001.
J. Knight, D. Gusfield, and J. Stoye. The Strmat Software-Package, 1998. http://www.cs.ucdavis.edu/ gus.eld/strmat.tar.gz.
R. Kolpakov and G. Kucherov. Finding Maximal Repetitions in a Word in Linear Time. In Symposium on Foundations of Computer Science, pages 596–604. IEEE Computer Society, 1999.
S. Kurtz. Reducing the Space Requirement of Suffix Trees. Software—Practice and Experience, 29(13):1149–1171, 1999.
S. Kurtz, J.V. Choudhuri, E. Ohlebusch, C. Schleiermacher, J. Stoye, and R. Giegerich. REPuter: The Manifold Applications of Repeat Analysis on a Genomic Scale. Nucleic Acids Res., 29(22):4633–4642, 2001.
E.S. Lander, L.M. Linton, B. Birren, C. Nusbaum, M.C. Zody, J. Baldwin, K. Devon, and K. Dewar, et. al. Initial Sequencing and Analysis of the Human Genome. Nature, 409:860–921, 2001.
N.J. Larsson and K. Sadakane. Faster Suffix Sorting. Technical Report LU-CSTR: 99-214, Dept. of Computer Science, Lund University, 1999.
U. Manber and E.W. Myers. Suffix Arrays: A New Method for On-Line String Searches. SIAM Journal on Computing, 22(5):935–948, 1993.
C. O’Keefe and E. Eichler. The Pathological Consequences and Evolutionary Implications of Recent Human Genomic Duplications. In Comparative Genomics, pages 29–46. Kluwer Press, 2000.
J. Stoye and D. Gusffield. Simple and Flexible Detection of Contiguous Repeats Using a Suffix Tree. Theoretical Computer Science, 270(1–2):843–856, 2002.
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Abouelhoda, M.I., Kurtz, S., Ohlebusch, E. (2002). The Enhanced Suffix Array and Its Applications to Genome Analysis. In: Guigó, R., Gusfield, D. (eds) Algorithms in Bioinformatics. WABI 2002. Lecture Notes in Computer Science, vol 2452. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45784-4_35
Download citation
DOI: https://doi.org/10.1007/3-540-45784-4_35
Published:
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44211-0
Online ISBN: 978-3-540-45784-8
eBook Packages: Springer Book Archive