The Enhanced Suffix Array and Its Applications to Genome Analysis
In large-scale applications such as computational genome analysis, the space requirement of the suffix tree is a severe drawback. In this paper, we present a uniform framework that enables us to systematically replace every string processing algorithm based on a bottom-up traversal of a suffix tree by a corresponding algorithm based on an enhanced suffix array (a suffix array enhanced with the lcp-table). Within this framework, we show how maximal, supermaximal, and tandem repeats, as well as maximal unique matches, can be computed efficiently. Because enhanced suffix arrays require much less space than suffix trees, very large genomes can now be indexed and analyzed, a task that was not feasible before. Experimental results demonstrate that our programs require not only less space but also much less time than other programs developed for the same tasks.
Keywords: Tandem Repeat, Main Memory, Space Requirement, Linear Time Algorithm, Suffix Tree
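To make the idea of an enhanced suffix array concrete, the following is a minimal Python sketch, not the implementation evaluated in the paper. It builds a suffix array by naive sorting (an assumption made for brevity; it is not a linear-time construction), computes the lcp-table with the linear-time method of Kasai et al., and then emulates a bottom-up suffix tree traversal with a stack over the lcp-table. The function names `suffix_array`, `lcp_table`, `lcp_intervals`, and `repeated_substrings` are illustrative choices, not names taken from the paper.

```python
def suffix_array(s):
    # Naive construction: sort suffix start positions by the suffix text.
    # Adequate for a small illustration, far too slow for genomes.
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_table(s, sa):
    # Kasai et al.: lcp[i] is the length of the longest common prefix of
    # the suffixes starting at sa[i-1] and sa[i]; lcp[0] is set to 0.
    n = len(s)
    rank = [0] * n
    for i, p in enumerate(sa):
        rank[p] = i
    lcp = [0] * n
    h = 0
    for p in range(n):
        if rank[p] > 0:
            q = sa[rank[p] - 1]
            while p + h < n and q + h < n and s[p + h] == s[q + h]:
                h += 1
            lcp[rank[p]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


def lcp_intervals(lcp):
    # Report intervals (lcp_value, left, right) of the suffix array in
    # bottom-up order: children are reported before their parent, which
    # mirrors a bottom-up traversal of the suffix tree's internal nodes.
    n = len(lcp)
    intervals = []
    stack = [(0, 0)]  # (lcp value, left boundary); the root interval
    for i in range(1, n + 1):
        cur = lcp[i] if i < n else -1  # -1 flushes the stack at the end
        lb = i - 1
        while stack and cur < stack[-1][0]:
            value, lb = stack.pop()
            intervals.append((value, lb, i - 1))
        if i < n and cur > stack[-1][0]:
            stack.append((cur, lb))
    return intervals


def repeated_substrings(s):
    # Map each repeated substring that corresponds to an lcp-interval
    # to the sorted list of its start positions in s.
    sa = suffix_array(s)
    result = {}
    for value, lb, rb in lcp_intervals(lcp_table(s, sa)):
        if value > 0:
            result[s[sa[lb]:sa[lb] + value]] = sorted(sa[lb:rb + 1])
    return result
```

For the string `banana`, this sketch yields the suffix array `[5, 3, 1, 0, 4, 2]` and the lcp-table `[0, 1, 3, 0, 0, 2]`. Only the two integer tables and the text itself are needed during the traversal, which is the source of the space savings over an explicit suffix tree.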