The Enhanced Suffix Array and Its Applications to Genome Analysis

  • Mohamed Ibrahim Abouelhoda
  • Stefan Kurtz
  • Enno Ohlebusch
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2452)

Abstract

In large-scale applications such as computational genome analysis, the space requirement of the suffix tree is a severe drawback. In this paper, we present a uniform framework that enables us to systematically replace every string processing algorithm that is based on a bottom-up traversal of a suffix tree by a corresponding algorithm based on an enhanced suffix array (a suffix array enhanced with the lcp-table). In this framework, we show how maximal, supermaximal, and tandem repeats, as well as maximal unique matches, can be efficiently computed. Because enhanced suffix arrays require much less space than suffix trees, very large genomes can now be indexed and analyzed, a task that was not feasible before. Experimental results demonstrate that our programs require not only less space but also much less time than other programs developed for the same tasks.
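
As a rough illustration of the data structure named in the abstract, the following Python sketch builds a suffix array, derives its lcp-table, and enumerates lcp-intervals bottom-up with a stack, which plays the role of a bottom-up suffix tree traversal. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (suffix_array, lcp_table, lcp_intervals) are hypothetical, the suffix array is built by naive sorting for brevity, and the lcp computation follows the general idea of the linear-time method of Kasai et al.

```python
def suffix_array(text):
    """Naive construction (assumption: chosen for brevity, not efficiency):
    sort suffix start positions by the suffixes they denote."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_table(text, sa):
    """lcp[i] = length of the longest common prefix of the suffixes
    starting at sa[i-1] and sa[i]; lcp[0] is 0 by convention."""
    n = len(text)
    rank = [0] * n
    for i, start in enumerate(sa):
        rank[start] = i
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]  # suffix preceding suffix i in sorted order
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1  # the next suffix shares at least h - 1 characters
    return lcp


def lcp_intervals(lcp):
    """Report (lcp_value, left, right) intervals over the suffix array,
    children before their parents (bottom-up order)."""
    n = len(lcp)
    result = []
    stack = [(0, 0)]  # (lcp value, left boundary); the root interval
    for i in range(1, n + 1):
        cur = lcp[i] if i < n else -1  # sentinel flushes the stack at the end
        left = i - 1
        while stack and cur < stack[-1][0]:
            value, left = stack.pop()
            result.append((value, left, i - 1))
        if not stack or cur > stack[-1][0]:
            stack.append((cur, left))
    return result


text = "banana"
sa = suffix_array(text)
lcp = lcp_table(text, sa)
intervals = lcp_intervals(lcp)
```

Each reported interval corresponds to an internal node of the suffix tree: for "banana", the interval with lcp value 3 covers the two suffixes that share the prefix "ana". Repeat-finding algorithms of the kind listed in the abstract would attach their own processing at the point where an interval is reported.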

Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • Mohamed Ibrahim Abouelhoda (1)
  • Stefan Kurtz (1)
  • Enno Ohlebusch (1)
  1. Faculty of Technology, University of Bielefeld, Bielefeld, Germany
